doonny / PipeCNN

An OpenCL-based FPGA Accelerator for Convolutional Neural Networks
Apache License 2.0

tiny-YOLO Implementation #61

Open META-DREAMER opened 6 years ago

META-DREAMER commented 6 years ago

I am working on implementing tiny-YOLO using PipeCNN, was just looking for some advice and guidance for the best way to do it and the steps I should take.

I'm going to convert the tiny-YOLO weights file from darknet -> caffe and then use MATLAB fixed point toolbox to convert that to fixed-point weights that PipeCNN will use.

For updating the layer_config, how should I do this? What exactly is the format of the layer_config?

I will also be updating main.cpp to work with webcam feed.

Is there anything else here that I missed? What else will I need to do to get tiny-YOLO running?

SmartRoof commented 6 years ago

I am also working on implementing tiny-yolo-voc on my DE1-SoC.

META-DREAMER commented 6 years ago

@SmartRoof Have you made any progress?

zhao-lun commented 6 years ago

@hammadj We have it running, but it's slower than we expected.

META-DREAMER commented 6 years ago

@johnnydept How did you set up layer_config.h? And what did you do for your weights file? I converted tiny-yolo-voc.(cfg/weights) to Caffe caffemodel and prototxt files, merged the batch-norm layers into the conv layers, and finally used the MATLAB script to convert the result into a weights.dat file. Did you do the same? Also, how is the performance, and what board are you using?

META-DREAMER commented 6 years ago

Here is where I am so far for the layer_config. Does this look okay?

// TINY YOLO CONFIGURATION
unsigned layer_config[][NUM_CONFIG_ITEM] = {
    { // Layer1
        // layer_type (conv = 0, fc = 1)
        0, 
        //data_w, data_h, data_n, weight_w, weight_h, weight_n, weight_m, bias_size
        416, 416, 3, 3, 3, 3, 16, 16,
        // memrd_src (0-> data_buf, 1-> output_buf)
        0,
        // conv_x, conv_y, conv_z, conv_stride, conv_padding, conv_split, conv_relu
        416, 416, 16, 1, 1, 1, 1,
        // pool_on, pool_x, pool_y, pool_z, pool_size, pool_stride,
        1, 208, 208, 16, 2, 2,
        // lrn control (on = 1, off = 0)
        0,
        // memwr_dst (0-> data_buf, 1-> output_buf)
        1
    },
    { // Layer 2
        0,
        208, 208, 16, 3, 3, 8, 32, 32,
        1,
        208, 208, 32, 1, 1, 1, 1,
        1, 104, 104, 32, 2, 2,
        0,
        0
    },
    { // Layer 3
        0,
        104, 104, 32, 3, 3, 8, 64, 64,
        0,
        104, 104, 64, 1, 1, 1, 1,
        1, 52, 52, 64, 2, 2,
        0,
        1
    },
    { // Layer 4
        0,
        52, 52, 64, 3, 3, 8, 128, 128,
        1,
        52, 52, 128, 1, 1, 1, 1,
        1, 26, 26, 128, 2, 2,
        0,
        0
    },
    { // Layer 5
        0,
        26, 26, 128, 3, 3, 8, 256, 256,
        0,
        26, 26, 256, 1, 1, 1, 1,
        1, 13, 13, 256, 2, 2,
        0,
        1
    },
    { // Layer 6
        0,
        13, 13, 256, 3, 3, 8, 512, 512,
        1,
        13, 13, 512, 1, 1, 1, 1,
        1, 13, 13, 512, 2, 1,
        0,
        0
    },
    { // Layer 7
        0,
        13, 13, 512, 3, 3, 8, 1024, 1024,
        0,
        13, 13, 1024, 1, 1, 1, 1,
        0, 13, 13, 1024, 2, 1,
        0,
        1
    },
    { // Layer 8
        0,
        13, 13, 1024, 3, 3, 8, 1024, 1024,
        1,
        13, 13, 1024, 1, 1, 1, 1,
        0, 13, 13, 1024, 2, 1,
        0,
        0
    },
    { // Layer 9
        0,
        13, 13, 1024, 1, 1, 8, 125, 125,
        0,
        13, 13, 125, 1, 0, 1, 0,
        0, 13, 13, 125, 2, 1,
        0,
        1
    },
};

signed char precision_config[][3] = {
    {8,  0, -4}, // Layer-1
    {8,  0, -2}, // Layer-2
    {8,  0, -1}, // Layer-3
    {8, -1, -1}, // Layer-4
    {8, -1, -1}, // Layer-5
    {8, -1,  0}, // Layer-6
    {8,  0,  2}, // Layer-7
    {8,  2,  2}, // Layer-8
    {8,  2,  2}  // Layer-9
};

unsigned input_config[4] = {416, 416, 3, 1}; //original image size(dim1, dim2, dim3), batch size

unsigned output_config[3] = {13, 13, 125}; // Layer-9 output. Note: only one result is extracted and verified
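As a sanity check on the sizes above, here is a minimal Python sketch that recomputes each layer's convolution and pooling output width from the standard no-dilation formulas. The helper names (`conv_out`, `pool_out`, `layer_outputs`) are hypothetical, and it assumes pooling is applied without padding; it is not PipeCNN's own validation code.

```python
# Hypothetical helpers; standard output-size formulas, no dilation.
def conv_out(size, kernel, stride, pad):
    """Output width of a convolution with symmetric zero padding."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, window, stride):
    """Output width of a pooling window applied without padding (assumption)."""
    return (size - window) // stride + 1


# (data_w, weight_w, conv_stride, conv_padding, pool_on, pool_size, pool_stride)
# copied from the layer_config posted above (widths only; heights are equal).
layers = [
    (416, 3, 1, 1, 1, 2, 2),
    (208, 3, 1, 1, 1, 2, 2),
    (104, 3, 1, 1, 1, 2, 2),
    (52, 3, 1, 1, 1, 2, 2),
    (26, 3, 1, 1, 1, 2, 2),
    (13, 3, 1, 1, 1, 2, 1),
    (13, 3, 1, 1, 0, 2, 1),
    (13, 3, 1, 1, 0, 2, 1),
    (13, 1, 1, 0, 0, 2, 1),
]


def layer_outputs(layers):
    """Return (conv_width, final_width) for every layer."""
    results = []
    for data_w, k, s, p, pool_on, pw, ps in layers:
        c = conv_out(data_w, k, s, p)
        results.append((c, pool_out(c, pw, ps) if pool_on else c))
    return results


for i, (c, o) in enumerate(layer_outputs(layers), start=1):
    print(f"layer {i}: conv width {c}, output width {o}")
```

Under these formulas layer 6 pools 13 down to 12, while the configuration lists pool_x = pool_y = 13, which is consistent with the layer-6 pooling error reported in this thread.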

I've been getting errors about the pooling on layer 6 (Error: incorrect setting of pooling input/output size for layer-6!!!). If I disable pooling on layer 6 it starts running, but then hangs while Launching kernel MemWr with local size.... I am testing this in sw_emu, by the way.

Here's my setup in main.cpp:

#define IMAGE_FILE_SIZE   (416*416*3)
#define WEIGHTS_FILE_SIZE 15730592
#define LAYER_NUM         9
#define CONV_NUM          9
const char *weight_file_path = "./data/yolo/weights.dat";
const char *input_file_path = "./data/yolo/dog.dat";
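A quick way to cross-check WEIGHTS_FILE_SIZE is to add up the parameter counts implied by the layer shapes. The Python sketch below assumes each weight and bias is stored as one signed byte with no channel padding (an assumption about the file layout, not something PipeCNN's documentation is quoted for here); `shapes`, `param_bytes`, and `total_bytes` are hypothetical names.

```python
# (kernel_w, kernel_h, in_channels, out_channels) for each conv layer,
# read off the layer_config posted in this thread.
shapes = [
    (3, 3, 3, 16),
    (3, 3, 16, 32),
    (3, 3, 32, 64),
    (3, 3, 64, 128),
    (3, 3, 128, 256),
    (3, 3, 256, 512),
    (3, 3, 512, 1024),
    (3, 3, 1024, 1024),
    (1, 1, 1024, 125),
]


def param_bytes(shape):
    """Bytes for one layer: all weights plus one bias per output channel.

    Assumes 1 byte per parameter and no padding (labeled assumption).
    """
    kw, kh, cin, cout = shape
    return kw * kh * cin * cout + cout


def total_bytes(shapes):
    return sum(param_bytes(s) for s in shapes)


print("all nine layers:   ", total_bytes(shapes))
print("first eight layers:", total_bytes(shapes[:8]))
```

With these assumptions the nine layers come to 15858717 bytes, whereas 15730592 is the total for the first eight layers only, so it may be worth checking whether the define accounts for layer 9.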

And here is my weights file, the caffe model, the matlab script to generate weights, as well as the input file: tiny-yolo-config.zip

@doonny Do you have any idea where I could be going wrong?

zhao-lun commented 6 years ago

Tiny-YOLO uses SAME padding in its max-pool layers, meaning the stride-1 pool in layer 6 outputs the same size as its input. For that, I think you have to add some padding manually. https://stackoverflow.com/a/48393040/1558037
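To illustrate the SAME-padding point, here is a small pure-Python sketch. It assumes the TensorFlow-style convention (output size is ceil(input / stride), with any extra padding added at the bottom/right); the function names are hypothetical and this is not PipeCNN kernel code.

```python
import math


def same_pad_total(size, window, stride):
    """Return (output_size, total_padding) for SAME-style pooling."""
    out = math.ceil(size / stride)
    pad = max((out - 1) * stride + window - size, 0)
    return out, pad


def maxpool2d_same(fmap, window, stride):
    """Max-pool a 2-D list of floats, padding with -inf at the bottom/right."""
    h, w = len(fmap), len(fmap[0])
    out_h, pad_h = same_pad_total(h, window, stride)
    out_w, pad_w = same_pad_total(w, window, stride)
    neg = float("-inf")
    padded = [row + [neg] * pad_w for row in fmap]
    padded += [[neg] * (w + pad_w) for _ in range(pad_h)]
    return [
        [
            max(
                padded[y * stride + dy][x * stride + dx]
                for dy in range(window)
                for dx in range(window)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# A 13x13 map pooled with window 2, stride 1, as in layer 6.
fmap = [[float(y * 13 + x) for x in range(13)] for y in range(13)]
pooled = maxpool2d_same(fmap, window=2, stride=1)
print(len(pooled), len(pooled[0]))
```

For a 13x13 input with a 2x2 window and stride 1, this needs one extra row and one extra column of padding to keep the output at 13x13; without padding the output would be 12x12.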

META-DREAMER commented 6 years ago

@johnnydept I'm still having trouble getting it to run. Can you share the layer_config and weights you used?

zhao-lun commented 6 years ago

@hammadj The thing is, I use a floating-point implementation; fixed point, or using the COCO dataset, is next on my work plan. We have it running on a DE1-SoC at 8 s/image.

META-DREAMER commented 6 years ago

@johnnydept Do you know what could be causing a hang? It's stuck on the clWaitForEvents call on layer 1, so the padding on layer 6 shouldn't even matter at this point.

META-DREAMER commented 6 years ago

I debugged a bit more and found where it hangs: it's in the memWrite function in conv_pipe.cl, on the line that says output = read_channel_intel(pool_ch);. It hangs when (x=112, y=61) for some reason.

@aazz44ss Would you have any idea what's wrong here?

META-DREAMER commented 6 years ago

@johnnydept @doonny OK, so I finally got tiny-YOLO running. The problem was that the uchar type used in many places can't hold values above 255, so I switched those to ushort and it's running now.

However, the output I get is not as expected. I've attached the result dump here: result_dump.txt.

I feel like it's because I need to set up the precision config properly for tiny-YOLO. Any idea what the proper precision_config should be for tiny-YOLO? Do I need to change anything in how I am converting the weights? This is my MATLAB script for converting the weights right now:
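To make the 8-bit limit concrete, here is a minimal Python sketch of what an unsigned 8-bit and an unsigned 16-bit variable would hold after being assigned some of the dimensions in this configuration. `as_uchar` and `as_ushort` are hypothetical helpers that mimic the modulo wrap-around of unsigned integer conversion; they are not taken from the PipeCNN sources.

```python
def as_uchar(value):
    """Value held by an unsigned 8-bit variable after assignment (mod 256)."""
    return value & 0xFF


def as_ushort(value):
    """Value held by an unsigned 16-bit variable after assignment (mod 65536)."""
    return value & 0xFFFF


# A few dimensions from the tiny-YOLO layer_config in this thread.
dims = {"data_w": 416, "pool_x": 208, "weight_m": 1024, "last_w": 13}

for name, v in dims.items():
    print(f"{name}: {v} -> uchar {as_uchar(v)}, ushort {as_ushort(v)}")
```

A width of 416 wraps to 160 and a channel count of 1024 wraps to 0 in 8 bits, so a producer and a consumer that disagree on a loop bound could plausibly leave a blocking channel read waiting forever; all of these values fit in 16 bits.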

caffe.set_mode_cpu();

model = './caffe/tiny-yolo-nobn.prototxt';
weights = './caffe/tiny-yolo-nobn.caffemodel';

net = caffe.Net(model, weights, 'test');
netparams = {{net.params('conv1',1).get_data(),net.params('conv1',2).get_data()}, ...
            {net.params('conv2',1).get_data(),net.params('conv2',2).get_data()}, ...
            {net.params('conv3',1).get_data(),net.params('conv3',2).get_data()}, ...
            {net.params('conv4',1).get_data(),net.params('conv4',2).get_data()}, ...
            {net.params('conv5',1).get_data(),net.params('conv5',2).get_data()}, ...
            {net.params('conv6',1).get_data(),net.params('conv6',2).get_data()}, ...
            {net.params('conv7',1).get_data(),net.params('conv7',2).get_data()}, ...
            {net.params('conv8',1).get_data(),net.params('conv8',2).get_data()}, ...
            {net.params('conv9',1).get_data(),net.params('conv9',2).get_data()}};

WeightWidth    = [ 8;  8;  8;  8;  8;  8;  8;  8; 8];
WeightFrac     = [ 8;  8;  8;  8;  8;  8;  8;  8; 8];

MathType   = fimath('RoundingMethod', 'Nearest', 'OverflowAction', 'Saturate', 'ProductMode', 'FullPrecision', 'SumMode', 'FullPrecision');

for i=1:9
    WeightType{i}  = numerictype('Signed',1, 'WordLength', WeightWidth(i), 'FractionLength', WeightFrac(i));
    weight{i}  = fi(netparams{i}{1}, WeightType{i}, MathType);
    bias{i}    = fi(netparams{i}{2}, WeightType{i}, MathType);
end

fid = fopen('weights.dat', 'w');
for i=1:9
    fwrite(fid, storedInteger(weight{i}), 'int8');
    fwrite(fid, storedInteger(bias{i}), 'int8');
end
fclose(fid);
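For reference, the arithmetic behind a signed fixed-point conversion of this kind can be sketched in Python as below. It assumes round-to-nearest with ties toward positive infinity and saturation on overflow, which is how I understand the 'Nearest' and 'Saturate' settings; `quantize` and `dequantize` are hypothetical names, not MATLAB or PipeCNN functions.

```python
import math


def quantize(x, word_length=8, frac_length=8):
    """Stored integer of a signed fixed-point value (nearest, saturating)."""
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    q = math.floor(x * (1 << frac_length) + 0.5)  # ties toward +infinity
    return max(lo, min(hi, q))


def dequantize(q, frac_length=8):
    """Real value represented by a stored integer."""
    return q / (1 << frac_length)


for x in (0.1, 0.75, -0.75):
    q8 = quantize(x, 8, 8)
    q7 = quantize(x, 8, 7)
    print(f"x={x}: frac 8 -> {q8} ({dequantize(q8, 8)}), "
          f"frac 7 -> {q7} ({dequantize(q7, 7)})")
```

With a word length of 8 and a fraction length of 8, the representable range is only [-0.5, 0.49609375], so any weight or bias outside it saturates. It may be worth comparing each layer's actual value range against the chosen fraction length.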

myih commented 6 years ago

@hammadj What do you mean by "merged the batch-norm layers into the conv layers"? Did you write a kernel that does BN after convolution? If so, are you willing to share the code? Thank you.

Thilanka97 commented 6 years ago

Hey @hammadj @johnnydept @doonny, I am also thinking of implementing YOLOv2 or tiny-YOLO on an FPGA using OpenCL, with PipeCNN as a reference. Which files do I need to change for this to work for YOLO? I need to write the kernel files, the layer config file, and the host files according to YOLO, right? Is that all I need to change?

It would be a great help if you could help me with this, because I am still new to OpenCL. Thanks in advance!

sinaasadiyan commented 5 years ago

> @hammadj What do you mean by "merged the batch-norm layers into the conv layers"? Did you write a kernel that does BN after convolution? If so, are you willing to share the code? Thank you.

It refers to a fused batch-norm layer. Once training is finished and the graph is frozen, you can take the normalization mean and the other batch-norm parameters and fuse them into the weights. After that, there is no need to run batch norm during inference.

For more information, see the TF Lite quantization description.
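A minimal Python sketch of that fusion, assuming the usual per-output-channel batch norm y = gamma * (x - mean) / sqrt(var + eps) + beta applied right after the convolution. The function name and the flat per-channel weight layout are hypothetical, and eps must match the framework that trained the model (1e-5 here is only a placeholder).

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into conv weights and biases.

    weights: list over output channels; each entry is a flat list holding
             that channel's kernel values (hypothetical layout).
    bias, gamma, beta, mean, var: one float per output channel.
    """
    folded_w, folded_b = [], []
    for w, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([wi * scale for wi in w])   # w' = w * scale
        folded_b.append((b - m) * scale + bt)       # b' = (b - mean) * scale + beta
    return folded_w, folded_b


# One output channel with a two-element kernel, eps=0 for easy checking.
w, b = [[1.0, 2.0]], [0.5]
fw, fb = fold_batchnorm(w, b, gamma=[4.0], beta=[0.1],
                        mean=[1.0], var=[4.0], eps=0.0)
print(fw, fb)
```

The folded layer computes the same result as convolution followed by batch norm, so the accelerator only needs plain conv weights and biases at inference time.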

tirumalnaidu commented 4 years ago

> I am working on implementing tiny-YOLO using PipeCNN, was just looking for some advice and guidance for the best way to do it and the steps I should take.
>
> I'm going to convert the tiny-YOLO weights file from darknet -> caffe and then use MATLAB fixed point toolbox to convert that to fixed-point weights that PipeCNN will use.
>
> For updating the layer_config, how should I do this? What exactly is the format of the layer_config?
>
> I will also be updating main.cpp to work with webcam feed.
>
> Is there anything else here that I missed? What else will I need to do to get tiny-YOLO running?

We recently developed a CNN accelerator for the darknet reference model, which could help you implement tiny-YOLO. We used a DE10-Nano, based on the Intel Cyclone V SoC FPGA, for the implementation. You can check out the entire design flow for implementing the accelerator, along with the relevant code, in this repository: Link