AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Different results between CPU and GPU (Convolutional layer has a problem when dilation > 1 and CUDNN is not active) #4836

mrhosseini commented 4 years ago

Hi,

Using this network.cfg.txt, the detection results of the CUDA-enabled build of this repository differ from those of the build without CUDA. The CUDA-enabled build produces better results. With Tiny YOLOv3, there is no difference.

AlexeyAB commented 4 years ago

What params did you use in the Makefile in both cases?

mrhosseini commented 4 years ago

Without CUDA:

GPU=0
CUDNN=0
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0

With CUDA:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=0

mrhosseini commented 4 years ago

@AlexeyAB

Could you please point me in the right direction for solving this problem? I am not sure where to start or what to check.

AlexeyAB commented 4 years ago

Different results between CPU and GPU

Show the difference.

Maybe there is a bug in [reorg3d], since it isn't well tested. I think I should rewrite this layer from scratch. Try to find the error, or try to avoid [reorg3d]; does that help?

mrhosseini commented 4 years ago

@AlexeyAB Thanks for your help. I tested the [reorg3d] layer independently with some manual input, but I didn't see any difference between the CPU and GPU outputs of this layer. Next I added some code to forward_network() and forward_network_gpu() to calculate the total sum of the output elements in each layer and compared them.
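A minimal sketch of that instrumentation (print_layer_sum() is my own helper, not darknet code; it assumes darknet's layer fields output, output_gpu, outputs, and batch, and darknet's cuda_pull_array() for the device-to-host copy):

```c
#include <stdio.h>
#include "darknet.h"   // layer struct; cuda_pull_array() on GPU builds

// Print the total sum of a layer's output after its forward pass.
// Called for each layer i from forward_network() / forward_network_gpu().
static void print_layer_sum(layer l, int i, int from_gpu)
{
#ifdef GPU
    // GPU build: copy the device output back into the host buffer first.
    if (from_gpu) cuda_pull_array(l.output_gpu, l.output, l.outputs * l.batch);
#endif
    double sum = 0;
    for (int j = 0; j < l.outputs * l.batch; ++j) sum += l.output[j];
    printf("sum[%d] %f\n", i, sum);
}
```

Here is an example of the resulting sums for my .cfg file: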

sum[layer] CPU GPU CPU − GPU
sum[0] 613456.688756 613456.68694 0.001816
sum[1] 351479.923 351487.230334 -7.307334
sum[2] 191303.358827 191302.797801 0.561026
sum[3] 393020.180122 393037.718248 -17.538126
sum[4] 199410.899027 199409.187487 1.711540
sum[5] 105360.992388 105360.366701 0.625687
sum[6] 205008.054257 205008.69806 -0.643803
sum[7] 94430.198173 94430.778119 -0.579946
sum[8] 205949.192374 205946.756022 2.436352
sum[9] 97521.177134 97517.417314 3.759820
sum[10] 206803.803461 213020.145404 -6216.341943
sum[11] 97317.056703 110235.342102 -12918.285399
sum[12] 194838.233836 207752.759416 -12914.525580
sum[13] 89708.116007 99181.952804 -9473.836797
sum[14] 201686.877354 228163.717362 -26476.840008
sum[15] 92507.410713 105634.985002 -13127.574289
sum[16] 182215.526719 204816.937806 -22601.411087
sum[17] 103781.770411 103379.089134 402.681277
sum[18] 187229.29314 196699.370117 -9470.076977
sum[19] 87579.457722 91369.179553 -3789.721831
sum[20] 191361.228132 194748.268687 -3387.040555
sum[21] 100206.415242 104301.907502 -4095.492260
sum[22] 50600.791245 52192.28841 -1591.497165
sum[23] 109251.02559 111743.667888 -2492.642298
sum[24] 64515.661368 61162.404301 3353.257067
sum[25] 143081.962076 127353.707183 15728.254893
sum[26] 205949.192374 205946.756022 2.436352
sum[27] 44364.052057 44367.559203 -3.507146
sum[28] 44364.052057 44367.559203 -3.507146
sum[29] 187446.014133 171721.266386 15724.747747
sum[30] 121096.92386 120751.287471 345.636389
sum[31] 293955.45623 287118.418319 6837.037911
sum[32] -137878.726125 -132878.892983 -4999.833142
sum[33] 11432.234913 12133.453549 -701.218636
sum[34] 351479.923 351487.230334 -7.307334
sum[35] 168728.649886 168732.069919 -3.420033
sum[36] 196935.044232 196928.892621 6.151611
sum[37] 97969.597906 97965.898785 3.699121
sum[38] 97740.764431 97741.244955 -0.480524
sum[39] 54131.199064 54130.449057 0.750007
sum[40] 41354.888962 41360.449196 -5.560234
sum[41] 27076.740513 27078.459734 -1.719221
sum[42] 28407.017031 28405.41383 1.603201
sum[43] 64281.085072 64278.722779 2.362293
sum[44] -6049.092757 -6048.547794 -0.544963
sum[45] 1011.51904 1011.534915 -0.015875

As can be seen, there is a significant difference between the CPU and GPU outputs starting at layer 10. According to the .cfg file, layer 10 is a convolutional layer with dilated convolution. I ran the same test with Tiny YOLOv3 (which does not have any dilated convolutions), and the CPU/GPU output difference for every layer was smaller than 16. Therefore the problem must be in the calculation of the dilated convolution. I am currently trying to find and fix it, but without success so far.
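For context, the dilated convolutional layers in my .cfg have this general form (illustrative values, not copied from the actual layer 10):

```
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
dilation=2
activation=leaky
```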

mrhosseini commented 4 years ago

@AlexeyAB Well, I think I have found the problem, but I don't know how to fix it. Checking forward_convolutional_layer() and forward_convolutional_layer_gpu(), I noticed that their functionality is similar except when CUDNN is active. I recompiled the code with GPU=1 but CUDNN=0 and CUDNN_HALF=0 and repeated the test above. Without CUDNN, the CPU and GPU results are almost identical, and the problem does not exist when dilation = 1 in a convolutional layer. When CUDNN is active, cudnnSetConvolution2dDescriptor() and cudnnConvolutionForward() are used; when it is not, the only function that uses the dilation is im2col_cpu_ext() for the CPU or im2col_gpu_ext() for the GPU. I guess that the implementation of dilation in these two functions (which are similar to each other and based on BVLC/caffe) differs from what is implemented in cuDNN, but I don't know how to make them match.
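In the caffe-style im2col that these functions are based on, dilation spaces the kernel taps apart while the padding offsets each tap's input coordinate. A minimal sketch of that indexing, with illustrative names rather than darknet's exact code:

```c
// The input pixel that im2col reads for kernel tap (kh, kw) at output
// position (out_row, out_col); taps outside the image read as zero
// (implicit zero padding).
static float im2col_tap(const float *im, int height, int width, int chan,
                        int out_row, int out_col, int kh, int kw,
                        int stride_h, int stride_w, int pad_h, int pad_w,
                        int dilation_h, int dilation_w)
{
    int in_row = out_row * stride_h - pad_h + kh * dilation_h;
    int in_col = out_col * stride_w - pad_w + kw * dilation_w;
    if (in_row < 0 || in_row >= height || in_col < 0 || in_col >= width)
        return 0.f;
    return im[(chan * height + in_row) * width + in_col];
}
```

If this path and cuDNN disagree on pad_h/pad_w, every tap shifts relative to cuDNN's, which would explain the diverging layer sums above.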

I should note that I have a weights file obtained from training with CUDNN. With this weights file, some objects are detected in sample images when CUDNN is active; when CUDNN is not active, however, nothing is detected in the same images.

mrhosseini commented 4 years ago

It seems that the implementation of the convolutional layer is incorrect when dilation > 1 and CUDNN is not active. Here are the results of training the network on the GPU with and without CUDNN enabled:

GPU = 1, CUDNN = 1, CUDNN_HALF = 1: [training chart: chart_dilation_cuDNN]


GPU = 1, CUDNN = 0, CUDNN_HALF = 0: [training chart: chart_dilation_gpu]

AlexeyAB commented 4 years ago

Do you mean that for dilation=2?

  1. GPU=0 CUDNN=0 - not working properly
  2. GPU=1 CUDNN=0 - not working properly
  3. GPU=1 CUDNN=1 - working properly

mrhosseini commented 4 years ago

Do you mean that for dilation=2?

1. `GPU=0 CUDNN=0` - not working properly

2. `GPU=1 CUDNN=0` - not working properly

3. `GPU=1 CUDNN=1` - working properly

Yes, exactly.

mrhosseini commented 4 years ago

Finally, I was able to fix the problem.

When CUDNN is active, cudnnSetConvolution2dDescriptor() is called in convolutional_layer.c. The second and third arguments of this function are pad_h and pad_w:

cudnnStatus_t cudnnSetConvolution2dDescriptor(
    cudnnConvolutionDescriptor_t    convDesc,
    int                             pad_h,
    int                             pad_w,
    int                             u,
    int                             v,
    int                             dilation_h,
    int                             dilation_w,
    cudnnConvolutionMode_t          mode,
    cudnnDataType_t                 computeType)

and the function is called in convolutional_layer.c like this:

cudnnSetConvolution2dDescriptor(l->convDesc, l->pad * l->dilation, l->pad * l->dilation, l->stride_y, l->stride_x, l->dilation, l->dilation, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT)

On the other side, when CUDNN is not active, im2col_cpu_ext() is called in convolutional_layer.c for the CPU, and im2col_gpu_ext() is called in a similar way in convolutional_kernels.cu for the GPU:

im2col_cpu_ext(im,                       // input
               l.c / l.groups,           // input channels
               l.h, l.w,                 // input size (h, w)
               l.size, l.size,           // kernel size (h, w)
               l.pad, l.pad,             // padding (h, w)
               l.stride_y, l.stride_x,   // stride (h, w)
               l.dilation, l.dilation,   // dilation (h, w)
               state.workspace);         // output (column buffer)

As can be seen, the padding arguments of cudnnSetConvolution2dDescriptor() are set to l->pad * l->dilation, while the padding arguments of im2col_cpu_ext() and im2col_gpu_ext() are set to l.pad.

As a result, changing the padding arguments of im2col_cpu_ext() and im2col_gpu_ext() from l.pad to l.pad * l.dilation solved the problem, and now the results of the CPU, the GPU without CUDNN, and the GPU with CUDNN are the same.
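Concretely, the corrected CPU call looks like this (the im2col_gpu_ext() call in convolutional_kernels.cu changes in the same way):

```c
im2col_cpu_ext(im,                       // input
               l.c / l.groups,           // input channels
               l.h, l.w,                 // input size (h, w)
               l.size, l.size,           // kernel size (h, w)
               l.pad * l.dilation,       // padding h, scaled by dilation
               l.pad * l.dilation,       // padding w, scaled by dilation
               l.stride_y, l.stride_x,   // stride (h, w)
               l.dilation, l.dilation,   // dilation (h, w)
               state.workspace);         // output (column buffer)
```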

However, I don't know why the padding arguments of cudnnSetConvolution2dDescriptor() are l->pad * l->dilation, or which of l->pad * l->dilation and l->pad is correct.
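One way to sanity-check the two candidates is the standard output-size formula for a dilated convolution; conv_out_size() below is my own illustration, not a darknet function:

```c
// out = (in + 2*pad - extent) / stride + 1, where a kernel of size k with
// dilation d has effective extent d*(k-1) + 1. For size=3, stride=1: pad=1
// keeps the spatial size only when dilation=1; with dilation=2 the extent
// is 5, so a same-size output needs pad=2, i.e. pad * dilation.
static int conv_out_size(int in, int kernel, int stride, int pad, int dilation)
{
    int extent = dilation * (kernel - 1) + 1;  // effective kernel extent
    return (in + 2 * pad - extent) / stride + 1;
}
```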

Also, I'm not sure whether a similar change should be made to im2col_cpu_ext() and col2im_cpu_ext() in backward_convolutional_layer().

I will open a pull request once I am sure the changes are correct. Thanks for your help @AlexeyAB.

AlexeyAB commented 4 years ago

As a result, changing the padding arguments of im2col_cpu_ext() and im2col_gpu_ext() from l.pad to l.pad * l.dilation solved the problem, and now the results of the CPU, the GPU without CUDNN, and the GPU with CUDNN are the same.

Thanks! l->pad * l->dilation is correct.

Also, I'm not sure whether a similar change should be made to im2col_cpu_ext() and col2im_cpu_ext() in backward_convolutional_layer().

Yes, l->pad * l->dilation should be used there too.
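By analogy with the forward fix, the backward-pass calls scale the padding the same way. A sketch based on the forward call shown above (the exact argument lists in backward_convolutional_layer() are not shown in this thread, so treat this as illustrative):

```c
// Both the im2col_cpu_ext() call (weight gradient) and the col2im_cpu_ext()
// call (input gradient) take the same padding arguments, so both change
// from l.pad to l.pad * l.dilation.
col2im_cpu_ext(state.workspace,          // column buffer of gradients
               l.c / l.groups,           // input channels
               l.h, l.w,                 // input size (h, w)
               l.size, l.size,           // kernel size (h, w)
               l.pad * l.dilation,       // padding h, scaled by dilation
               l.pad * l.dilation,       // padding w, scaled by dilation
               l.stride_y, l.stride_x,   // stride (h, w)
               l.dilation, l.dilation,   // dilation (h, w)
               state.delta);             // output: gradient w.r.t. the input
```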

You can make a Pull Request with these changes to this repo. Or I will fix it a bit later.