mrhosseini opened this issue 4 years ago

Hi,

Using this network.cfg.txt, the detection results of the CUDA-enabled build of this repository are different from those of the build without CUDA. The CUDA-enabled build produces better results. With Tiny YOLOv3, there is no difference.
What params did you use in the Makefile in both cases?
Without CUDA:

```
GPU=0
CUDNN=0
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0
```

With CUDA:

```
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=0
```
@AlexeyAB
Could you please give me some direction for solving this problem? I am not sure where to start or what to check.
> Different results between CPU and GPU

Show the difference.
Maybe there is a bug in [reorg3d], since it isn't well tested. I think I should rewrite this layer from scratch. Try to find the error, or try to avoid [reorg3d]; does that help?
@AlexeyAB
Thanks for your help. I tested the [reorg3d] layer independently with some manual input, but I didn't see any difference between the CPU and GPU outputs of this layer. Next, I added some code to `forward_network()` and `forward_network_gpu()` to calculate the total sum of the output elements in each layer and compared them (a sketch of this instrumentation is shown after the table). Here is an example for my .cfg file:
sum[layer] | CPU | GPU | CPU − GPU |
---|---|---|---|
sum[0] | 613456.688756 | 613456.68694 | 0.001816 |
sum[1] | 351479.923 | 351487.230334 | -7.307334 |
sum[2] | 191303.358827 | 191302.797801 | 0.561026 |
sum[3] | 393020.180122 | 393037.718248 | -17.538126 |
sum[4] | 199410.899027 | 199409.187487 | 1.711540 |
sum[5] | 105360.992388 | 105360.366701 | 0.625687 |
sum[6] | 205008.054257 | 205008.69806 | -0.643803 |
sum[7] | 94430.198173 | 94430.778119 | -0.579946 |
sum[8] | 205949.192374 | 205946.756022 | 2.436352 |
sum[9] | 97521.177134 | 97517.417314 | 3.759820 |
sum[10] | 206803.803461 | 213020.145404 | -6216.341943 |
sum[11] | 97317.056703 | 110235.342102 | -12918.285399 |
sum[12] | 194838.233836 | 207752.759416 | -12914.525580 |
sum[13] | 89708.116007 | 99181.952804 | -9473.836797 |
sum[14] | 201686.877354 | 228163.717362 | -26476.840008 |
sum[15] | 92507.410713 | 105634.985002 | -13127.574289 |
sum[16] | 182215.526719 | 204816.937806 | -22601.411087 |
sum[17] | 103781.770411 | 103379.089134 | 402.681277 |
sum[18] | 187229.29314 | 196699.370117 | -9470.076977 |
sum[19] | 87579.457722 | 91369.179553 | -3789.721831 |
sum[20] | 191361.228132 | 194748.268687 | -3387.040555 |
sum[21] | 100206.415242 | 104301.907502 | -4095.492260 |
sum[22] | 50600.791245 | 52192.28841 | -1591.497165 |
sum[23] | 109251.02559 | 111743.667888 | -2492.642298 |
sum[24] | 64515.661368 | 61162.404301 | 3353.257067 |
sum[25] | 143081.962076 | 127353.707183 | 15728.254893 |
sum[26] | 205949.192374 | 205946.756022 | 2.436352 |
sum[27] | 44364.052057 | 44367.559203 | -3.507146 |
sum[28] | 44364.052057 | 44367.559203 | -3.507146 |
sum[29] | 187446.014133 | 171721.266386 | 15724.747747 |
sum[30] | 121096.92386 | 120751.287471 | 345.636389 |
sum[31] | 293955.45623 | 287118.418319 | 6837.037911 |
sum[32] | -137878.726125 | -132878.892983 | -4999.833142 |
sum[33] | 11432.234913 | 12133.453549 | -701.218636 |
sum[34] | 351479.923 | 351487.230334 | -7.307334 |
sum[35] | 168728.649886 | 168732.069919 | -3.420033 |
sum[36] | 196935.044232 | 196928.892621 | 6.151611 |
sum[37] | 97969.597906 | 97965.898785 | 3.699121 |
sum[38] | 97740.764431 | 97741.244955 | -0.480524 |
sum[39] | 54131.199064 | 54130.449057 | 0.750007 |
sum[40] | 41354.888962 | 41360.449196 | -5.560234 |
sum[41] | 27076.740513 | 27078.459734 | -1.719221 |
sum[42] | 28407.017031 | 28405.41383 | 1.603201 |
sum[43] | 64281.085072 | 64278.722779 | 2.362293 |
sum[44] | -6049.092757 | -6048.547794 | -0.544963 |
sum[45] | 1011.51904 | 1011.534915 | -0.015875 |
As can be seen, there is a significant difference between the CPU and GPU outputs starting at layer 10. According to the .cfg file, layer 10 is a convolutional layer that uses dilated convolution. I ran the same test with Tiny YOLOv3 (which does not use dilated convolution) and the output difference between CPU and GPU for every layer was smaller than 16. Therefore the problem must be in the calculation of the dilated convolution. I am currently trying to find and fix it, but without success so far.
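For reference, a minimal sketch of the per-layer checksum instrumentation described above. These lines go inside the existing layer loops; `sum_array()` and `cuda_pull_array()` are existing darknet helpers, but the exact lines shown here are only an illustration.

```c
/* Inside the layer loop of forward_network() in network.c (CPU path),
 * right after l.forward(l, state): sum the layer's output buffer. */
printf("sum[%d] | %f\n", i, sum_array(l.output, l.outputs * l.batch));

/* Inside the layer loop of forward_network_gpu() in network_kernels.cu,
 * right after l.forward_gpu(l, state): copy the GPU output back to the
 * host buffer first, then sum it the same way. */
cuda_pull_array(l.output_gpu, l.output, l.outputs * l.batch);
printf("sum[%d] | %f\n", i, sum_array(l.output, l.outputs * l.batch));
```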
@AlexeyAB
Well, I think I have found the problem, but I don't know how to fix it. Comparing `forward_convolutional_layer()` and `forward_convolutional_layer_gpu()`, I noticed that their functionality is similar except when CUDNN is active. I recompiled the code with `GPU=1` but `CUDNN=0` and `CUDNN_HALF=0` and repeated the test above. Without CUDNN, the CPU and GPU results are almost identical. The problem also does not exist when `dilation=1` in a convolutional layer.
When CUDNN is active, `cudnnSetConvolution2dDescriptor()` and `cudnnConvolutionForward()` are used. When it is not, the only function that uses the dilation is `im2col_cpu_ext()` on the CPU path or `im2col_gpu_ext()` on the GPU path. I suspect that the implementation of dilation in these two functions (which are similar to each other and based on BVLC/Caffe) differs from what cuDNN implements, but I don't know how to make them match.
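For context, here is a paraphrased sketch of the Caffe-style im2col indexing that `im2col_cpu_ext()` / `im2col_gpu_ext()` are based on (not the exact repository code). It shows that the padding argument is taken at face value, so if cuDNN is configured with a different padding than im2col receives, the two paths sample different input windows.

```c
#include <stdio.h>

/* Input row sampled by one kernel tap at one output position in a
 * Caffe-style im2col with dilation (paraphrased, not repository code). */
static int sampled_row(int pad_h, int stride_h, int dilation_h,
                       int kernel_row, int output_row) {
    return -pad_h + kernel_row * dilation_h + output_row * stride_h;
}

int main(void) {
    /* size=3 kernel, stride 1, dilation 2, first output row:
     * pad=1 samples rows -1, 1, 3; pad=2 (pad*dilation) samples rows -2, 0, 2. */
    for (int kernel_row = 0; kernel_row < 3; kernel_row++) {
        printf("pad=1: row %2d    pad=2: row %2d\n",
               sampled_row(1, 1, 2, kernel_row, 0),
               sampled_row(2, 1, 2, kernel_row, 0));
    }
    return 0;
}
```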
I should note that I have a weight file obtained from training with CUDNN. With this weight file, some objects are detected on sample images when CUDNN is active; however, when CUDNN is not active, nothing is detected on the same images.

It seems that the implementation of the convolutional layer with `dilation > 1` when CUDNN is not active is incorrect. Here are the results of training the network on the GPU with and without CUDNN enabled:
Training result charts (attached as images in the original issue):
- `GPU=1, CUDNN=1, CUDNN_HALF=1`
- `GPU=1, CUDNN=0, CUDNN_HALF=0`
Do you mean that for `dilation=2`?

1. `GPU=0 CUDNN=0` - not working properly
2. `GPU=1 CUDNN=0` - not working properly
3. `GPU=1 CUDNN=1` - working properly
Yes, exactly.
Finally, I was able to fix the problem.

When CUDNN is active, `cudnnSetConvolution2dDescriptor()` is called in convolutional_layer.c. The second and third arguments of this function are `pad_h` and `pad_w`:
```c
cudnnStatus_t cudnnSetConvolution2dDescriptor(
    cudnnConvolutionDescriptor_t convDesc,
    int                          pad_h,        // zero-padding height
    int                          pad_w,        // zero-padding width
    int                          u,            // vertical filter stride
    int                          v,            // horizontal filter stride
    int                          dilation_h,   // filter height dilation
    int                          dilation_w,   // filter width dilation
    cudnnConvolutionMode_t       mode,
    cudnnDataType_t              computeType);
```
and the function is called in convolutional_layer.c like this:

```c
cudnnSetConvolution2dDescriptor(l->convDesc,
        l->pad * l->dilation, l->pad * l->dilation,   // pad_h, pad_w
        l->stride_y, l->stride_x,
        l->dilation, l->dilation,
        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
```
On the other hand, when CUDNN is not active, `im2col_cpu_ext()` is called in convolutional_layer.c for the CPU path, and `im2col_gpu_ext()` is called in convolutional_kernels.cu for the GPU path, in a similar way:
```c
im2col_cpu_ext(im,                  // input
        l.c / l.groups,             // input channels
        l.h, l.w,                   // input size (h, w)
        l.size, l.size,             // kernel size (h, w)
        l.pad, l.pad,               // padding (h, w)
        l.stride_y, l.stride_x,     // stride (h, w)
        l.dilation, l.dilation,     // dilation (h, w)
        state.workspace);           // output
```
As can be seen, the padding arguments of `cudnnSetConvolution2dDescriptor()` are set to `l->pad * l->dilation`, while the padding arguments of `im2col_cpu_ext()` and `im2col_gpu_ext()` are set to `l.pad`.
As a result, changing the padding arguments of `im2col_cpu_ext()` and `im2col_gpu_ext()` from `l.pad` to `l.pad * l.dilation` solved the problem, and now the results of CPU, GPU without CUDNN, and GPU with CUDNN are the same.
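Concretely, a sketch of the change in the forward pass, based on the call shown above (the `im2col_gpu_ext()` call in convolutional_kernels.cu gets the same change):

```c
/* forward_convolutional_layer() in convolutional_layer.c, with the padding
 * arguments scaled by the dilation, as described above. */
im2col_cpu_ext(im,                              // input
        l.c / l.groups,                         // input channels
        l.h, l.w,                               // input size (h, w)
        l.size, l.size,                         // kernel size (h, w)
        l.pad * l.dilation, l.pad * l.dilation, // padding (h, w): was l.pad, l.pad
        l.stride_y, l.stride_x,                 // stride (h, w)
        l.dilation, l.dilation,                 // dilation (h, w)
        state.workspace);                       // output
```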
However, I don't know why the padding arguments of `cudnnSetConvolution2dDescriptor()` are `l->pad * l->dilation`, or which of `l->pad * l->dilation` and `l->pad` is correct.

Also, I'm not sure whether a similar change should be made to `im2col_cpu_ext()` and `col2im_cpu_ext()` in `backward_convolutional_layer()`.

I will open a pull request once I'm sure the changes are correct. Thanks for your help @AlexeyAB.
> As a result, changing the padding arguments of `im2col_cpu_ext()` and `im2col_gpu_ext()` from `l.pad` to `l.pad * l.dilation` solved the problem, and now the results of CPU, GPU without CUDNN, and GPU with CUDNN are the same.

Thanks! `l->pad * l->dilation` is correct.
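One way to see why the scaled padding makes sense: this standalone illustration uses the standard convolution output-size formula and assumes `l.pad` was chosen as `size/2`, i.e. "same" padding for `dilation=1` (it is not code from the repository).

```c
#include <stdio.h>

/* Standard convolution output size: a dilated kernel spans
 * dilation*(size-1)+1 input pixels, so
 *   out = (in + 2*pad - (dilation*(size-1) + 1)) / stride + 1. */
static int conv_out_size(int in, int size, int stride, int pad, int dilation) {
    int effective_kernel = dilation * (size - 1) + 1;
    return (in + 2 * pad - effective_kernel) / stride + 1;
}

int main(void) {
    int in = 64, size = 3, stride = 1, dilation = 2;
    int pad = size / 2;  // padding chosen for dilation=1 ("same" output size)
    printf("pad            -> out = %d\n",
           conv_out_size(in, size, stride, pad, dilation));             // 62
    printf("pad * dilation -> out = %d\n",
           conv_out_size(in, size, stride, pad * dilation, dilation));  // 64
    return 0;
}
```

With `pad * dilation` the layer keeps the same spatial size it would have at `dilation=1`, which matches how the cuDNN descriptor is configured.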
> Also, I'm not sure whether a similar change should be made to `im2col_cpu_ext()` and `col2im_cpu_ext()` in `backward_convolutional_layer()`.

Yes, `l->pad * l->dilation` should be used there too.
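For completeness, a sketch of what the corresponding change in `backward_convolutional_layer()` could look like. The argument layout is assumed to mirror the forward-pass call shown earlier; the surrounding code and the exact output pointers are elided, so this is an illustration, not the exact repository code.

```c
/* Sketch only: scale the padding by the dilation in the backward pass too. */
im2col_cpu_ext(im, l.c / l.groups, l.h, l.w, l.size, l.size,
        l.pad * l.dilation, l.pad * l.dilation,  // was: l.pad, l.pad
        l.stride_y, l.stride_x, l.dilation, l.dilation,
        state.workspace);

col2im_cpu_ext(state.workspace, l.c / l.groups, l.h, l.w, l.size, l.size,
        l.pad * l.dilation, l.pad * l.dilation,  // was: l.pad, l.pad
        l.stride_y, l.stride_x, l.dilation, l.dilation,
        state.delta /* plus the per-image/group offset used in the real code */);
```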
You can make a Pull Request with these changes to this repo. Or I will fix it a bit later.