mrhosseini opened this issue 4 years ago

Hi,

Using this network.cfg.txt, the detection results of the CUDA-enabled build of this repository are different from those of the build without CUDA. The CUDA-enabled build produces better results. With Tiny YOLOv3, there is no difference.
What params did you use in the Makefile in both cases?
Without CUDA:

```
GPU=0
CUDNN=0
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0
```

With CUDA:

```
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=0
```
@AlexeyAB
Could you please give me some direction for solving this problem? I am not sure where to start or what to check.
> Different results between CPU and GPU

Show the difference.
Maybe there is a bug in [reorg3d], since it isn't well tested. I think I should rewrite this layer from scratch. Try to find the error, or try to avoid [reorg3d]; does that help?
@AlexeyAB
Thanks for your help. I tested the [reorg3d] layer independently with some manual input, but I didn't see any difference between the CPU and GPU outputs of this layer. Next, I added some code to `forward_network()` and `forward_network_gpu()` to calculate the total sum of the output elements in each layer and compared them (a sketch of this instrumentation is shown after the table). Here is an example for my .cfg file:
sum[layer] | CPU | GPU | CPU − GPU |
---|---|---|---|
sum[0] | 613456.688756 | 613456.68694 | 0.001816 |
sum[1] | 351479.923 | 351487.230334 | -7.307334 |
sum[2] | 191303.358827 | 191302.797801 | 0.561026 |
sum[3] | 393020.180122 | 393037.718248 | -17.538126 |
sum[4] | 199410.899027 | 199409.187487 | 1.711540 |
sum[5] | 105360.992388 | 105360.366701 | 0.625687 |
sum[6] | 205008.054257 | 205008.69806 | -0.643803 |
sum[7] | 94430.198173 | 94430.778119 | -0.579946 |
sum[8] | 205949.192374 | 205946.756022 | 2.436352 |
sum[9] | 97521.177134 | 97517.417314 | 3.759820 |
sum[10] | 206803.803461 | 213020.145404 | -6216.341943 |
sum[11] | 97317.056703 | 110235.342102 | -12918.285399 |
sum[12] | 194838.233836 | 207752.759416 | -12914.525580 |
sum[13] | 89708.116007 | 99181.952804 | -9473.836797 |
sum[14] | 201686.877354 | 228163.717362 | -26476.840008 |
sum[15] | 92507.410713 | 105634.985002 | -13127.574289 |
sum[16] | 182215.526719 | 204816.937806 | -22601.411087 |
sum[17] | 103781.770411 | 103379.089134 | 402.681277 |
sum[18] | 187229.29314 | 196699.370117 | -9470.076977 |
sum[19] | 87579.457722 | 91369.179553 | -3789.721831 |
sum[20] | 191361.228132 | 194748.268687 | -3387.040555 |
sum[21] | 100206.415242 | 104301.907502 | -4095.492260 |
sum[22] | 50600.791245 | 52192.28841 | -1591.497165 |
sum[23] | 109251.02559 | 111743.667888 | -2492.642298 |
sum[24] | 64515.661368 | 61162.404301 | 3353.257067 |
sum[25] | 143081.962076 | 127353.707183 | 15728.254893 |
sum[26] | 205949.192374 | 205946.756022 | 2.436352 |
sum[27] | 44364.052057 | 44367.559203 | -3.507146 |
sum[28] | 44364.052057 | 44367.559203 | -3.507146 |
sum[29] | 187446.014133 | 171721.266386 | 15724.747747 |
sum[30] | 121096.92386 | 120751.287471 | 345.636389 |
sum[31] | 293955.45623 | 287118.418319 | 6837.037911 |
sum[32] | -137878.726125 | -132878.892983 | -4999.833142 |
sum[33] | 11432.234913 | 12133.453549 | -701.218636 |
sum[34] | 351479.923 | 351487.230334 | -7.307334 |
sum[35] | 168728.649886 | 168732.069919 | -3.420033 |
sum[36] | 196935.044232 | 196928.892621 | 6.151611 |
sum[37] | 97969.597906 | 97965.898785 | 3.699121 |
sum[38] | 97740.764431 | 97741.244955 | -0.480524 |
sum[39] | 54131.199064 | 54130.449057 | 0.750007 |
sum[40] | 41354.888962 | 41360.449196 | -5.560234 |
sum[41] | 27076.740513 | 27078.459734 | -1.719221 |
sum[42] | 28407.017031 | 28405.41383 | 1.603201 |
sum[43] | 64281.085072 | 64278.722779 | 2.362293 |
sum[44] | -6049.092757 | -6048.547794 | -0.544963 |
sum[45] | 1011.51904 | 1011.534915 | -0.015875 |
As can be seen, there is a significant difference between the CPU and GPU outputs starting at layer 10. According to the .cfg file, layer 10 is a convolutional layer that uses dilated convolution. I ran the same test with Tiny YOLOv3 (which does not use dilated convolution) and the output difference between CPU and GPU for every layer was smaller than 16. Therefore the problem must be in the calculation of the dilated convolution. I am currently trying to find and fix it, but without success so far.
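For reference, a minimal sketch of the per-layer checksum instrumentation described above. These lines go inside the existing layer loops; `sum_array()` and `cuda_pull_array()` are existing darknet helpers, but the exact lines shown here are only an illustration.

```c
/* Inside the layer loop of forward_network() in network.c (CPU path),
 * right after l.forward(l, state): sum the layer's output buffer. */
printf("sum[%d] | %f\n", i, sum_array(l.output, l.outputs * l.batch));

/* Inside the layer loop of forward_network_gpu() in network_kernels.cu,
 * right after l.forward_gpu(l, state): copy the GPU output back to the
 * host buffer first, then sum it the same way. */
cuda_pull_array(l.output_gpu, l.output, l.outputs * l.batch);
printf("sum[%d] | %f\n", i, sum_array(l.output, l.outputs * l.batch));
```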
@AlexeyAB
Well, I think I have found the problem, but I don't know how to fix it. Comparing `forward_convolutional_layer()` and `forward_convolutional_layer_gpu()`, I noticed that their functionality is similar except when CUDNN is active. I recompiled the code with `GPU=1` but `CUDNN=0` and `CUDNN_HALF=0` and repeated the test above. Without CUDNN, the CPU and GPU results are almost identical. The problem also does not exist when `dilation=1` in a convolutional layer.
When CUDNN is active, `cudnnSetConvolution2dDescriptor()` and `cudnnConvolutionForward()` are used. When it is not, the only function that uses the dilation is `im2col_cpu_ext()` on the CPU path or `im2col_gpu_ext()` on the GPU path. I suspect that the implementation of dilation in these two functions (which are similar to each other and based on BVLC/Caffe) differs from what cuDNN implements, but I don't know how to make them match.
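For context, here is a paraphrased sketch of the Caffe-style im2col indexing that `im2col_cpu_ext()` / `im2col_gpu_ext()` are based on (not the exact repository code). It shows that the padding argument is taken at face value, so if cuDNN is configured with a different padding than im2col receives, the two paths sample different input windows.

```c
#include <stdio.h>

/* Input row sampled by one kernel tap at one output position in a
 * Caffe-style im2col with dilation (paraphrased, not repository code). */
static int sampled_row(int pad_h, int stride_h, int dilation_h,
                       int kernel_row, int output_row) {
    return -pad_h + kernel_row * dilation_h + output_row * stride_h;
}

int main(void) {
    /* size=3 kernel, stride 1, dilation 2, first output row:
     * pad=1 samples rows -1, 1, 3; pad=2 (pad*dilation) samples rows -2, 0, 2. */
    for (int kernel_row = 0; kernel_row < 3; kernel_row++) {
        printf("pad=1: row %2d    pad=2: row %2d\n",
               sampled_row(1, 1, 2, kernel_row, 0),
               sampled_row(2, 1, 2, kernel_row, 0));
    }
    return 0;
}
```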
I should note that I have a weight file obtained from training with CUDNN. With this weight file, some objects are detected on sample images when CUDNN is active; however, when CUDNN is not active, nothing is detected on the same images.

It seems that the implementation of the convolutional layer with `dilation > 1` when CUDNN is not active is incorrect. Here are the results of training the network on the GPU with and without CUDNN enabled:
Training result charts (attached as images in the original issue):
- `GPU=1, CUDNN=1, CUDNN_HALF=1`
- `GPU=1, CUDNN=0, CUDNN_HALF=0`
Do you mean that for `dilation=2`?

1. `GPU=0 CUDNN=0` - not working properly
2. `GPU=1 CUDNN=0` - not working properly
3. `GPU=1 CUDNN=1` - working properly
Yes, exactly.
Finally, I was able to fix the problem.

When CUDNN is active, `cudnnSetConvolution2dDescriptor()` is called in convolutional_layer.c. The second and third arguments of this function are `pad_h` and `pad_w`:
```c
cudnnStatus_t cudnnSetConvolution2dDescriptor(
    cudnnConvolutionDescriptor_t convDesc,
    int                          pad_h,        // zero-padding height
    int                          pad_w,        // zero-padding width
    int                          u,            // vertical filter stride
    int                          v,            // horizontal filter stride
    int                          dilation_h,   // filter height dilation
    int                          dilation_w,   // filter width dilation
    cudnnConvolutionMode_t       mode,
    cudnnDataType_t              computeType);
```
and the function is called in convolutional_layer.c like this:

```c
cudnnSetConvolution2dDescriptor(l->convDesc,
        l->pad * l->dilation, l->pad * l->dilation,   // pad_h, pad_w
        l->stride_y, l->stride_x,
        l->dilation, l->dilation,
        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
```
On the other hand, when CUDNN is not active, `im2col_cpu_ext()` is called in convolutional_layer.c for the CPU path, and `im2col_gpu_ext()` is called in convolutional_kernels.cu for the GPU path, in a similar way:
```c
im2col_cpu_ext(im,                  // input
        l.c / l.groups,             // input channels
        l.h, l.w,                   // input size (h, w)
        l.size, l.size,             // kernel size (h, w)
        l.pad, l.pad,               // padding (h, w)
        l.stride_y, l.stride_x,     // stride (h, w)
        l.dilation, l.dilation,     // dilation (h, w)
        state.workspace);           // output
```
As can be seen, the padding arguments of `cudnnSetConvolution2dDescriptor()` are set to `l->pad * l->dilation`, while the padding arguments of `im2col_cpu_ext()` and `im2col_gpu_ext()` are set to `l.pad`.
As a result, changing the padding arguments of `im2col_cpu_ext()` and `im2col_gpu_ext()` from `l.pad` to `l.pad * l.dilation` solved the problem, and now the results of CPU, GPU without CUDNN, and GPU with CUDNN are the same.
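Concretely, a sketch of the change in the forward pass, based on the call shown above (the `im2col_gpu_ext()` call in convolutional_kernels.cu gets the same change):

```c
/* forward_convolutional_layer() in convolutional_layer.c, with the padding
 * arguments scaled by the dilation, as described above. */
im2col_cpu_ext(im,                              // input
        l.c / l.groups,                         // input channels
        l.h, l.w,                               // input size (h, w)
        l.size, l.size,                         // kernel size (h, w)
        l.pad * l.dilation, l.pad * l.dilation, // padding (h, w): was l.pad, l.pad
        l.stride_y, l.stride_x,                 // stride (h, w)
        l.dilation, l.dilation,                 // dilation (h, w)
        state.workspace);                       // output
```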
However, I don't know why the padding arguments of `cudnnSetConvolution2dDescriptor()` are `l->pad * l->dilation`, or which of `l->pad * l->dilation` and `l->pad` is correct.

Also, I'm not sure whether a similar change should be made to `im2col_cpu_ext()` and `col2im_cpu_ext()` in `backward_convolutional_layer()`.

I will open a pull request once I'm sure the changes are correct. Thanks for your help @AlexeyAB.
> As a result, changing the padding arguments of `im2col_cpu_ext()` and `im2col_gpu_ext()` from `l.pad` to `l.pad * l.dilation` solved the problem, and now the results of CPU, GPU without CUDNN, and GPU with CUDNN are the same.

Thanks! `l->pad * l->dilation` is correct.
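One way to see why the scaled padding makes sense: this standalone illustration uses the standard convolution output-size formula and assumes `l.pad` was chosen as `size/2`, i.e. "same" padding for `dilation=1` (it is not code from the repository).

```c
#include <stdio.h>

/* Standard convolution output size: a dilated kernel spans
 * dilation*(size-1)+1 input pixels, so
 *   out = (in + 2*pad - (dilation*(size-1) + 1)) / stride + 1. */
static int conv_out_size(int in, int size, int stride, int pad, int dilation) {
    int effective_kernel = dilation * (size - 1) + 1;
    return (in + 2 * pad - effective_kernel) / stride + 1;
}

int main(void) {
    int in = 64, size = 3, stride = 1, dilation = 2;
    int pad = size / 2;  // padding chosen for dilation=1 ("same" output size)
    printf("pad            -> out = %d\n",
           conv_out_size(in, size, stride, pad, dilation));             // 62
    printf("pad * dilation -> out = %d\n",
           conv_out_size(in, size, stride, pad * dilation, dilation));  // 64
    return 0;
}
```

With `pad * dilation` the layer keeps the same spatial size it would have at `dilation=1`, which matches how the cuDNN descriptor is configured.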
> Also, I'm not sure whether a similar change should be made to `im2col_cpu_ext()` and `col2im_cpu_ext()` in `backward_convolutional_layer()`.

Yes, `l->pad * l->dilation` should be used there too.
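For completeness, a sketch of what the corresponding change in `backward_convolutional_layer()` could look like. The argument layout is assumed to mirror the forward-pass call shown earlier; the surrounding code and the exact output pointers are elided, so this is an illustration, not the exact repository code.

```c
/* Sketch only: scale the padding by the dilation in the backward pass too. */
im2col_cpu_ext(im, l.c / l.groups, l.h, l.w, l.size, l.size,
        l.pad * l.dilation, l.pad * l.dilation,  // was: l.pad, l.pad
        l.stride_y, l.stride_x, l.dilation, l.dilation,
        state.workspace);

col2im_cpu_ext(state.workspace, l.c / l.groups, l.h, l.w, l.size, l.size,
        l.pad * l.dilation, l.pad * l.dilation,  // was: l.pad, l.pad
        l.stride_y, l.stride_x, l.dilation, l.dilation,
        state.delta /* plus the per-image/group offset used in the real code */);
```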
You can make a Pull Request with these changes to this repo. Or I will fix it a bit later.