AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

CUDNN error at training iteration 1000 when calculating mAP% #8669

Open stephanecharette opened 1 year ago

stephanecharette commented 1 year ago

Upgraded my Ubuntu 20.04 training rig to install the latest patches. This included a new version of CUDNN. Now using CUDA 11.7.1-1 and CUDNN 8.5.0.96-1+cuda11.7. Darknet is at the latest version from 2022-08-16:

> git log -1
commit 96f08de6839eb1c125c7b86bffe1d3dde9570e5b (HEAD -> master, origin/master, origin/HEAD)
Author: Stefano Sinigardi <stesinigardi@hotmail.com>
Date:   Tue Aug 16 20:20:48 2022 +0200

All of my existing neural networks fail to train. Some are YOLOv4-tiny, others are YOLOv4-tiny-3L. The training rig is an NVIDIA RTX 3090 with 24 GB of VRAM, and the networks fit well in VRAM. When Darknet gets to iteration 1000 in training, where it does the first mAP calculation, it produces this error:

 (next mAP calculation at 1000 iterations) 
 1000: 1.540665, 2.618338 avg loss, 0.002600 rate, 1.743389 seconds, 64000 images, 2.605252 hours left
4Darknet error location: ./src/dark_cuda.c, cudnn_check_error, line #204
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
 Detection layer: 44 - type = 28 

 cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 543 : build time: Sep 13 2022 - 17:44:16 

 cuDNN Error: CUDNN_STATUS_BAD_PARAM
Command exited with non-zero status 1

The only important thing I can think of which changed today is that I installed the latest version of CUDNN8. This is the relevant portion of the upgrade log:

Preparing to unpack .../04-libcudnn8-dev_8.5.0.96-1+cuda11.7_amd64.deb ...
update-alternatives: removing manually selected alternative - switching libcudnn to auto mode
Unpacking libcudnn8-dev (8.5.0.96-1+cuda11.7) over (8.4.1.50-1+cuda11.6) ...
Preparing to unpack .../05-libcudnn8_8.5.0.96-1+cuda11.7_amd64.deb ...
Unpacking libcudnn8 (8.5.0.96-1+cuda11.7) over (8.4.1.50-1+cuda11.6) ...

Curious to know if anyone else has a problem with CUDNN 8.5.0.96, or has an idea as to how to fix this problem.

stephanecharette commented 1 year ago

Downgraded CUDNN from 8.5.0 back to 8.4.1.50. Training works again. This is the command I used to downgrade:

sudo apt-get install libcudnn8-dev=8.4.1.50-1+cuda11.6 libcudnn8=8.4.1.50-1+cuda11.6
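
To confirm which cuDNN packages are actually installed after the downgrade, one quick check on Debian/Ubuntu is:

dpkg -l | grep libcudnn
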
1027663760 commented 1 year ago

The latest version of cuDNN always has various bugs.

chgoatherd commented 1 year ago

Modify copy_weights_net(...) in network.c as follows:

void copy_weights_net(network net_train, network* net_map)
{
    int k;

    for (k = 0; k < net_train.n; ++k)
    {
        layer* l = &(net_train.layers[k]);
        layer tmp_layer;

        copy_cudnn_descriptors(net_train.layers[k], &tmp_layer);
        net_map->layers[k] = net_train.layers[k];
        copy_cudnn_descriptors(tmp_layer, &net_train.layers[k]);

        if (l->type == CRNN)
        {
            layer tmp_input_layer, tmp_self_layer, tmp_output_layer;

            copy_cudnn_descriptors(*net_train.layers[k].input_layer, &tmp_input_layer);
            copy_cudnn_descriptors(*net_train.layers[k].self_layer, &tmp_self_layer);
            copy_cudnn_descriptors(*net_train.layers[k].output_layer, &tmp_output_layer);
            net_map->layers[k].input_layer = net_train.layers[k].input_layer;
            net_map->layers[k].self_layer = net_train.layers[k].self_layer;
            net_map->layers[k].output_layer = net_train.layers[k].output_layer;
            //net_map->layers[k].output_gpu = net_map->layers[k].output_layer->output_gpu;  // already copied out of if()

            copy_cudnn_descriptors(tmp_input_layer, net_train.layers[k].input_layer);
            copy_cudnn_descriptors(tmp_self_layer, net_train.layers[k].self_layer);
            copy_cudnn_descriptors(tmp_output_layer, net_train.layers[k].output_layer);
        }
        else if (l->input_layer) // for AntiAliasing
        {
            layer tmp_input_layer;

            copy_cudnn_descriptors(*net_train.layers[k].input_layer, &tmp_input_layer);
            net_map->layers[k].input_layer = net_train.layers[k].input_layer;
            copy_cudnn_descriptors(tmp_input_layer, net_train.layers[k].input_layer);
        }

        net_map->layers[k].batch = 1;
        net_map->layers[k].steps = 1;
        net_map->layers[k].train = 0;
    }
}

chgoatherd commented 1 year ago

Please refer to issue #8667.

stephanecharette commented 1 year ago

Tried to use libcudnn8 8.6.0.163 today with CUDA 11.8. The same problem still exists; it aborts when it hits iteration #1000. Used the command in the comment above and downgraded to libcudnn8 8.4.1.50, and the problem went away. This needs to be fixed...

https://github.com/AlexeyAB/darknet/issues/8669#issuecomment-1246194925

stephanecharette commented 1 year ago

@AlexeyAB do you have thoughts on the fix for this? Do you need a pull request for @chgoatherd's proposed changes, or is this going down the wrong path?

ryj0902 commented 1 year ago

Same problem as this issue, but it was solved after applying @chgoatherd's suggestion + changing subdivision=16 → 32.
(Changing the subdivision value is not related to this issue; it was for a CUDA OOM error.)
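
For reference, the subdivision change mentioned above lives in the [net] section of the .cfg file; a minimal sketch showing only the two relevant keys (the rest of [net] is omitted):

[net]
batch=64
subdivisions=32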

nailsonlinux commented 1 year ago

I got the same error here at 1000 iterations. At first I just removed the -map option, and training went fine past 1000 iterations.

Later, I downgraded cuDNN and it worked. In my case I'm using CUDA 11.2, inside a container.
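
For context, the -map option being removed here is the flag on the training command line; a typical invocation looks something like this, where the .data, .cfg, and weights file names are placeholders:

./darknet detector train obj.data yolov4-tiny.cfg yolov4-tiny.conv.29 -map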

hnothing2016 commented 1 year ago

I got the same error, running in Docker.

mari9myr commented 1 year ago

Same problem.

Ubuntu 20.04
GPU: RTX A6000 46GB
NVIDIA driver: 515.65.01
Makefile: GPU=1, CUDNN=1, CUDNN_HALF=1, OPENCV=1, OPENMP=1, LIBSO=1
ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]

CUDA Toolkit 11.[7-8] + cuDNN 8.[6-7-8] works only if you use subdivision=batch=64, or, when subdivision is smaller than batch, if you remove the "-map" parameter from the darknet training command. I then followed stephanecharette's instructions and downgraded to CUDA Toolkit 11.6 + cuDNN 8.4.1, and now everything works great with "-map", even when decreasing the subdivision value down to 8.

jackneil commented 1 year ago

@chgoatherd your solution, followed by a rebuild, fixed my issues when using -map with CUDA 11.x on a 3090 as well. Solid. You should create a pull request and get that merged in.

avkwok commented 1 year ago

@chgoatherd Thanks a lot. I hit the same problem and your solution fixed it. I changed network.c and recompiled darknet with vcpkg, CUDA v11.8, and CUDNN v8.6 on Windows 11. Now everything works fine.

stephanecharette commented 1 year ago

I've made the changes that @chgoatherd listed above, switching out net_map->... for net_train... in that function. But I'm still seeing the same error when it attempts to calculate the mAP at iteration 1000.

stephanecharette commented 1 year ago

I'm using Ubuntu 20.04.6, CUDA 12.1.105-1, and CUDNN 8.9.1.23-1+cuda12.1. With the changes to network.c from @chgoatherd listed above from 2022-09-15, the error looks like this:

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
4CUDA status Error: file: ./src/network_kernels.cu: func: network_predict_gpu() line: 735

 CUDA Error: an illegal memory access was encountered
Darknet error location: ./src/network_kernels.cu, network_predict_gpu(), line #735
CUDA Error: an illegal memory access was encountered: Success
backtrace (11 entries)
1/11: darknet(log_backtrace+0x38) [0x562adf9d1dd8]
2/11: darknet(error+0x3d) [0x562adf9d1ebd]
3/11: darknet(check_error+0xd0) [0x562adf9d4eb0]
4/11: darknet(check_error_extended+0x7c) [0x562adf9d4f9c]
5/11: darknet(network_predict_gpu+0x15f) [0x562adfad509f]
6/11: darknet(validate_detector_map+0x9ad) [0x562adfa64f6d]
7/11: darknet(train_detector+0x16a4) [0x562adfa67ca4]
8/11: darknet(run_detector+0x897) [0x562adfa6bc57]
9/11: darknet(main+0x34d) [0x562adf98663d]
10/11: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f433e662083]
11/11: darknet(_start+0x2e) [0x562adf9888be]

Without the changes to network.c, the error looks like this:

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
4
 cuDNN status Error in: file: ./src/convolutional_kernels.cu function: forward_convolutional_layer_gpu() line: 543

 cuDNN Error: CUDNN_STATUS_BAD_PARAM
Darknet error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #543
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Success
backtrace (13 entries)
1/13: darknet(log_backtrace+0x38) [0x5588eb21bdd8]
2/13: darknet(error+0x3d) [0x5588eb21bebd]
3/13: darknet(+0x8bd40) [0x5588eb21ed40]
4/13: darknet(cudnn_check_error_extended+0x7c) [0x5588eb21f2fc]
5/13: darknet(forward_convolutional_layer_gpu+0x2c2) [0x5588eb307802]
6/13: darknet(forward_network_gpu+0x101) [0x5588eb31c281]
7/13: darknet(network_predict_gpu+0x131) [0x5588eb31f0a1]
8/13: darknet(validate_detector_map+0x9ad) [0x5588eb2aef9d]
9/13: darknet(train_detector+0x16a4) [0x5588eb2b1cd4]
10/13: darknet(run_detector+0x897) [0x5588eb2b5c87]
11/13: darknet(main+0x34d) [0x5588eb1d063d]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fb67bcba083]
13/13: darknet(_start+0x2e) [0x5588eb1d28be]

So the call stack and the error message from CUDA/CUDNN are not exactly the same. I think there are multiple issues, and the changes from above expose the next problem.

IMPORTANT

People looking for a quick workaround for this issue, especially if training on hardware you don't own, like Google Colab, where it is complicated to downgrade CUDA/CUDNN: disable CUDNN in the Darknet Makefile and rebuild, as sketched below.

This is not ideal, but it will get you past the problem until a solution is found.
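
A minimal sketch of that workaround, using the flag names from the stock Darknet Makefile (the other flags shown elsewhere in this thread, such as OPENCV, stay at whatever you already use):

# in the Darknet Makefile, before rebuilding:
GPU=1
CUDNN=0        # disable cuDNN
CUDNN_HALF=0   # depends on cuDNN, so disable it as well

# then rebuild:
make clean
make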

stephanecharette commented 1 year ago

The release notes for CUDNN v8.5.0 -- where the problem started -- contain this text:

A buffer was shared between threads and caused segmentation faults. There was previously no way to have a per-thread buffer to avoid these segmentation faults. The buffer has been moved to the cuDNN handle. Ensure you have a cuDNN handle for each thread because the buffer in the cuDNN handle is only for the use of one thread and cannot be shared between two threads.

This sounds like a possible issue. I believe the cudnn handle is initialized in dark_cuda.c, and it looks like it is a global variable shared between all threads. See the two calls to cudnnCreate(), as well as the variables cudnnInit, cudnnHandle, switchCudnnInit and switchCudnnhandle.
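
To illustrate the per-thread pattern the release note describes, here is a minimal sketch of a lazily created per-thread cuDNN handle, assuming a C11 compiler; this is only an idea of the shape of a fix, not the existing dark_cuda.c code:

#include <cudnn.h>

/* one cuDNN handle per thread, created on first use, never shared */
static _Thread_local cudnnHandle_t per_thread_cudnn_handle;
static _Thread_local int per_thread_cudnn_handle_ready = 0;

cudnnHandle_t cudnn_handle_for_current_thread(void)
{
    if (!per_thread_cudnn_handle_ready)
    {
        cudnnCreate(&per_thread_cudnn_handle);  /* each thread owns its own handle */
        per_thread_cudnn_handle_ready = 1;
    }
    return per_thread_cudnn_handle;
}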

stephanecharette commented 1 year ago

Until a proper solution is found, this is still the solution I employ on my training rigs:

sudo apt-get install libcudnn8-dev=8.4.1.50-1+cuda11.6 libcudnn8=8.4.1.50-1+cuda11.6
sudo apt-mark hold libcudnn8-dev
sudo apt-mark hold libcudnn8

As stated 2 comments above, another possible workaround is to disable CUDNN in the Darknet Makefile.

avmusat commented 1 year ago

I made the change @chgoatherd suggested above, and it seems to work on Ubuntu 22.04.2 LTS with CUDA 11.7 + CUDNN 8.9.0

stephanecharette commented 1 year ago

Unfortunately, several (not all) of my neural networks still cause the error to happen even with those changes.

stephanecharette commented 1 year ago

Wondering if the fixes made here might finally solve this issue: https://github.com/hank-ai/darknet/commit/1ea2baf0795c22804e1ef69ddc1d7b1e73d80b0d

stephanecharette commented 1 year ago

Preliminary tests show that this appears to have been fixed by that commit. See the Hank.ai Darknet repo. https://github.com/hank-ai/darknet/commit/1ea2baf0795c22804e1ef69ddc1d7b1e73d80b0d

nailsonlinux commented 1 year ago

Thanks for sharing it @stephanecharette!

asyilanrftr commented 10 months ago

How do I solve this problem?

xxtkidxx commented 6 months ago

I have the same error as you on an RTX 3060; an RTX 2070 Super works normally. Have you fixed it yet?

stephanecharette commented 6 months ago

Yes, this is fixed in the new Darknet/YOLO repo: https://github.com/hank-ai/darknet#table-of-contents