AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

"cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED" with ZED SDK #7548

Open elias-work opened 3 years ago

elias-work commented 3 years ago

OS: Ubuntu 20.04
CUDA: 11.2
cuDNN: 8.1.1
GPU: Quadro T2000 (compute capability 7.5)
darknet interface: Python 3.8 (darknet.py)
Model: YOLOv4, custom trained, 416x416
Make options:

GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=1
ZED_CAMERA_v2_8=0

In the project I am working on, we can run YOLOv4 on a video file (read via OpenCV), and this works very well. I am now trying to run YOLOv4 on a ZED recording read via the ZED SDK. Unfortunately, with the ZED SDK loaded, I get this error on the very first darknet.py:detect_image call:

 cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Mar 26 2021 - 16:56:54 

 cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
python3: : Unknown error -1990738446

Setting CUDNN=0 changes the error to:

CUDA status Error: file: ./src/blas_kernels.cu : () : line: 859 : build time: Mar 26 2021 - 15:44:45 

 CUDA Error: invalid resource handle
python3: : Unknown error -1946496751

(related? https://github.com/stereolabs/zed-yolo/issues/29)

I can comment out the detect_image call, and the program then keeps reading video correctly and never crashes. Note that we are passing (416,416,3) data to detect_image, and I have verified that the data is correct at the point of the crash.

When it crashes I still have >1GB video memory free.

So I am not 100% certain this is a darknet issue, but since the opaque error is raised inside darknet, does anyone have advice on what to try? I have exhausted my list of reasonable things to try. I would like to test on a better GPU but cannot do so right now. I have not tested CUDNN_HALF=1 because I cannot use it in general, as my GPU lacks tensor cores.

elias-work commented 3 years ago

I have now also tested on an RTX 2080 Ti with CUDA 10.2 and Ubuntu 18.04, and I get the same issue.

elias-work commented 3 years ago

It turns out the issue was caused by my having parallelized ZED camera reading and detection to improve performance on one streaming source. Apparently the CUDA calls are unsafe in this context.

Is this expected behavior, or does it suggest a bug, perhaps in the ZED SDK?
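
For anyone hitting the same error, here is a minimal sketch of one conservative way to avoid overlapping GPU work: route every GPU-touching call through a single dedicated worker thread, so the calls always run on the same thread and never concurrently. This is an illustration, not the fix confirmed in this thread. GpuThread, grab_frame, detect and run are hypothetical names, and the two stub functions stand in for the real ZED grab and darknet.py detect_image calls. In a real program, loading the network and opening the camera would presumably need to go through the same thread as well.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class GpuThread:
    """Runs every submitted call on one dedicated worker thread, in order."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gpu")

    def call(self, fn, *args, **kwargs):
        # Blocks until fn has finished on the worker thread, then returns its
        # result (or re-raises its exception).
        return self._pool.submit(fn, *args, **kwargs).result()

    def close(self):
        self._pool.shutdown(wait=True)


# Hypothetical stand-ins for the real calls (a ZED grab and detect_image).
def grab_frame(index):
    return {"index": index, "thread": threading.current_thread().name}


def detect(frame):
    return {"frame": frame["index"], "thread": threading.current_thread().name}


def run(num_frames):
    gpu = GpuThread()
    try:
        results = []
        for i in range(num_frames):
            frame = gpu.call(grab_frame, i)       # runs on the GPU thread
            detection = gpu.call(detect, frame)   # same thread, after the grab
            results.append((frame, detection))
        return results
    finally:
        gpu.close()
```

This gives up the parallelism between reading and detection, so it is mainly useful for confirming that overlapping calls are the cause. CPU-only work such as decoding, resizing or drawing can still run on other threads.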

SunSonia commented 3 years ago

I also encountered the same problem. Has anyone solved it?

yotamraz commented 2 years ago

Same issue here when running yolov4-tiny on Xavier-NX, compiled with:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=1
ZED_CAMERA_v2_8=0

Has anyone solved it?