rsicak closed this issue 1 year ago.
Hello, what is the build command you used to build the Docker image? Can you try to rebuild the image with the build args specified explicitly and without cache, e.g.:
sudo docker build -f docker/Dockerfile -t darknet_yolov4_gpu:1 --build-arg GPU=1 --build-arg DOWNLOAD_ALL=1 --build-arg CUDNN=1 --build-arg CUDNN_HALF=0 --build-arg OPENCV=1 . --no-cache
Hi hadikoub, I have tried to build the Docker image as you suggested (DOWNLOAD_ALL=1, CUDNN_HALF=0 and --no-cache) and it works on an older machine with a GTX 1060 6GB; GPU utilization is almost 100 percent. When I build the same image on a machine with an RTX 3080, or on another machine with an A4000, with CUDNN_HALF=1, the behavior is strange. After starting the container, only one CPU core is busy for many minutes; after that something runs, but the GPU stays idle (according to nvidia-smi), and for about half an hour the log file contains only this:
CUDA-version: 10000 (11060), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 2
CUDNN_HALF=1
OpenCV version: 3.2.0
0 : compute_capability = 860, cudnn_half = 1, GPU: NVIDIA RTX A4000
layer filters size/strd(dil) input output
After half an hour, the log file contains something like:
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.000000, GIOU: 0.000000), Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 16, class_loss = -nan, iou_loss = -nan, total_loss = -nan
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 150 Avg (IOU: 0.000000, GIOU: 0.000000), Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 17, class_loss = -nan, iou_loss = -nan, total_loss = -nan
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.000000, GIOU: 0.000000), Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 5, class_loss = -nan, iou_loss = -nan, total_loss = -nan
It seems there is some problem with modern Nvidia GPUs. Robert
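P.S. In case it helps anyone reproduce this: I was checking utilization with plain nvidia-smi, and something like the following keeps the output refreshing every second (assuming watch is available on the host; any equivalent polling of nvidia-smi works too):

watch -n 1 nvidia-smi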
Hello Robert,
Glad that the solution is now working. Regarding the other issue you are facing on the newer RTX 3080 and A4000: this is not caused by the solution itself but by Nvidia CUDA support on newer devices. Currently, the Nvidia RTX 30 series (Ampere, which also includes the A4000) supports CUDA 11.x only, with no backward compatibility with older CUDA versions such as the one currently used in the solution (CUDA 10.0).
Please refer to:
https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
https://docs.nvidia.com/cuda/ampere-compatibility-guide/index.html
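As a quick check on any machine, the GPU compute capability and the CUDA toolkit version in use can be queried with something like the commands below (the compute_cap query field is only available with fairly recent drivers, so treat this as a sketch):

nvidia-smi --query-gpu=name,compute_cap --format=csv    # RTX 3080 / A4000 report 8.6
nvcc --version                                          # CUDA toolkit version on the host or inside the container

Compute capability 8.6 (Ampere) is only targeted natively by CUDA 11.1 and newer, which is why a CUDA 10.0 build does not work on these cards.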
Hi Hadi, thanks for the fast response. I will look at it. Best regards. Robert
Hello again @rsicak,
I've created a branch for CUDA 11 support named cuda11_support (link: https://github.com/BMW-InnovationLab/BMW-YOLOv4-Training-Automation/tree/cuda11_support), but it's still under testing, so stability is not fully guaranteed.
You can take a look at it if that is convenient for you.
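A minimal sketch of trying the branch, assuming the Dockerfile path and build args are unchanged from master (the darknet_yolov4_gpu:cuda11 tag is just an example name):

git clone -b cuda11_support https://github.com/BMW-InnovationLab/BMW-YOLOv4-Training-Automation.git
cd BMW-InnovationLab-YOLOv4-Training-Automation 2>/dev/null || cd BMW-YOLOv4-Training-Automation
sudo docker build -f docker/Dockerfile -t darknet_yolov4_gpu:cuda11 --build-arg GPU=1 --build-arg DOWNLOAD_ALL=1 --build-arg CUDNN=1 --build-arg CUDNN_HALF=0 --build-arg OPENCV=1 --no-cache .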
Hi, I have tried the new branch. The Docker image compiled and the container ran. Training worked up to about 300 iterations and then stopped with a cuDNN error. I have CUDA 11.6, so I replaced "nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04" with the newer "nvidia/cuda:11.5.1-cudnn8-devel-ubuntu20.04" in the Dockerfile and rebuilt the image. After that it works with the RTX 3080 and A4000 GPUs. Training on the sample dataset finished fine after a few hours. It also works with dual A4000 GPUs.
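For reference, the change amounts to swapping the base image line in docker/Dockerfile on the cuda11_support branch and rebuilding the image, roughly:

# docker/Dockerfile
# old base image:
# FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04
FROM nvidia/cuda:11.5.1-cudnn8-devel-ubuntu20.04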
Hi,
Thank you for the suggestion. I'll try to change the base image to the suggested one and do some tests.
Hi, I have the same issue as others in this issue history. I have tried the suggested solution of setting DOWNLOAD_ALL=1 in the Dockerfile, but it does not work for me. I have yolov4.weights in the right folder under config/darknet/yolov4_default_weights/. Any help? Thank you. Robert
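In case it helps with debugging, a quick way to confirm the weights file is in place and not truncated (the pre-trained yolov4.weights is roughly 250 MB; a much smaller file usually means a broken download):

ls -lh config/darknet/yolov4_default_weights/yolov4.weights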