avg loss = -nan when tensor cores are used

AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )

http://pjreddie.com/darknet/


avg loss = -nan when tensor cores are used #2783

Open drapado opened 5 years ago

drapado commented 5 years ago

I'm training standard YOLOv3 with the latest version of this repository on an RTX 2060. Training goes well until iteration 3000, when tensor cores start being used. Right after iteration 3000 the avg loss becomes -nan.

I had the same error a few days ago while training a classifier (#2660).

When I set CUDNN_HALF=0 in the Makefile and repeat the training, everything continues normally at iteration 3000 and beyond.

I'm using Arch Linux, so I had to modify the Makefile to change the location of CUDA (v10.0) and cuDNN (v7.5).

I've attached both the Makefile and the cfg I'm using:

Makefile.txt yolov3_j1.cfg.txt

drapado commented 5 years ago

@AlexeyAB I had the same issue again while training with yolov3-3l. However, when I trained with yolov3-tiny-3l, the problem did not appear in more than 20000 iterations.

AlexeyAB commented 5 years ago

@drapado

What command do you use for training?

Do you set GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1?

Try to set try_fix_nan=1 in the [net] section in your yolov3.cfg file and train again with CUDNN_HALF=1.

drapado commented 5 years ago

> Do you set GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1?

Yes, I used the makefile I posted in the first message.

> Try to set try_fix_nan=1 in the [net] section in your yolov3.cfg file and train again with CUDNN_HALF=1.

Ok, I'll try

Thanks!

drapado commented 5 years ago

> I had the same issue again while training with yolov3-3l.

Sorry, here I meant yolov3-5l.

AlexeyAB commented 5 years ago

So do you get NAN for both yolov3.cfg and yolov3-5l.cfg? But you don't get it for yolov3-tiny-3l.cfg?

Also try to train yolov3.cfg without modified [route] and [upsample] layers. I.e. use default:

[upsample]
stride=2

[route]
layers = -1, 36

drapado commented 5 years ago

> So do you get NAN for both yolov3.cfg and yolov3-5l.cfg? But you don't get it for yolov3-tiny-3l.cfg?

Yes, I get NaN for yolov3, yolov3-5l and the darknet53 classifier, but not with yolov3-tiny-3l.

> Also try to train yolov3.cfg without modified [route] and [upsample] layers. I.e. use default:

I'll try that and post the results. Thanks a lot for your help.

drapado commented 5 years ago

Hi @AlexeyAB,

I tried try_fix_nan=1 in the [net] section with yolov3-5l, but the avg loss blows up to values around 10^9.

I also tried yolov3.cfg without the modified [route] and [upsample] layers and it seems to work without the error, although I cannot use it since I have to detect small objects.
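
To pin down the iteration at which training diverges, the training log can be scanned for the first average loss that is NaN, infinite, or very large. The following is a minimal Python sketch, not part of Darknet. The log line format shown in the comment is an assumption (no log excerpt appears in this thread), and `first_bad_iteration` and its `limit` parameter are hypothetical names.

```python
import math
import re

# Assumed shape of a Darknet training progress line (not taken from this thread):
#   "3001: 812.4, -nan avg loss, 0.001000 rate, 3.2 seconds, 192064 images"
# Group 1 = iteration, group 2 = current loss, group 3 = average loss.
LINE_RE = re.compile(r"^\s*(\d+):\s*(\S+),\s*(\S+) avg loss")


def first_bad_iteration(lines, limit=1e6):
    """Return (iteration, avg_loss) for the first progress line whose average
    loss is NaN, infinite, or larger than `limit`; return None otherwise."""
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a progress line
        iteration = int(match.group(1))
        try:
            # float() accepts "nan", "-nan" and "inf" as well as plain numbers.
            avg_loss = float(match.group(3))
        except ValueError:
            continue  # unexpected token where the average loss should be
        if not math.isfinite(avg_loss) or avg_loss > limit:
            return iteration, avg_loss
    return None
```

With a saved log, this could be called as `first_bad_iteration(open("train.log"))`, where `train.log` is a placeholder file name. A `limit` around 1e6 would also catch the 10^9 blow-up described above, not only the -nan case.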

AlexeyAB commented 5 years ago

@drapado Hi,

If you need to detect small objects, try training the unmodified yolov3-spp.cfg at a higher resolution: width=832 height=832.

The cause of the NaN may be that you have a lot of very small objects (smaller than 1x1 pixel after resizing the image to the network size).

drapado commented 5 years ago

> The cause of the NaN may be that you have a lot of very small objects (smaller than 1x1 pixel after resizing the image to the network size).

I checked and I don't have any objects smaller than 5x5 after resizing. Can that also cause the problem?
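
A check like this can be scripted. The sketch below is one possible way to do it in Python, under two assumptions: labels are in the YOLO text format (`class x_center y_center width height`, all relative to the image size), and the image is resized directly to the network size without letterboxing. `small_boxes` is a hypothetical helper, and 416 is only a placeholder for the `width` and `height` values in the `[net]` section.

```python
def small_boxes(label_lines, net_w=416, net_h=416, min_px=1.0):
    """Return (line_number, width_px, height_px) for every box that is
    narrower or shorter than `min_px` pixels at the network resolution.

    Assumes YOLO-format lines: "class x_center y_center width height",
    with all four box values relative to the image size (0..1).
    """
    flagged = []
    for number, line in enumerate(label_lines, start=1):
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        # Relative size times network size gives the size in pixels
        # after a plain (non-letterboxed) resize.
        width_px = float(parts[3]) * net_w
        height_px = float(parts[4]) * net_h
        if width_px < min_px or height_px < min_px:
            flagged.append((number, width_px, height_px))
    return flagged
```

For example, `small_boxes(open("image1.txt"), net_w=832, net_h=832, min_px=5.0)` would list the boxes in a label file (here the placeholder `image1.txt`) that fall below 5 pixels in either dimension at 832x832.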

i-chaochen commented 5 years ago

I have the same nan problem when the tensor cores are used.

So the correct solution is to set CUDNN_HALF=0 in the Makefile and rebuild? @AlexeyAB

AlexeyAB commented 5 years ago

@i-chaochen Yes.