drapado opened this issue 5 years ago
@AlexeyAB I had again the same issue while training with yolov3-3l. However, I trained with yolov3-tiny-3l and nothing happened (no NaN) in more than 20000 iterations
@drapado
What command do you use for training?
Do you set GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1?
Try to set try_fix_nan=1 in the [net] section in your yolov3.cfg file and train again with CUDNN_HALF=1.
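For reference, the placement would look something like this (a minimal sketch of the top of the [net] section; the surrounding values are only placeholders and may differ in your file):

[net]
batch=64
subdivisions=16
try_fix_nan=1
# ... rest of the [net] section unchanged ...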
Do you set GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1?
Yes, I used the makefile I posted in the first message.
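For reference, these are the switches at the top of the Makefile I built with (a sketch; the CUDA/cuDNN paths and the rest of the file are unchanged here):

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1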
Try to set try_fix_nan=1 in the [net] section in your yolov3.cfg file and train again with CUDNN_HALF=1.
Ok, I'll try
Thanks!
I had again the same issue while training with yolov3-3l.
Sorry, here I meant yolov3-5l
So do you get NAN for both yolov3.cfg and yolov3-5l.cfg? But you don't get it for yolov3-tiny-3l.cfg?
Also try to train yolov3.cfg without modified [route] and [upsample] layers. I.e. use default:
[upsample]
stride=2

[route]
layers = -1, 36
So do you get NAN for both yolov3.cfg and yolov3-5l.cfg? But you don't get it for yolov3-tiny-3l.cfg?
Yes, I get NaN for yolov3, yolov3-5l and the darknet53 classifier, but not with yolov3-tiny-3l.
Also try to train yolov3.cfg without modified [route] and [upsample] layers. I.e. use default:
I'll try and post the results, thanks a lot for your help
Hi @AlexeyAB,
I tried try_fix_nan=1 in the [net] section, but the avg loss goes crazy with values around 10^9; this is with yolov3-5l.
I also tried yolov3.cfg without the modified [route] and [upsample] and it seems to work without the error, although I cannot use it since I have to detect small objects.
@drapado Hi,
If you need to detect small objects, then try to train the non-modified yolov3-spp.cfg with higher resolution: width=832 height=832
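In the [net] section that change would look like this (a minimal sketch; darknet needs width and height to be multiples of 32, and the other settings stay as in the stock yolov3-spp.cfg):

[net]
# higher input resolution for small objects; both values must be multiples of 32
width=832
height=832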
The reason for the NaN may be that you have very many small objects (smaller than 1x1 pixel after resizing the image to the network size).
The reason for the NaN may be that you have very many small objects (smaller than 1x1 pixel after resizing the image to the network size).
I checked and I don't have objects smaller than 5x5 after resizing; can that also cause the problem?
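For anyone who wants to run the same check, here is a minimal Python sketch. It assumes darknet-format label files (one "class x_center y_center width height" line per object, all values normalized to 0..1), a plain resize to the network size (no letterboxing), and placeholder names train.txt and 832x832 for the image list and network resolution:

import os

NET_W, NET_H = 832, 832  # network input size from the cfg (placeholder)

def tiny_boxes(label_path, min_px=1.0):
    # Yield (w_px, h_px) for boxes smaller than min_px after resizing to NET_W x NET_H.
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue
            _, _, _, w, h = map(float, parts)
            w_px, h_px = w * NET_W, h * NET_H  # box size in pixels at network resolution
            if w_px < min_px or h_px < min_px:
                yield w_px, h_px

if __name__ == "__main__":
    # train.txt lists image paths; this assumes the .txt label sits next to each image,
    # so adjust the label-path rule to your own dataset layout if it differs
    with open("train.txt") as f:
        for img_path in f:
            lbl = img_path.strip().rsplit(".", 1)[0] + ".txt"
            if not os.path.isfile(lbl):
                continue
            for w_px, h_px in tiny_boxes(lbl):
                print("%s: %.2f x %.2f px" % (lbl, w_px, h_px))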
I have the same NaN problem when the tensor cores are used.
So the correct solution is to set CUDNN_HALF=0 in the Makefile and remake? @AlexeyAB
@i-chaochen Yes.
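A sketch of that workaround, assuming a stock darknet checkout (only this one flag changes; everything else in the Makefile stays as it was):

# in the Makefile
CUDNN_HALF=0

# then rebuild
make clean
make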
I'm training normal yolov3 with the latest version of this repository on an RTX 2060, and the training process goes well until I reach 3000 iterations, at which point the tensor cores start being used. Right after reaching iteration 3000 the avg loss becomes -nan.
I had the same error some days ago while training a classifier #2660.
When I set CUDNN_HALF=0 in the Makefile and repeat the training process, everything continues fine when I reach iteration 3000 and beyond.
I'm using Arch Linux, so I had to modify the Makefile to change the location of CUDA (v10.0) and cuDNN (v7.5).
I attach both the Makefile and the cfg I'm using:
Makefile.txt yolov3_j1.cfg.txt