some nan during training

Whisper94 commented 4 years ago

Hello!

I need your help!

I have read all the instructions on how to train on a custom dataset, and training itself works. BUT!

When I train yolov3, I get some nan values in the Avg fields. Despite the nan values, the chart shows a very good mAP.

What have I done? My dataset consists of simple images at 150x150 resolution. I'm training yolo at a 32x32 network size to detect these small objects in FullHD or 4k images. I think training on cropped images first is a good approach for detecting small objects in the full image later. But with yolov3 in darknet it doesn't work. With a fork it worked very well. After training yolov4 with the same training parameters it works, but not well enough, or I don't know which parameters I should change in the cfg to get the best performance. With yolov4 I also had a pretty good mAP during training, but the loss was generally higher than with yolov3. (Image: final_after_all_iterations)

You can see two images from training yolov3. Sometimes 1/3 of the values are nan, sometimes 2/3, although mAP and loss are pretty good. (Images: 2020-05-19 22_13_20-Window, 2020-05-19 21_48_30-Window) In this training I tried to train on the original images with a 128x128 network size in the cfg file and then to detect objects at 1922x1440 resolution, but the results were very bad. With thresh 0.1 only a few objects were detected, with scores of about 10-20%. If I reduce the network size in the cfg for detection, I get no detections or wrong ones; if I increase it, I run out of memory (with batch and subdivisions both set to 64).

To speed up training at the small resolution, I use batch=512 and subdivisions=16 in the cfg file for the 32x32 resolution. The other parameters are as in the instructions.

Could you explain what I could improve in my parameters? Thank you!!!

WongKinYiu commented 4 years ago

every data augmentation makes the loss higher.
when count=0 (meaning no ground-truth object is assigned to the corresponding anchors), the loss must be nan.
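
To illustrate the count=0 case: a per-layer average is a sum divided by the number of matched objects, so with no matches it becomes 0/0. The Python sketch below is only an illustration, not darknet's code, and the function name `average_iou` is made up. A floating-point 0/0 in C yields NaN, while Python raises an exception, so the sketch returns NaN explicitly.

```python
import math

def average_iou(ious):
    """Mean IoU over matched ground-truth boxes; NaN when nothing matched."""
    count = len(ious)
    if count == 0:
        # A C float division 0.0 / 0 yields NaN. Python would raise
        # ZeroDivisionError instead, so NaN is returned explicitly here.
        return math.nan
    return sum(ious) / count

print(average_iou([0.62, 0.71]))  # a layer with two matched objects
print(average_iou([]))            # a layer with count = 0
```

A layer that receives no ground truth in a given mini-batch reports NaN for that statistic, while layers that do receive objects report normal values.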

i suggest you recalculate the anchors.
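
The AlexeyAB/darknet README describes a calc_anchors mode of the detector command for this. As a rough idea of what recalculating anchors involves, here is a minimal k-means sketch over label (width, height) pairs with 1 - IoU as the distance. It is not darknet's implementation; the function names and the sample box sizes are made up for illustration, and it assumes there are at least k boxes.

```python
def iou_wh(box, anchor):
    """IoU of two boxes that share a corner, given only (width, height)."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (width, height) pairs into k anchors using 1 - IoU as distance."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic start: k boxes spread evenly across the size-sorted list.
    step = len(ordered) / k
    anchors = [ordered[int(i * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as the smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        updated = []
        for anchor, members in zip(anchors, clusters):
            if not members:
                updated.append(anchor)  # keep the anchor of an empty cluster
                continue
            updated.append((sum(w for w, _ in members) / len(members),
                            sum(h for _, h in members) / len(members)))
        if updated == anchors:
            break  # assignments stopped changing
        anchors = updated
    return sorted(anchors, key=lambda a: a[0] * a[1])

# Hypothetical label sizes in pixels, measured at the network input resolution.
boxes = [(14, 15), (15, 15), (16, 14), (15, 16), (48, 50), (50, 52), (52, 49)]
for w, h in kmeans_anchors(boxes, k=2):
    print(round(w), round(h))
```

The box sizes must be expressed at the width and height set in the cfg file, because anchors are given in pixels of the network input.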

Whisper94 commented 4 years ago

Hello. Thank you very much for your fast answer. I use default anchors in cfg file.

anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326

Do you know how I can recalculate new anchors for such small objects? In the training dataset my image resolution (in the cfg file) is 32x32, so the objects are about 15x15 pixels in the image. But if I change the resolution to 128x128, my objects will be about 50x50 pixels, which should be big enough for the ground truth to match the corresponding anchors.

Thank you!

AlexeyAB commented 4 years ago

I see that GIoU and IoU=nan, but I don't see Nan avg loss

https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

Note: If during training you see nan values for avg (loss) field - then training goes wrong, but if nan is in some other lines - then training goes well.
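
A small sketch of how this rule could be checked automatically on a saved training log. It assumes iteration summary lines of the form "<iteration>: <loss>, <avg loss> avg loss, ..."; the helper name `avg_loss_is_nan` and the sample lines are hypothetical, so adjust the pattern to what your build actually prints.

```python
import re

# Assumed line shape: "<iteration>: <loss>, <avg loss> avg loss, <rate> rate, ..."
AVG_LOSS = re.compile(r"^\s*(\d+):\s*\S+,\s*(\S+)\s+avg loss")

def avg_loss_is_nan(line):
    """Return True/False for an iteration summary line, None for any other line."""
    match = AVG_LOSS.match(line)
    if match is None:
        return None  # not a summary line, so nan here is harmless
    # Substring check also covers spellings such as "-nan" or "-nan(ind)".
    return "nan" in match.group(2).lower()

# Made-up sample lines, for illustration only.
samples = [
    " 1200: 0.854321, 0.901234 avg loss, 0.001000 rate",
    " 1201: nan, nan avg loss, 0.001000 rate",
    "(IOU: nan, GIOU: nan) count: 0",
]
for line in samples:
    print(avg_loss_is_nan(line))
```

Only a True result points to a training run that has gone wrong; nan in the other lines is ignored, in line with the note above.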

Whisper94 commented 4 years ago

@AlexeyAB thank you for your comment, and sorry. I misunderstood and thought that something was wrong if Avg (IOU: ...) is nan, because during training 1/3 up to 2/3 of the values are nan.

Could you comment on whether I should recalculate the anchors for training, or how I could improve detection? When I train at 32x32 resolution and detect at 1920x1920, I get detection scores of only 10%-50%. At a lower detection resolution it shows nothing...

AlexeyAB commented 4 years ago

Do you use width=32 height=32 for training? Or what do you mean? Show your training and testing images.

AlexeyAB commented 4 years ago

What mAP do you get on large images?

Whisper94 commented 4 years ago

Sorry for the question, but how can I check mAP on large images? Just 'map' instead of 'test'? Like: darknet.exe detector map data/obj.data yolo_detection.cfg yolov3_final.weights? If so, then I get an AP of 0,00% everywhere.

If I try to detect at 1920x1440 (WxH), it works only with a threshold of 0.1, and I get results like those in the image. For lower-resolution images it doesn't work.

AlexeyAB commented 4 years ago

> Sorry for the question, but how can I check mAP on large images? Just 'map' instead of 'test'? Like: darknet.exe detector map data/obj.data yolo_detection.cfg yolov3_final.weights? If so, then I get an AP of 0,00% everywhere.

Yes, set a high width and height in the cfg file, set valid=valid.txt in the obj.data file, and run this command. Your validation dataset should be labeled too, in the same format as the training dataset.
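
An AP of 0,00% can also come from validation images that have no label files. The sketch below lists such images. It assumes each label is a .txt file with the same name next to its image and that valid.txt holds one image path per line; the function name `missing_labels` is made up for illustration.

```python
from pathlib import Path

def missing_labels(valid_list):
    """Return image paths from valid_list that lack a same-named .txt label file."""
    missing = []
    for line in Path(valid_list).read_text().splitlines():
        image = line.strip()
        if not image:
            continue  # skip blank lines in the list file
        # Assumption: data/obj/img1.jpg is labeled by data/obj/img1.txt.
        if not Path(image).with_suffix(".txt").is_file():
            missing.append(image)
    return missing
```

Calling `missing_labels("valid.txt")` from the directory that the paths in the list are relative to returns an empty list when every validation image has a label file.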

> If I try to detect at 1920x1440 (WxH), it works only with a threshold of 0.1, and I get results like those in the image.

Can you show an image with detection results?

AlexeyAB / darknet

some nan during training #5680