mAP dropped during training from 73% to 40%

alexis-gruet-deel commented 4 years ago

Hello, I don't really know if this is expected.

Any advice is welcome.

AlexeyAB commented 4 years ago

The train/val datasets are too small or wrong.

alexis-gruet-deel commented 4 years ago

Hi, and thanks for answering. The dataset is ~15K images, with 3K for validation. Is that not enough? Does "wrong" mean high bias, or something else?

AlexeyAB commented 4 years ago

https://github.com/AlexeyAB/darknet/wiki/FAQ---frequently-asked-questions
Check your dataset: run training with the flag -show_imgs, i.e. ./darknet detector train ... -show_imgs, and look at the aug_...jpg images. Do you see correct ground-truth bounding boxes?
Rename your cfg-file to a txt-file and drag-n-drop (attach) it to your message here.
Show the content of the generated files bad.list and bad_label.list, if they exist.
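
For a headless machine where -show_imgs cannot open a window, a rough offline check of the label files can catch some of the same problems. The sketch below is not Darknet's own validation logic; it only assumes the usual YOLO text label format (one object per line: class id, then x_center, y_center, width, height, all normalized to the image size). The function names and the num_classes parameter are hypothetical.

```python
from pathlib import Path


def check_label_line(line, num_classes):
    """Return an error string for a malformed label line, or None if it looks fine."""
    parts = line.split()
    if len(parts) != 5:
        return "expected 5 fields: class x_center y_center width height"
    try:
        cls = int(parts[0])
        x, y, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return "non-numeric field"
    if not 0 <= cls < num_classes:
        return f"class id {cls} outside 0..{num_classes - 1}"
    if not all(0.0 <= v <= 1.0 for v in (x, y)):
        return "box centre outside [0, 1]"
    if not (0.0 < w <= 1.0 and 0.0 < h <= 1.0):
        return "box width/height outside (0, 1]"
    return None


def check_label_file(path, num_classes):
    """Return a list of (line_number, error) pairs for one label file."""
    errors = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        err = check_label_line(line, num_classes)
        if err:
            errors.append((lineno, err))
    return errors
```

Running check_label_file over every .txt next to the training images gives a quick list of suspicious labels. It does not replace looking at the aug_...jpg images, since a box can be well-formed and still drawn around the wrong object.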

1027663760 commented 4 years ago

What is bad.list and bad_label.list used for? I didn't have these files when I was training

alexis-gruet-deel commented 4 years ago

Hi, over remote X there is no way to make -show_imgs work, because of "[xcb] Unknown request in queue while dequeuing"; see below:

 [...]
 158 conv    512       1 x 1/ 1     10 x  10 x1024 ->   10 x  10 x 512 0.105 BF
 159 conv   1024       3 x 3/ 1     10 x  10 x 512 ->   10 x  10 x1024 0.944 BF
 160 conv     24       1 x 1/ 1     10 x  10 x1024 ->   10 x  10 x  24 0.005 BF
 161 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000 
Total BFLOPS 35.253 
avg_outputs = 289965 
 Allocate additional workspace_size = 131.24 MB 
Loading weights from backup/yolov4-hails_last.weights...
 seen 64, trained: 742 K-images (11 Kilo-batches_64) 
Done! Loaded 162 layers from weights-file 
Learning Rate: 0.001, Momentum: 0.949, Decay: 0.0005
 If error occurs - run training with flag: -dont_show 
Resizing, random_coef = 1.40 

 480 x 480 
 Create 6 permanent cpu-threads 
[xcb] Unknown request in queue while dequeuing
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
darknet: ../../src/xcb_io.c:165: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.
Aborted (core dumped)

There are no files called bad.list or bad_label.list.

I can confirm that the aug_* images have the correct bounding boxes, as follows:

[image: aug_1951854340_0_629545270]

cfg file attached: yolov4-hailnet.txt

alexis-gruet-deel commented 4 years ago

> What is bad.list and bad_label.list used for? I didn't have these files when I was training

I guess those files are created if a file in the train/val set is missing or has corrupted label(s); better to ask @AlexeyAB.

AlexeyAB commented 4 years ago

> I guess those files are created if a file in the train/val set is missing or has corrupted label(s);

Yes. If you don't have these files - then all is ok.
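
To double-check the same thing independently of Darknet, the image list can be scanned for entries whose image or label file is missing. This is only a sketch: it assumes the list file holds one image path per line and that each label sits next to its image with a .txt extension, which may not match every dataset layout. The function name find_missing is hypothetical.

```python
from pathlib import Path


def find_missing(list_file):
    """Return (missing_images, missing_labels) for the paths listed in list_file."""
    missing_images, missing_labels = [], []
    for raw in Path(list_file).read_text().splitlines():
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        image = Path(entry)
        if not image.is_file():
            missing_images.append(entry)
        elif not image.with_suffix(".txt").is_file():
            # Assumed layout: label file = image path with the extension swapped to .txt
            missing_labels.append(entry)
    return missing_images, missing_labels
```

Relative paths in the list are resolved against the current working directory, so the script should be run from the directory the training command is normally started from.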

AlexeyAB commented 4 years ago

Your anchors/masks are wrong. Train with the default anchors. https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

Only if you are an expert in neural detection networks - recalculate anchors for your dataset for the width and height from the cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416, then set the same 9 anchors in each of the 3 [yolo]-layers in your cfg-file. But you should change the indexes of anchors masks= for each [yolo]-layer, so that for YOLOv4 the 1st [yolo]-layer has anchors smaller than 30x30, the 2nd smaller than 60x60, and the 3rd the remaining ones, and vice versa for YOLOv3. Also, you should change the filters=(classes + 5)* before each [yolo]-layer. If many of the calculated anchors do not fit under the appropriate layers, then just try using all the default anchors.
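
The mask rule above can be sketched in a few lines. This is an illustration, not Darknet code: it reads "smaller than 30x30" as both width and height being below 30 (one possible interpretation), and the anchors in the example are made-up values, not the defaults and not the ones from this issue. The function name assign_masks is hypothetical.

```python
def assign_masks(anchors, small=30, medium=60):
    """Group anchor indices for the 1st, 2nd and 3rd [yolo]-layer (YOLOv4 order)."""
    masks = {"first": [], "second": [], "third": []}
    for index, (w, h) in enumerate(anchors):
        if w < small and h < small:
            masks["first"].append(index)   # anchors smaller than 30x30
        elif w < medium and h < medium:
            masks["second"].append(index)  # anchors smaller than 60x60
        else:
            masks["third"].append(index)   # everything that remains
    return masks


# Hypothetical anchors (width, height), in the order they appear in anchors=
example_anchors = [
    (10, 14), (18, 25), (27, 22),
    (35, 48), (50, 40), (58, 55),
    (80, 120), (150, 100), (300, 280),
]

for layer, indexes in assign_masks(example_anchors).items():
    print(layer, "mask =", ",".join(str(i) for i in indexes))
```

If the groups come out very uneven (for example one layer with no anchors at all), that is the case where the quoted advice applies: fall back to the default anchors. For YOLOv3 the layer order is reversed, as stated above.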

alexis-gruet-deel commented 4 years ago

I generated those anchors with the darknet command. They were calculated from my dataset; did you see something wrong with that?

AlexeyAB commented 4 years ago

You should change the indexes of anchors masks= for each [yolo]-layer.

AlexeyAB / darknet

mAP dropped during training from 73% to 40% #6135