Training YOLO V3 stops at random iteration

MostafaMohamedEr commented 3 years ago

If you have an issue with training - no-detections / Nan avg-loss / low accuracy:

read FAQ: https://github.com/AlexeyAB/darknet/wiki/FAQ---frequently-asked-questions
what command do you use? darknet.exe detector train cfg\Person.data cfg\yolov3TrainPerson.cfg weights\darknet53.conv.74
what dataset do you use?
I use 10k images of open image Person's class
check your dataset - run training with flag -show_imgs i.e. ./darknet detector train ... -show_imgs and look at the aug_...jpg images, do you see correct truth bounded boxes?
Yes
rename your cfg-file to txt-file and drag-n-drop (attach) to your message here
yolov3TrainPerson.txt
show such screenshot with info

Each time I run the training command, it stops at a random iteration, such as 63, 136, or 330. I checked the data and it is correct.

Any advice?

ekesdf commented 3 years ago

It stops because your model should already be perfectly trained: your rewritten_bbox is over 100%. Test it and see whether it gives the result you want. Also try running your command with -map at the end; this calculates the model's accuracy every couple of hundred epochs, roughly every 200.

MostafaMohamedEr commented 3 years ago

It stops because your model should already be perfectly trained: your rewritten_bbox is over 100%. Test it and see whether it gives the result you want. Also try running your command with -map at the end; this calculates the model's accuracy every couple of hundred epochs, roughly every 200.

Thanks for the reply, but it still stops and never reaches a high number of epochs.

tcwhalen commented 3 years ago

A late response, but in case anyone else stumbles upon this:

It stops because your model should already be perfectly trained: your rewritten_bbox is over 100%.

His rewritten_bbox is 1%, not 100%, and if I understand correctly, we expect it to be between 0 and 5% (lower is better). So that looks fine, but it's not an indication that the model is well-trained.

parthlathiya26112 commented 2 years ago

I solved the issue by going through the following steps. They are suggested by https://github.com/AlexeyAB/darknet when you open a new training issue on GitHub:

If you have an issue with training - no-detections / Nan avg-loss / low accuracy:

1. Read the FAQ: https://github.com/AlexeyAB/darknet/wiki/FAQ---frequently-asked-questions
2. Check that the command is correct.
3. Check that the dataset is correct, meaning the bounding boxes and the class indexes.
4. Check your dataset: run training with the flag -show_imgs, i.e. ./darknet detector train ... -show_imgs, and look at the aug_...jpg images. Do you see correct truth bounding boxes?
5. Check that the cfg-file has correct values.
6. Check bad.list and bad_label.list for errors, if they exist.
7. Read "How to train (to detect your custom objects)" and "How to improve object detection" in the README: https://github.com/AlexeyAB/darknet/blob/master/README.md

Mine was solved by the 6th step. In those files I found two issues:

Issue 1: "train.txt" still listed a file that I had deleted myself because of a wrong annotation like "0 0 0 0 0".

Issue 2: I have only 1 output class, but one of the label text files contained class index 15, so I replaced every 15 with 0 and saved the file.
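Both issues can be caught before training with a small script. The following is only a minimal sketch, not part of darknet: the function name check_dataset is hypothetical, and it assumes that each label file sits next to its image with the same name and a .txt extension, and that each label line has the form "class x_center y_center width height" with coordinates normalized to the image size. Relative paths in train.txt are resolved against the current directory.

```python
from pathlib import Path


def check_dataset(train_list, num_classes):
    """Return a list of readable problems found in a darknet-style dataset.

    Assumption: the label for image.jpg is image.txt in the same folder.
    """
    problems = []
    for raw in Path(train_list).read_text().splitlines():
        image_path = raw.strip()
        if not image_path:
            continue  # skip blank lines in the train list
        image = Path(image_path)
        if not image.is_file():
            # Issue 1: train list still names an image that was deleted
            problems.append(f"{image}: listed in train list but missing on disk")
            continue
        label = image.with_suffix(".txt")
        if not label.is_file():
            problems.append(f"{label}: label file missing")
            continue
        for n, line in enumerate(label.read_text().splitlines(), start=1):
            fields = line.split()
            if not fields:
                continue
            if len(fields) != 5:
                problems.append(f"{label}:{n}: expected 5 fields, got {len(fields)}")
                continue
            try:
                cls = int(fields[0])
                x, y, w, h = (float(v) for v in fields[1:])
            except ValueError:
                problems.append(f"{label}:{n}: non-numeric value in '{line}'")
                continue
            if not 0 <= cls < num_classes:
                # Issue 2: e.g. class index 15 in a 1-class dataset
                problems.append(f"{label}:{n}: class index {cls} out of range")
            if not (0 < x < 1 and 0 < y < 1 and 0 < w <= 1 and 0 < h <= 1):
                # catches degenerate annotations such as "0 0 0 0 0"
                problems.append(f"{label}:{n}: box values out of range in '{line}'")
    return problems


if __name__ == "__main__":
    # Hypothetical path and class count; adjust to your own dataset.
    if Path("train.txt").is_file():
        for problem in check_dataset("train.txt", num_classes=1):
            print(problem)
```

An empty result does not guarantee that the boxes are drawn around the right objects; it only rules out missing files, out-of-range class indexes, and degenerate boxes.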

The bad.list and bad_label.list files also give you the error and the name of the file that caused it.

Tip: check bad.list and bad_label.list even if no error is displayed; they reveal hidden errors.
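To make that check routine, the two files can be printed after every run. This is a minimal sketch under assumptions: the helper name read_bad_lists is hypothetical, the files are looked up in the directory the training command was started from, and their lines are shown as-is because the exact line format is not described here.

```python
from pathlib import Path


def read_bad_lists(workdir="."):
    """Return {file name: non-empty lines} for each bad-list file that exists."""
    report = {}
    for name in ("bad.list", "bad_label.list"):
        path = Path(workdir) / name
        if path.is_file():
            lines = path.read_text(errors="replace").splitlines()
            report[name] = [ln.strip() for ln in lines if ln.strip()]
    return report


if __name__ == "__main__":
    found = read_bad_lists()
    if not found:
        print("No bad.list or bad_label.list in this directory.")
    for name, entries in found.items():
        print(f"{name}: {len(entries)} entries")
        for entry in entries:
            print("  ", entry)
```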

Thanks, have fun, and make amazing things.

AlexeyAB / darknet

Training YOLO V3 stops at random iteration #7665