AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

When should I stop training if the validation metrics don't go down? #807

Open yangulei opened 6 years ago

yangulei commented 6 years ago

I train my model on Ubuntu 16.04 with the command below:

darknet detector train <data_file> <cfg_file> darknet19_448.conv.23 | tee log.txt

Here is my learning rate strategy in the cfg_file:

learning_rate=0.0001
max_batches = 90000
policy=steps
steps=200,50000,70000
scales=10,.1,.1
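For reference, under the steps policy darknet multiplies the current learning rate by the matching scale each time the iteration count passes a step, so this config gives 0.0001 → 0.001 at iteration 200, back to 0.0001 at 50000, and 0.00001 at 70000. A minimal sketch of that logic (the function name is mine, and burn_in is assumed unset, as in the cfg above):

```python
def steps_lr(iteration, base_lr=0.0001,
             steps=(200, 50000, 70000), scales=(10, 0.1, 0.1)):
    """Learning rate at a given iteration under darknet's `steps` policy.

    Each scale whose step has already been passed is applied
    multiplicatively to the base learning rate.
    """
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale
    return lr
```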

And here is the chart produced during training: [loss chart]

After the training, I use my Python script to validate the models at different training steps. The script runs the command:

darknet detector map <data_file> <cfg_file> <weight_file> 1>log_file

and parses the output to get the metrics, then plots them. Here is the plot I got: [metrics plot]
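For anyone writing a similar script: the metrics can be pulled out of the log with a few regular expressions. The exact wording of the `darknet detector map` output varies between darknet versions, so the patterns below are assumptions to adapt, not a fixed contract:

```python
import re

def parse_map_log(text):
    """Extract common metrics from a `darknet detector map` log.

    The patterns are guesses at the log format and may need adjusting
    for your darknet version.
    """
    patterns = {
        "mAP": r"mean average precision \(mAP[^)]*\)\s*=\s*([\d.]+)",
        "precision": r"precision\s*=\s*([\d.]+)",
        "recall": r"recall\s*=\s*([\d.]+)",
        "F1": r"F1-score\s*=\s*([\d.]+)",
    }
    metrics = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            metrics[name] = float(match.group(1))
    return metrics
```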

which shows that the metrics don't go down like in this plot: [metrics plot] So, when should I stop training, or which weight file should I choose?

BTW: my Python script is ugly, but it does work.

AlexeyAB commented 6 years ago

Did you get mAP on a validation dataset (images that weren't used during training)? If yes, then you can use any weights-file since ~54 000 iterations. I.e. take any weights-file with the highest mAP (or the weights-file with the highest mAP, Precision and Recall).
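Selecting the checkpoint is then just an argmax over the per-iteration validation mAP values. A minimal sketch (the helper and the results dict are illustrations, not part of darknet):

```python
def best_weights(map_by_iteration):
    """Return the iteration whose weights file scored the highest mAP.

    `map_by_iteration` maps iteration number -> validation mAP, as
    collected by running `darknet detector map` once per weights file.
    """
    return max(map_by_iteration.items(), key=lambda kv: kv[1])[0]
```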

which shows that the metrics doesn't go down like this plot:

Overfitting is rare for Yolo v3/v2; it happens only in a few specific cases.

yangulei commented 6 years ago

Thank you for your timely reply! Yes, I randomly split my ~8000 image samples into training and validation datasets with a ratio of 8:2, and validate the models on the validation dataset after training. It looks like I could reduce the number of training iterations to save some training time. Thank you again; I can proceed without worrying about overfitting now.
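For what it's worth, an 8:2 random split like the one described above takes only a few lines; a minimal sketch, assuming the dataset is a list of image paths (the function name and seed are mine):

```python
import random

def split_dataset(image_paths, val_fraction=0.2, seed=0):
    """Randomly split image paths into (train, val) lists.

    Shuffles a copy with a fixed seed so the split is reproducible,
    then carves off `val_fraction` of the samples for validation.
    """
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * val_fraction)
    return paths[n_val:], paths[:n_val]
```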

EscVM commented 6 years ago

@yangulei Looking at your learning strategy... what's the purpose of scales and steps? Thx

yangulei commented 6 years ago

@EscVM It's a learning rate (LR) schedule. The parameter "steps" lists the iterations at which the LR is adjusted, and the parameter "scales" lists the corresponding multipliers; see the answer on stackoverflow. In my personal opinion, the schedule aims to balance computation time against convergence accuracy. You can find more details about this in CS231n and the YOLOv1 paper.

EscVM commented 6 years ago

@yangulei Thank you. Gotcha!

Instead, I've tried your code, but I got this error:

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
     75     metrics = pd.read_hdf(h5_name, 'metrics')
     76 else:
---> 77     metrics = get_metrics()
     78
     79 metrics_select = metrics[["F1","IoU","mAP","precision","recall"]]

<ipython-input> in get_metrics()
     48     k, v = iterm.split("=")
     49     metrics_dict.update({k:v})
---> 50     mAPs.append(float(metrics_dict["mAP"]))
     51     precisions.append(float(metrics_dict["precision"]))
     52     recalls.append(float(metrics_dict["recall"]))

KeyError: 'mAP'
```
yangulei commented 6 years ago

@EscVM Sorry for replying so late. Did you change the darknet executable, the data path, the config path and the weights path to your own in the get_metrics() function? They are on lines 11 to 17 of the script.