Worse results after repository update

gilmartinspinheiro commented 4 years ago

Hi @AlexeyAB ,

I have two different repositories, one is older than the other (cannot specify how old, but prior to the yoloV4 update).

I was not getting acceptable results with yolov4, even though I was using the same dataset that was used for a past successful yolov3 train.

The next step I tried was to reproduce the successful training run from the old repository on the updated one. So, I copied the yolov3 config used in the past and trained with it in the new repository, keeping the same dataset. I could not achieve the same results as before.

Is there any difference beyond the config file and the dataset that could potentially cause this difference?

Thanks in advance

AlexeyAB commented 4 years ago

What version of Darknet do you use currently, what date? Show chart.png with Loss and mAP for all cases. What mAP did you get previously and now?

gilmartinspinheiro commented 4 years ago

Due to logistical and timing reasons, the training runs were done without validation. I will repeat both and try to provide you with full details in a few days, when the new runs are complete. Thank you for your time!

gilmartinspinheiro commented 4 years ago

Hi again @AlexeyAB,

Sorry for the late reply, but I have been conducting some tests on my custom dataset and they took some time.

So, basically I have tested with 3 different repositories:

The most recent version of darknet (let's call it Alexey_new)
An old repo version, at this link: https://github.com/AlexeyAB/darknet/tree/a7a2e1bb4b0efa55ac2af91358e8c8d2d20076a7 . This version did not have the draw mAP functionality, so I won't be able to provide you with that. (let's call it Alexey_old)
An old pjreddie repo, the one whose date I cannot specify, although it is surely close to pjreddie's latest commit. This version also does not have the draw mAP feature. (let's call it pjreddie_old)

All of the following results were obtained with the same train data and test data (1122 imgs). I will present the results for a default yolov3 config file on the 3 repos. I will also show results from a default yolov4-custom config. All training runs were done with an excessively large max batch size to reproduce the original training conditions, since the first run was done that way by accident.

For Alexey_new with yolov3:

The charts are the following (it was necessary to stop the training and restart later, so there are 2 charts): chart_1 chart_2

Running our testing script: FP = 72, FN = 98. Obs.: There is a significant number of low-confidence predictions.

For Alexey_old yolov3:

Running our testing script (weight 13900): FP = 36, FN = 68. Obs.: There is also a significant number of low-confidence predictions.

For pjreddie_old yolov3:

Running our testing script (weight 14000): FP = 27, FN = 37. Obs.: There are fewer low-confidence predictions. This is the best result obtained.

For Alexey_new yolov4:

The charts are the following: chart_yolov4-custom

Running our testing script (weight 13000): FP = 6971, FN = 7119.

gilmartinspinheiro commented 4 years ago

yolov4.txt

gilmartinspinheiro commented 4 years ago

yolov3.txt

AlexeyAB commented 4 years ago

For Alexey_new with yolov3: For Alexey_old yolov3: For pjreddie_old yolov3: For Alexey_new yolov4:

Did you train on New repo for all 4 cases, and only tested on 3 different repos?
How many test images do you have?
Run training with the flag -show_imgs: do you see correct bboxes? Can you show 1-2 examples?
Show content of obj.data file
It looks like you trained Yolov3 with valid=train.txt, while trained Yolov4 with valid=test.txt in obj.data file
Show anchors and cloud of points by using command: ./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 576 -height 576 -show
I don't know anything about your testing script, but this is strange that AP50 is lower for v4 than for v3.
There was some issue with mosaic=1 from 1 Jun 2020 to 7 Jun 2020, so if you used this version, try to download the latest Darknet version and train yolov4 again
Also try to set subdivisions=32 or better 16 in cfg-file, and show chart.png

screenshots with such information

./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/dog.jpg
CUDA-version: 10000 (10000), cuDNN: 7.4.2, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.2.0
0 : compute_capability = 750, cudnn_half = 1, GPU: GeForce RTX 2070
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
layer   filters  size/strd(dil)      input                output
0 conv     32       3 x 3/ 1    608 x 608 x   3 ->  608 x 608 x  32 0.639 BF

gilmartinspinheiro commented 4 years ago

Did you train on New repo for all 4 cases, and only tested on 3 different repos?

No. What I basically did was using the same config across different repos and obtained different results in each one of them. Obviously, when using yolov4, the only repo I could use was the more recent one.

How many test images do you have?

1122 testing images.

Run training with the flag -show_imgs: do you see correct bboxes? Can you show 1-2 examples?

Sorry, but unfortunately I am not allowed to share any images from my dataset. I have already run the training code with -show_imgs and checked the bboxes.

Show content of obj.data file & It looks like you trained Yolov3 with valid=train.txt, while trained Yolov4 with valid=test.txt in obj.data file

That is not the case. The valid and train txt's are correct. obj.data:
classes = 12
train = weights/train_name/train.txt
validation = weights/train_name/val.txt
names = weights/train_name/obj.names
backup = weights/train_name/backup/

I actually noticed that I had an error in the .data file: I wrote "validation=" instead of "valid=", so the validation dataset was defaulting to the training dataset. That said, the error existed in both cases, for yolov3 and v4. This can be seen in the last image of this comment.
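A misspelled key like this is easy to miss because nothing complains about it. As a rough sketch (a hypothetical helper, not part of Darknet), the .data file can be checked against a list of expected keys. The key list below is an assumption covering the commonly used keys and may need extending for your setup.

```python
# Hypothetical helper, not part of Darknet: flag keys in a .data file
# that are not in an assumed list of commonly used keys.
EXPECTED_KEYS = {"classes", "train", "valid", "names", "backup", "eval"}  # assumption


def parse_data_file(text):
    """Return a dict of key -> value from 'key = value' lines."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def unexpected_keys(options):
    """Return the keys that are not in EXPECTED_KEYS, sorted."""
    return sorted(k for k in options if k not in EXPECTED_KEYS)


# The obj.data content posted above, including the misspelled key.
sample = """classes = 12
train = weights/train_name/train.txt
validation = weights/train_name/val.txt
names = weights/train_name/obj.names
backup = weights/train_name/backup/
"""

opts = parse_data_file(sample)
print(unexpected_keys(opts))  # ['validation']
print("valid" in opts)        # False
```

Running a check like this before training would have shown that no valid= entry was set.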

Show anchors and cloud of points by using command: ./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 576 -height 576 -show

With the most recent repo: calc_anch

I also had around 40 wrong labels (negative values, they are discarded right?) in 25k training images, which I guess is not particularly problematic (?). calc_anch0
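Rather than relying on how the loader treats bad labels, they can be found and fixed before training. The sketch below is a hypothetical checker that assumes the usual Darknet label format (one object per line: class index, then x_center, y_center, width, height, all normalized to the image size). The exact validity rules are my assumption.

```python
# Hypothetical label checker; the validity rules are assumptions based on
# the normalized "class x_center y_center width height" label format.
def check_label_text(text, num_classes):
    """Return a list of (line_number, reason) for suspicious label lines."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            problems.append((line_no, "expected 5 fields"))
            continue
        try:
            class_id = int(parts[0])
            coords = [float(p) for p in parts[1:]]
        except ValueError:
            problems.append((line_no, "non-numeric field"))
            continue
        if not 0 <= class_id < num_classes:
            problems.append((line_no, "class id out of range"))
        elif any(c < 0.0 or c > 1.0 for c in coords):
            problems.append((line_no, "coordinate outside [0, 1]"))
    return problems


# Made-up label lines for illustration only.
sample = "0 0.5 0.5 0.2 0.3\n3 -0.1 0.4 0.2 0.2\n12 0.5 0.5 0.1 0.1\n"
print(check_label_text(sample, num_classes=12))
# [(2, 'coordinate outside [0, 1]'), (3, 'class id out of range')]
```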

With the alexeyAB_old repo: clusters_screenshot_08 06 2020

I don't know anything about your testing script, but this is strange that AP50 is lower for v4 than for v3.

It is not only the script, the lower performance can also be seen by inspecting the yolov4 and v3 predictions.

There was some issue with mosaic=1 from 1 Jun 2020 to 7 Jun 2020, so if you used this version, try to download the latest Darknet version and train yolov4 again

So I guess that might actually be the reason for the poor yolov4 results (?)

Also try to set subdivisions=32 or better 16 in cfg-file, and show chart.png

I cannot increase the batch size due to memory limitations. Since yolov4 occupies more memory, I had to reduce the batch size.
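For reference, the memory trade-off can be put in numbers. This is a minimal sketch assuming Darknet loads batch / subdivisions images onto the GPU per step, and assuming batch=64 in the [net] section. The function name is made up for illustration.

```python
# Sketch only: assumes the GPU processes batch / subdivisions images per step.
def mini_batch(batch, subdivisions):
    """Images processed at once for a given batch and subdivisions setting."""
    if batch % subdivisions != 0:
        raise ValueError("batch should be divisible by subdivisions")
    return batch // subdivisions


for subdivisions in (16, 32, 64):
    print(subdivisions, mini_batch(64, subdivisions))
# 16 4
# 32 2
# 64 1
```

Under that assumption, a lower subdivisions value means more images in memory at once, which is why subdivisions=16 may not fit where a higher value does.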

gilmartinspinheiro commented 4 years ago

I found another error in the train file, it had half the lines deleted. I will train Yolov4 again with the correct train file, alongside your suggestions and report the result back to you!

Thank you for your help!

AlexeyAB commented 4 years ago

Yes, it looks like you are using different datasets for training/validation for v3 and v4.

Train and test both v3 and v4 on the same dataset with the same command.

gilmartinspinheiro commented 4 years ago

Hello again @AlexeyAB ,

So, I ran new trainings with the most recent repository (as of 7 days ago, so after the mosaic fix, I guess).

The results are still far better for yolo V3 than yolo v4.
Also, the yolo V3 results are now closer to those of the old repo (which I called pjreddie_old), although they are still worse, and the config was now changed for optimal performance. That is, I used more data augmentation, trained anchors and also a smaller max iteration number.
The cfg's will be attached for you to check them, as well as the charts.
How can I train anchors for yolov4? Are the process and the precautions the same as before?
yolo V4 Chart:

For my validation script results: FP = 701, FN = 1871

yolo-v4.txt

yolo v3 Chart

For my validation script results: FP = 19, FN = 54

yolo-v3.txt

AlexeyAB commented 4 years ago

The problem is that you are training models with different parameters.

Is this accuracy on training or validation dataset?
Train both models with the same anchors, and the same random=0 param, and subdivisions=32 for yolov4.cfg
Show tp, fp, fn for both models using ./darknet detector map ... command for both training and validation dataset
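To see which of these parameters differ between two cfg files, something like the following sketch could be used. It is a hypothetical helper, not a Darknet tool. It only compares a hand-picked set of keys (batch, subdivisions, width, height in [net]; anchors and random in [yolo]), and the sample cfg fragments are made up for illustration.

```python
# Hypothetical cfg comparison helper. Darknet cfg files repeat section names,
# so sections are kept as a list rather than parsed with configparser.
def parse_cfg(text):
    """Return a list of (section_name, params) in file order."""
    sections = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1], {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.replace(" ", "")
    return sections


def key_settings(text):
    """Collect the parameters chosen for comparison."""
    settings = {}
    for name, params in parse_cfg(text):
        if name == "net":
            for key in ("batch", "subdivisions", "width", "height"):
                if key in params:
                    settings[key] = params[key]
        elif name == "yolo":
            for key in ("anchors", "random"):
                if key in params:
                    settings.setdefault(key, set()).add(params[key])
    return settings


def differing_keys(cfg_a, cfg_b):
    """Return the compared keys whose values differ between two cfg texts."""
    a, b = key_settings(cfg_a), key_settings(cfg_b)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# Made-up fragments; real cfg files are much longer.
cfg_a = "[net]\nbatch=64\nsubdivisions=16\nwidth=576\nheight=576\n[yolo]\nmask = 0,1,2\nanchors = 10,13, 16,30\nrandom=1\n"
cfg_b = "[net]\nbatch=64\nsubdivisions=32\nwidth=576\nheight=576\n[yolo]\nmask = 0,1,2\nanchors = 10,13, 16,30\nrandom=0\n"
print(differing_keys(cfg_a, cfg_b))  # ['random', 'subdivisions']
```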

gilmartinspinheiro commented 4 years ago

Is this accuracy on training or validation dataset?

Validation dataset.

Train both models with the same anchors, and the same random=0 param, and subdivisions=32 for yolov4.cfg

How can I use the same anchors? Are they not set up differently in v4 and v3? At least, mask order is different. Can I copy the masks and anchors directly from v3 to v4 without a problem?

AlexeyAB commented 4 years ago

Don't change masks. Use these anchors: anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401
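One way to read this advice: the anchors line is a flat list of width, height pairs, and each [yolo] layer's mask picks pairs from it by index. The sketch below illustrates that reading. The function names are hypothetical and the two masks are example index sets, so check the mask lines in your own cfg files.

```python
# Sketch of how mask indices select (width, height) pairs from an anchors line.
ANCHORS = "12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401"


def anchor_pairs(anchors):
    """Split a flat 'w, h, w, h, ...' string into (w, h) tuples."""
    values = [int(v) for v in anchors.split(",")]
    return list(zip(values[0::2], values[1::2]))


def select_anchors(anchors, mask):
    """Return the anchor pairs that a given mask would select."""
    pairs = anchor_pairs(anchors)
    return [pairs[i] for i in mask]


print(select_anchors(ANCHORS, [0, 1, 2]))  # [(12, 16), (19, 36), (40, 28)]
print(select_anchors(ANCHORS, [6, 7, 8]))  # [(142, 110), (192, 243), (459, 401)]
```

Because the masks only refer to positions in the list, the same anchors line can be pasted into every [yolo] layer of either cfg while each file keeps its own mask order.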

gilmartinspinheiro commented 4 years ago

Should I use those anchors on v3 also? They are the default on V4, but they are different from V3. Sorry, but I did not understand what you want me to do regarding the config files.

AlexeyAB commented 4 years ago

The anchors in your yolov3.txt file are different from the default anchors from yolov3.cfg

gilmartinspinheiro commented 4 years ago

@AlexeyAB I am still confused. So, to sum up: You are suggesting I should train yolo v3 and v4 with all default parameters for each one? Including default anchors for each one?

Or are you suggesting that I use yolov4 anchors (not changing the masks) on yolov3?

AlexeyAB / darknet

Worse results after repository update #5823