AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

There seems to be a bug in the new version of darknet, can AlexeyAB come and have a look? #7028

Open 1027663760 opened 3 years ago

1027663760 commented 3 years ago

Training on my data with the new version + CUDA 11.1 behaves abnormally.

Let me first describe the test results. When compiling with the old version + CUDA 10.2, the loss dropped to about 0.2 after 500 iterations, which was in line with expectations. When compiling with the new version + CUDA 11.1, the loss stalled at about 0.4 after 1000 iterations and did not drop, and values such as IoU and F1 did not meet expectations.

I then guessed that there might be a bug in CUDA 11.1, so I ran another test: the old version compiled with CUDA 11.1 also meets expectations.

The cfg file and dataset used in all of the above tests are the same, so I came to the conclusion that the new version of darknet has a bug.
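
For reference, a minimal sketch of the kind of training command used for such a comparison; the data file, cfg and pretrained weights names here are placeholders, not the reporter's actual paths:

```
# train a custom yolov4-tiny model and plot mAP during training
./darknet detector train data/obj.data cfg/yolov4-tiny.cfg yolov4-tiny.conv.29 -map
```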

@AlexeyAB

1027663760 commented 3 years ago

New version used: https://github.com/AlexeyAB/darknet/tree/181967937ddfdd86c3f2c1259be45cc66880225e

Old version used: https://github.com/AlexeyAB/darknet/tree/be906dfa0e1d24f5ba61963d16dd0dd00b32f317

AlexeyAB commented 3 years ago

Can you try to compare mAP instead of Loss? What cfg-file do you use?
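
For reference, mAP can be checked with darknet's built-in map command; a sketch, assuming the usual file layout (paths are placeholders):

```
# compute mAP of a trained checkpoint on the valid= list from obj.data
./darknet detector map data/obj.data cfg/yolov4-tiny.cfg backup/yolov4-tiny_best.weights
```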

1027663760 commented 3 years ago

Can you try to compare mAP instead of Loss? What cfg-file do you use?

With yolov4-tiny_3l.cfg and yolov4-tiny.cfg, the mAP does not meet expectations.

AlexeyAB commented 3 years ago

With yolov4-tiny_3l.cfg and yolov4-tiny.cfg, the mAP does not meet expectations.

Show values.

1027663760 commented 3 years ago

With yolov4-tiny_3l.cfg and yolov4-tiny.cfg, the mAP does not meet expectations.

Show values.

I need to re-train and look at the detailed values.

Apart from the above test, I only remember that when train.txt = valid.txt, the detection count reported for valid.txt was only about half of that for train.txt.
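
For context, "train.txt = valid.txt" refers to pointing both lists at the same file in the .data file; a sketch of such a setup (the class count and file names are placeholders):

```
# obj.data: validate on the same image list used for training
classes = 3
train  = data/train.txt
valid  = data/train.txt
names  = data/obj.names
backup = backup/
```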

aleproietti commented 3 years ago

I think I'm having the same problem. I started training with this version:

https://github.com/AlexeyAB/darknet/tree/be906dfa0e1d24f5ba61963d16dd0dd00b32f317

and training was normal (except that the mAP calculation wasn't synchronized with every 100 iterations).

Then continued with this version

https://github.com/AlexeyAB/darknet/tree/333cc14a06ca903afa6bbcb67bd2d88222f75cc7

and best_weights saving and mAP calculation happened very often (only a few minutes apart), but training took too long (approximately 2-3 hours for 100 iterations) and the loss did not decrease.

Then tried with this version:

https://github.com/AlexeyAB/darknet/tree/181967937ddfdd86c3f2c1259be45cc66880225e

and the best_weights problem was solved, but the training time and loss issues remained.

This is the loss chart with the first version:

chart (0)

This is the chart when I tried to continue the training (I ran 200 iterations and the loss did not decrease):

chart (1)

AlexeyAB commented 3 years ago

@aleproietti

and the best_weights problem was solved, but the training time and loss issues remained.

What CPU do you use? What cfg-file do you use? Did you compile with OpenCV?

aleproietti commented 3 years ago

@AlexeyAB

I'm training on Google Colab, so with OpenCV and Colab's GPU.

cfg is this one (yolov4):

yolov4-obj.txt
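
For reference, a typical Colab build enables GPU, cuDNN and OpenCV in the Makefile before running make; the exact build steps aren't shown in this thread, so this is only a sketch:

```
# enable GPU, cuDNN and OpenCV support in the Makefile, then build
sed -i 's/GPU=0/GPU=1/' Makefile
sed -i 's/CUDNN=0/CUDNN=1/' Makefile
sed -i 's/OPENCV=0/OPENCV=1/' Makefile
make
```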

1027663760 commented 3 years ago

With yolov4-tiny_3l.cfg and yolov4-tiny.cfg, the mAP does not meet expectations.

Show values.

After several hours of training and testing, I found that although the loss in the new version is stuck at 0.4-0.5 and does not decrease, this does not seem to affect the training results.

@AlexeyAB

chart

AlexeyAB commented 3 years ago

@1027663760 So this does not affect the training process. It affects only the loss display.

1027663760 commented 3 years ago

@1027663760 So this does not affect the training process. It affects only the loss display.

According to my test results, the new version will only affect the display of loss, and will not make the training results worse.

AlexeyAB commented 3 years ago

@aleproietti

and the best_weights problem was solved, but the training time and loss issues remained.

I'm using Google Colab to train; this way, with OPENCV and Colab's GPU

The latest Darknet version optimizes pre/post-processing for multi-core CPUs (8-64 cores), while Google Colab provides only 2 logical CPU cores. Maybe this is the reason.
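
For reference, the number of logical cores available on a Colab VM can be checked with either of the following (run in a notebook cell prefixed with !):

```
# logical CPU cores visible to the VM
nproc
# or
grep -c ^processor /proc/cpuinfo
```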

aleproietti commented 3 years ago

Ok, noted.

Thank you @AlexeyAB

dek8v5 commented 3 years ago

I have been training on my own dataset with 3 classes; the average loss always starts around 4600 and plateaus at 30-40. Is that not normal? I thought every dataset has its own training loss and mAP values.

chart_yolo-obj

AlexeyAB commented 3 years ago

I have been training on my own dataset with 3 classes; the average loss always starts around 4600 and plateaus at 30-40. Is that not normal?

This is normal.

I thought every dataset has its own training loss and mAP values.

Yes. Every dataset and every model has its own training loss and mAP values.

abdulghani91 commented 3 years ago

Where can I find the link to the latest version of darknet, please?

AlexeyAB commented 3 years ago

@abdulghani91

abdulghani91 commented 3 years ago

@AlexeyAB I use the same version, but it does not save a weights file every 1000 iterations, so I made this change on line 385 of the detector file: if ((iteration >= (iter_save + 1000) || iteration % 1000 == 0) || (iteration >= (iter_save + 1000) || iteration % 1000 == 0) && net.max_batches < 1000)

The original was: if ((iteration >= (iter_save + 10000) || iteration % 10000 == 0) || (iteration >= (iter_save + 1000) || iteration % 1000 == 0) && net.max_batches < 10000)

Is the change right? According to AlexeyAB/darknet@b5ff7f4 it should save every 1000 iterations, but when I tried it, it did not save unless I made the change. Is it OK to train with this modified condition, or is it wrong? I'm training with YOLOv4-CSP.
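
For readability, here are the two conditions side by side, with my reading of what they do (this interpretation is not confirmed by the maintainer):

```c
/* Original condition in detector.c: a numbered .weights file is written every
   10000 iterations, or every 1000 iterations only when max_batches < 10000. */
if ((iteration >= (iter_save + 10000) || iteration % 10000 == 0) ||
    (iteration >= (iter_save + 1000)  || iteration % 1000  == 0) && net.max_batches < 10000)

/* Modified condition from the comment above: the first clause now also uses
   1000, so a numbered .weights file is written every 1000 iterations
   regardless of max_batches. */
if ((iteration >= (iter_save + 1000) || iteration % 1000 == 0) ||
    (iteration >= (iter_save + 1000) || iteration % 1000 == 0) && net.max_batches < 1000)
```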

AlexeyAB commented 3 years ago

abdulghani91 commented 3 years ago

@AlexeyAB ok, thank you