Train Darknet yolov4 voc-based

lorenzobattelli commented 2 years ago

Dear all, I'm following the tutorial for object detection with a VOC-based YOLOv4 Darknet model (https://github.com/Xilinx/Vitis-AI-Tutorials/tree/master/Design_Tutorials/07-yolov4-tutorial#31-darknet-model-training-on-voc) and trying to train the network, but this time on the GTSDB dataset (German traffic signs), with the command:

./darknet detector train cfg/voc.data cfg/yolov4.cfg /yolov4.weights -map -dont_show -show_imgs

Of course I edited "voc.data" so that it points to the right GTSDB files; I just forgot to rename the file. I also edited the cfg files as requested. I'm working on an Ubuntu VM (god..). I need some hints and answers about the training process:

1) After running the train command, should I stop it manually (Ctrl-C) once I see that the training has converged properly, or not?

2) Does training convergence in this case mean that the loss (or mAP?) stops decreasing? I used the -map parameter, but I honestly don't understand where that information appears. This is a piece of the output:

v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 133 Avg (IOU: 0.419995), count: 34, class_loss = 3764.068115, iou_loss = 28.210449, total_loss = 3792.278564 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 144 Avg (IOU: 0.253230), count: 5, class_loss = 1026.847412, iou_loss = 0.271484, total_loss = 1027.118896 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 155 Avg (IOU: 0.000000), count: 1, class_loss = 268.681885, iou_loss = 0.000000, total_loss = 268.681885 
 total_bbox = 39, rewritten_bbox = 0.000000 % 

500504: 1169.513062, 1602.209351 avg loss, 0.000013 rate, 15818.413367 seconds, 32032256 images, 2120438.711756 hours left

3) MY MAIN PROBLEM: how can I save the weights during training? 3.1) And where can I find those .weights files? In the voc.data file I specified "backup = ./backup". I let the process run all night, but I still can't see any weights file saved during training. Is it maybe just a matter of time?
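A quick way to check whether any checkpoint has been written so far is to read the `backup` entry from the .data file and list the `.weights` files in that folder. The following is only a minimal sketch: the helper names `read_data_file` and `list_weight_files` are made up for illustration, and the path `cfg/voc.data` is taken from the command in this post. It also assumes that a relative `backup` path is resolved against the directory from which `./darknet` is started, which is worth verifying on your setup.

```python
from pathlib import Path


def read_data_file(text):
    """Parse the 'key = value' lines of a Darknet .data file into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def list_weight_files(backup_dir):
    """Return the .weights files in backup_dir, oldest first (empty if none)."""
    folder = Path(backup_dir)
    if not folder.is_dir():
        return []
    return sorted(folder.glob("*.weights"), key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    data_path = Path("cfg/voc.data")  # path used in the training command
    if data_path.is_file():
        options = read_data_file(data_path.read_text())
        backup_dir = options.get("backup", "backup")
        files = list_weight_files(backup_dir)
        print(f"backup folder: {backup_dir}, {len(files)} .weights file(s)")
        for path in files:
            print(" ", path.name)
    else:
        print(f"{data_path} not found; run this from the darknet folder")
```

If the script reports that the folder does not exist, that alone may explain why no file shows up.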

4) In the output, which number is the current iteration? 4.1) Is 1 iteration == 1 epoch?
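For reference, the progress line in the output above can be split into its fields. As far as I understand Darknet's log format, the number before the colon is the iteration count, followed by the current loss, the running average loss, the learning rate, the time spent on that iteration, and the number of images seen so far. The sketch below relies on that reading; the regular expression and the function name `parse_progress` are illustrative and not part of Darknet.

```python
import re

# Matches lines such as:
# "500504: 1169.513062, 1602.209351 avg loss, 0.000013 rate, ... seconds, ... images, ..."
PROGRESS_RE = re.compile(
    r"^\s*(?P<iteration>\d+):\s*(?P<loss>[\d.]+),\s*(?P<avg_loss>[\d.]+) avg loss,\s*"
    r"(?P<rate>[\d.]+) rate,\s*(?P<seconds>[\d.]+) seconds,\s*(?P<images>\d+) images"
)


def parse_progress(line):
    """Return the fields of a Darknet progress line as a dict, or None if no match."""
    match = PROGRESS_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "iteration": int(fields["iteration"]),
        "loss": float(fields["loss"]),
        "avg_loss": float(fields["avg_loss"]),
        "rate": float(fields["rate"]),
        "seconds": float(fields["seconds"]),
        "images": int(fields["images"]),
    }


line = ("500504: 1169.513062, 1602.209351 avg loss, 0.000013 rate, "
        "15818.413367 seconds, 32032256 images, 2120438.711756 hours left")
info = parse_progress(line)
print(info["iteration"])                    # 500504
print(info["images"] // info["iteration"])  # 64 images per iteration
```

The ratio of images to iterations gives the number of images processed per iteration, so one iteration is one batch rather than one epoch.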

Thank you for your time

code-locker commented 2 years ago

Hi @lorenzobattelli, I will try my best to answer the questions above.

1) Do not stop the training procedure when the loss stops decreasing. If you interrupt the training process with Ctrl+C, you may later run into an out-of-memory error, because interrupting the process can fail to free the allocated memory.
2) I used the mAP option once the training phase was completed. Once the trained model has been generated, prepare the test set along with its annotation files and run the mAP calculation, which reports the overlap between the annotated regions of your input images and the regions detected by the model.
3) Weight files are generated at the end of the training phase and are placed in the backup folder you specified. It all depends on the parameters the user sets in the config.
4) An epoch is one full pass over the training set, made up of several iterations of one batch each. For example, with 1000 training images and a batch size of 100, one epoch takes 10 iterations of 100 images each. If the number of epochs is set to 4, training runs for 4*10 = 40 iterations.
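The epoch arithmetic from the example above can be written out as a small sketch. The function names are made up for illustration; note also that a Darknet cfg file sets the number of iterations directly (`max_batches`) rather than a number of epochs, so this conversion is only a way to reason about the numbers.

```python
import math


def iterations_per_epoch(num_images, batch_size):
    """Iterations needed to see every training image once (last batch may be partial)."""
    return math.ceil(num_images / batch_size)


def total_iterations(num_images, batch_size, epochs):
    """Iterations needed to run the given number of epochs."""
    return epochs * iterations_per_epoch(num_images, batch_size)


print(iterations_per_epoch(1000, 100))  # 10
print(total_iterations(1000, 100, 4))   # 40
```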

I hope it is clear to you. Thanks

bhargavin1872008 commented 1 year ago

When installing the requirements.txt of keras-yolov3-modelset, I'm getting an error for coremltools. It says something like "couldn't find a version that satisfies the requirement tensorflow<=1.14 and tensorflow >=1.5 (from tfcoremltools -r requirements.txt) (from versions: 2.2.0, 2.2.1, 2.2.2, ... 2.7.0rc0, 2.7.0rc1 ...)". Can someone help me with this? Also, I have a question: can we use Ubuntu 20.04, CUDA 11.7, and cuDNN 8.4.0 for this project, or does it only work with Ubuntu 18.04 and CUDA 10.0? Please help me with this; I have little time left.

bhargavin1872008 commented 1 year ago

When installing the requirements.txt of keras-yolov3-modelset, I'm getting an error for coremltools. It says something like "couldn't find a version that satisfies the requirement tensorflow<=1.14 and tensorflow >=1.5 (from tfcoremltools -r requirements.txt) (from versions: 2.2.0, 2.2.1, 2.2.2, ... 2.7.0rc0, 2.7.0rc1 ...)". So which version of coremltools is recommended for the project, or how did you resolve the above error? Please help; your suggestion is of the utmost importance.

code-locker commented 1 year ago

Hi @bhargavin1872008, which version of TensorFlow are you using? Please install TensorFlow 1.5 or greater.

bhargavin1872008 commented 1 year ago

What about coremltools?

code-locker commented 1 year ago

I don't think the issue comes from coremltools. Please check which TensorFlow version is installed. If you are still facing the issue, attach a screenshot so we can understand it better.

Xilinx / Vitis-AI-Tutorials

Train Darknet yolov4 voc-based #32