Open user20421 opened 3 days ago
For the first question, GPU memory usage is related to the number of candidate proposals. You can set a maximum limit as a hyper-parameter here (300 should be fine for 11 GB of GPU memory): https://github.com/YuHengsss/YOLOV/blob/5f069b29e201c4e099c7c3827cc6c63823ce3141/exps/yolov%2B%2B/v%2B%2B_base_decoupleReg_2x.py#L44
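As a rough illustration only, a cap like this would live in a YOLOX-style `Exp` config; the attribute name `max_proposals` below is a placeholder, not the actual name used in the linked file:

```python
# Hypothetical sketch of capping the number of candidate proposals in a
# YOLOX/YOLOV-style Exp config. `max_proposals` is a placeholder name.
from yolox.exp import Exp as BaseExp


class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        # Fewer proposals -> less GPU memory in the feature-aggregation step.
        # 300 should fit in roughly 11 GB of GPU memory.
        self.max_proposals = 300
```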
For the second question, due to limited computational resources and my limited coding experience when I started working on Video Object Detection, multi-GPU training is not well supported.
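Until that is fixed, a simple workaround is to pin the process to a single GPU before anything touches CUDA, so the distributed code path is never exercised; a minimal sketch:

```python
# Restrict training to one GPU. Set the env var before importing torch
# (or before the training script initializes CUDA).
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # pick whichever GPU you prefer

import torch

print(torch.cuda.device_count())  # -> 1
```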
For the third question, some losses belonging to the base detector are not optimized (see here). In my experience on the ImageNet VID dataset, the best performance is reached within three or four epochs and degrades afterwards.
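In practice this means evaluating every epoch and keeping the best checkpoint rather than the last one. A minimal sketch of that selection logic, where `evaluate_checkpoint` is a placeholder for whatever evaluation entry point you use:

```python
# Evaluate each saved checkpoint on the VID validation set and keep the one
# with the highest AP, since accuracy peaks around epoch 3-4 then degrades.
from typing import Callable, Dict


def pick_best_epoch(ckpts: Dict[int, str],
                    evaluate_checkpoint: Callable[[str], float]) -> int:
    """Return the epoch whose checkpoint gives the highest validation AP."""
    scores = {epoch: evaluate_checkpoint(path) for epoch, path in ckpts.items()}
    return max(scores, key=scores.get)
```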
Thank you very much for your suggestions and for the outstanding work you have provided.
Thank you very much for your excellent work. I would like to build on it and try some new ideas, but I am currently running into a few issues. Could you please offer some suggestions or solutions? I would be very grateful for your help.

1. When I try to reproduce the YOLOv-S and YOLOv++-S models on a single 3080 Ti GPU following your configuration, I sometimes hit a "CUDA out of memory" error. With nvidia-smi I noticed that GPU memory usage fluctuates significantly, which is uncommon in typical deep learning workloads. Could this behavior be related to the multi-scale training mode or the EMA training mode? (A sketch of what I suspect is happening follows this list.)
2. Related to the first issue, when I attempt multi-GPU training, the program hangs and eventually throws a timeout error. Is there a way to enable single-machine multi-GPU training?
3. In your YOLOv-S experiments the maximum epoch is set to 7, and the loss is still high near the end of training. I tried increasing the maximum epoch to 14, but the loss remains large. Does a high loss significantly impact the final results? Should I train for more epochs or reduce the learning rate to improve performance?
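For point 1, my understanding of why multi-scale training could cause the fluctuation is roughly the following; this is only an illustration with a toy model, not the repo's actual code:

```python
# Sketch: with multi-scale training the input resolution is re-sampled every
# few iterations, and activation memory grows roughly with the number of
# input pixels, so allocated GPU memory jumps up and down between steps.
import random

import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3, padding=1).cuda()

for it in range(5):
    # YOLOX-style multi-scale: pick a random size (multiple of 32) per step.
    size = random.choice(range(512, 801, 32))
    x = torch.randn(4, 3, size, size, device="cuda")
    loss = model(x).mean()
    loss.backward()
    model.zero_grad()
    print(f"iter {it}: input {size}x{size}, "
          f"allocated {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```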