Loss spike when training is restarted with multi-GPU

AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )

Other

21.75k stars 7.96k forks source link

Hi @AlexeyAB ,

I've been experimenting with multi-GPU training, following the steps that you had suggested:

https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu

And I notice that when I stop the initial training with the single GPU (lets say after 1000 or 2000 steps) and restart the training with multiple or even with a single GPU, the loss starts from a higher number than from when the initial training was stopped(when saving at 1000 or 2000 steps).

Could you provide an explanation for this? Is there any way I could prevent this? I have even encountered a case when when I restarted training in a multi-GPU machine the loss spike up so much that I ended up getting nan loss and avg loss.

Thanks

AlexeyAB / darknet

Loss spike when training is restarted with multi-GPU #3365