AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Loss spike when training is restarted with multi-GPU #3365

Open ShantanuNathan opened 5 years ago

ShantanuNathan commented 5 years ago

Hi @AlexeyAB ,

I've been experimenting with multi-GPU training, following the steps you suggested here:

https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
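
Concretely, the two-stage procedure I follow looks roughly like this (data/obj.data, yolo-obj.cfg and the pretrained weights file are placeholders for my own setup):

```
# 1) train on a single GPU for the first ~1000 iterations
./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74

# 2) stop, then resume from the saved checkpoint with several GPUs
./darknet detector train data/obj.data yolo-obj.cfg backup/yolo-obj_1000.weights -gpus 0,1,2,3
```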

I notice that when I stop the initial single-GPU training (let's say after 1000 or 2000 steps) and restart it, with multiple GPUs or even with a single GPU, the loss starts from a higher value than it had when the initial training was stopped (i.e. at the 1000- or 2000-step checkpoint).

Could you provide an explanation for this? Is there any way I could prevent this? I have even encountered a case where, when I restarted training on a multi-GPU machine, the loss spiked so much that I ended up with nan for both the loss and the avg loss.

Thanks

AlexeyAB commented 5 years ago

@ShantanuNathan Hi,

> Could you provide an explanation for this? Is there any way I could prevent this?

This is normal. It is due to the momentum= parameter in the cfg-file: the optimizer keeps a running average of the gradients, and this running average is not stored in the .weights file, so it is rebuilt from scratch when training is restarted. See https://github.com/AlexeyAB/darknet/issues/1943. Usually you shouldn't try to prevent this.
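
Schematically, the update with momentum works like this (just a simplified sketch of SGD with momentum, not the actual darknet code):

```c
#include <stdio.h>

#define N 3  /* number of parameters in this toy example */

/* One SGD-with-momentum step: v is the running average of past gradients.
 * v exists only in memory during training, so after a restart it begins
 * again from zero and the first updates differ from what they would have
 * been, which shows up as a temporary increase of the loss. */
void sgd_momentum_step(float *w, float *v, const float *grad,
                       float lr, float momentum, int n) {
    for (int i = 0; i < n; ++i) {
        v[i] = momentum * v[i] - lr * grad[i];  /* update the running average */
        w[i] += v[i];                           /* apply it to the weights */
    }
}

int main(void) {
    float w[N]    = { 0.10f, -0.20f, 0.30f };  /* weights (these are saved) */
    float v[N]    = { 0 };                     /* momentum buffer (not saved) */
    float grad[N] = { 0.05f,  0.01f, -0.02f }; /* gradients from one batch */

    /* learning_rate= and momentum= correspond to the cfg-file parameters */
    sgd_momentum_step(w, v, grad, 0.001f, 0.9f, N);
    printf("w[0] after one step: %f\n", w[0]);
    return 0;
}
```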

> I have even encountered a case where, when I restarted training on a multi-GPU machine, the loss spiked so much that I ended up with nan for both the loss and the avg loss.

This is strange. What model do you use?
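
Also, if Nan appears only after switching to several GPUs, try scaling the cfg parameters as described in the multi-GPU section of the README: divide your learning_rate by the number of GPUs and multiply burn_in by the number of GPUs. For example, for 4 GPUs (the numbers below are only illustrative):

```
# example for 4 GPUs:
# learning_rate = single-GPU value divided by 4
# burn_in = single-GPU value (1000) multiplied by 4
learning_rate=0.00065
burn_in=4000
```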