ShantanuNathan opened this issue 5 years ago
@ShantanuNathan Hi,
> Could you provide an explanation for this? Is there any way I could prevent this?
This is normal. Because of the momentum= setting in the cfg-file, the optimizer uses a running average of the gradients: https://github.com/AlexeyAB/darknet/issues/1943
Usually you shouldn't prevent this.
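To make the mechanism concrete, here is a minimal sketch of SGD with momentum (an illustration only, not darknet's actual code). It assumes that the momentum/velocity buffer is not stored alongside the saved weights, so after a restart it begins from zero and the gradient history has to build up again, which makes the first updates differ from an uninterrupted run:

```python
# Minimal sketch of SGD with momentum -- illustration only, not darknet's code.
# Assumption: the velocity buffer v is NOT saved with the weights, so a restart
# begins with v = 0 and the accumulated gradient history is lost.
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.001, momentum=0.9):
    """One update: v is an exponentially-decaying accumulation of past gradients."""
    v = momentum * v - lr * grad
    return w + v, v

rng = np.random.default_rng(0)
w, v = np.zeros(10), np.zeros(10)

# "Train" for 1000 steps, letting the momentum buffer build up.
for _ in range(1000):
    w, v = sgd_momentum_step(w, v, rng.normal(size=10))

# Restart: the weights are restored, but the velocity buffer is reset to zero.
grad = rng.normal(size=10)
w_continued, _ = sgd_momentum_step(w, v, grad)             # uninterrupted run
w_restarted, _ = sgd_momentum_step(w, np.zeros(10), grad)  # run after a restart
print("difference between the two updates:", np.linalg.norm(w_continued - w_restarted))
```

The two updates differ even though the weights and the gradient are identical, purely because the momentum state was lost; that kind of difference is what can show up as a temporary bump in the loss right after restarting.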
> I have even encountered a case where, when I restarted training on a multi-GPU machine, the loss spiked so much that I ended up getting nan for both the loss and the avg loss.
This is strange. What model do you use?
Hi @AlexeyAB,
I've been experimenting with multi-GPU training, following the steps you suggested:
https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
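(For reference, my understanding of that workflow is roughly the two commands below; the data/cfg/weights file names are just placeholders for my setup.)

```
# 1) Train on a single GPU for the first ~1000 iterations
./darknet detector train data/obj.data cfg/yolo-obj.cfg darknet53.conv.74 -gpus 0

# 2) Stop, then resume from the partially trained weights on several GPUs
./darknet detector train data/obj.data cfg/yolo-obj.cfg backup/yolo-obj_1000.weights -gpus 0,1,2,3
```

If I remember correctly, the README also suggests lowering learning_rate and increasing burn_in in the cfg in proportion to the number of GPUs for small datasets.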
And I notice that when I stop the initial single-GPU training (let's say after 1000 or 2000 steps) and restart it, with multiple GPUs or even with a single GPU, the loss starts from a higher value than it had when the initial training was stopped (i.e. at the weights saved at 1000 or 2000 steps).
Could you provide an explanation for this? Is there any way I could prevent this? I have even encountered a case where, when I restarted training on a multi-GPU machine, the loss spiked so much that I ended up getting nan for both the loss and the avg loss.
Thanks