NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

Training at starting epoch #250

Closed CorbinFerrie closed 2 years ago

CorbinFerrie commented 2 years ago

Hello, I was wondering whether it is possible to resume training from a previous epoch. I am running into stability issues on my PC that cause the training script (and my entire PC) to crash at random. For example, if I have trained for 20 epochs and my PC crashes, can I resume from the 20th epoch instead of starting over? This would save me days of headache. Thanks

mintar commented 2 years ago

Yes! The training script saves the weights file (*.pth) after each epoch. You can resume training by adding --net <path-to-latest-pth-file> to your training script command.
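If the machine crashes often, picking the newest weights file by hand gets tedious. The snippet below is a minimal sketch of one way to find it, assuming all *.pth files sit in a single output directory and that the most recently modified one is the latest epoch. The function name latest_checkpoint and the directory name in the usage note are hypothetical and not part of the DOPE scripts.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(directory: str) -> Optional[Path]:
    """Return the most recently modified *.pth file in `directory`.

    Hypothetical helper: assumes the newest file by modification time
    corresponds to the last completed epoch. Returns None if the
    directory contains no *.pth files.
    """
    candidates = list(Path(directory).glob("*.pth"))
    if not candidates:
        return None
    # Compare by modification time rather than by file name, since the
    # naming scheme of the weights files is not assumed here.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path could then be passed as the value of --net, for example after calling latest_checkpoint("my_training_output") on whatever directory the weights are written to.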

AFAIK, this is not 100% the same as letting it run continuously, because the Adam optimizer's internal state (its running gradient statistics) will be re-initialized. This means that for the next few epochs after resuming, the loss will go up a bit and then go down again, but in my experience that effect is relatively minor. I've resumed training often and never had a problem.
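For anyone who wants to avoid the optimizer re-initialization described above, the usual PyTorch pattern is to store the optimizer state next to the model weights. The following is a generic sketch of that pattern, not what the DOPE training script does. The function names and the dictionary keys ("epoch", "model_state", "optimizer_state") are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    """Save model weights, optimizer state, and the finished epoch number."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            # Includes Adam's per-parameter moment estimates and step count.
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer in place; return the epoch to resume at."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    # The saved epoch has already finished, so training continues at the next one.
    return checkpoint["epoch"] + 1
```

A training loop would call save_checkpoint at the end of each epoch and, on restart, build the model and optimizer as usual before calling load_checkpoint. A file written this way has a different layout from a plain weights file, so it would not be a drop-in replacement for the file passed to --net.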

CorbinFerrie commented 2 years ago

Thanks for the quick reply!