Heads up for other users who want to resume training from a checkpoint. You will want to:
- de-indent DDP_main.py:80 so that all devices load the checkpoint
- load the optimizer and scheduler states at DDP_main.py:146
- set the dataloader index to the correct example before actually training
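For anyone who wants a starting point, here is a rough sketch of those three steps. It is not the repo's actual code: the checkpoint keys (`model`, `optimizer`, `scheduler`, `epoch`, `batches_done`) and both helper names are made up for illustration, so adapt them to whatever DDP_main.py really saves.

```python
import itertools


def restore_training_state(checkpoint, model, optimizer, scheduler):
    """Restore model, optimizer, and scheduler from an already-loaded checkpoint dict.

    Call this on every rank, not only on rank 0.
    The dictionary keys used here are hypothetical.
    """
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    scheduler.load_state_dict(checkpoint["scheduler"])
    # Return where training stopped so the caller can resume from there.
    return checkpoint["epoch"], checkpoint["batches_done"]


def skip_consumed_batches(dataloader, batches_done):
    """Return an iterator that starts after the first `batches_done` batches."""
    # islice still pulls the skipped batches from the loader and discards them.
    return itertools.islice(dataloader, batches_done, None)
```

In a PyTorch script the dict would typically come from `torch.load(path, map_location="cpu")`, run on every process. Skipping by count only lands on the right example if the batch order is reproducible, for instance by calling `set_epoch(epoch)` on a `DistributedSampler` before iterating.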
I'm not totally sure this covers everything (logging, for example), but it might work OK.
Note: There's also a separate issue that your checkpoints might get overwritten between epochs, so be sure you're loading the right thing and saving where you want.
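One possible way around the overwriting, sketched under my own assumptions: write one file per epoch and pick the newest when resuming. The file naming scheme and both function names are hypothetical and are not what the repo does today.

```python
from pathlib import Path


def checkpoint_path(save_dir, epoch):
    """Build a per-epoch file name so a later epoch never overwrites an earlier one."""
    # Zero padding keeps lexicographic order equal to numeric order (up to epoch 9999).
    return Path(save_dir) / f"checkpoint_epoch_{epoch:04d}.pt"


def latest_checkpoint(save_dir):
    """Return the highest-epoch checkpoint in save_dir, or None if there is none."""
    candidates = sorted(Path(save_dir).glob("checkpoint_epoch_*.pt"))
    return candidates[-1] if candidates else None
```

Pass `checkpoint_path(save_dir, epoch)` to whatever call saves the checkpoint, and `latest_checkpoint(save_dir)` to whatever call loads it. It is still worth printing the chosen path at startup to confirm the right file is being loaded.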
Thanks for the code release!