Hzfinfdu / Diffusion-BERT

ACL'2023: DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models
Apache License 2.0

Resuming training via `--load_step` #30

Open justinchiu opened 10 months ago

justinchiu commented 10 months ago

Thanks for the code release!

Heads up for other users who want to resume training from a checkpoint: you will want to

  1. de-indent DDP_main.py:80 so that all devices can load the checkpoint
  2. load the optimizer and scheduler states at DDP_main.py:146
  3. advance the dataloader to the correct example before training actually starts
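
For anyone who wants a starting point, here is a rough sketch of steps 2 and 3. It is not the repo's code: the checkpoint keys (`"model"`, `"optimizer"`, `"scheduler"`, `"step"`) and all three function names are made up for illustration, and I'm assuming the saved step counts dataloader batches. The `checkpoint` dict would come from something like `torch.load(path, map_location="cpu")`, called on every rank (that's step 1).

```python
from itertools import islice


def restore_training_state(checkpoint, model, optimizer, scheduler):
    """Restore model, optimizer and scheduler state; return the saved step.

    `checkpoint` is an already-loaded dict. The key names are hypothetical
    and need to match whatever the training script actually saves.
    """
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    scheduler.load_state_dict(checkpoint["scheduler"])
    return checkpoint["step"]


def resume_position(global_step, batches_per_epoch):
    """Split a global batch count into (epoch, batches already done in that epoch)."""
    return divmod(global_step, batches_per_epoch)


def remaining_batches(dataloader, batches_to_skip):
    """Iterate over the dataloader, discarding the batches already trained on."""
    return islice(dataloader, batches_to_skip, None)
```

Usage would look roughly like `epoch, skip = resume_position(step, len(train_loader))` followed by `for batch in remaining_batches(train_loader, skip): ...` for the first resumed epoch only. Skipping batches only lands on the same examples if the shuffle order is reproducible, e.g. a seeded `DistributedSampler` with `set_epoch(epoch)` called before iterating.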

I'm not totally sure this covers everything (logging, for example), but it might work OK.

Note: There's also a separate issue where your checkpoints might get overwritten between epochs, so be sure you're loading the right file and saving where you intend.
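
One possible way around the overwriting is to put the epoch and step in the checkpoint filename. This is only a sketch: the naming scheme and both helper names are my own invention, not what the repo does, and `step` is assumed to be a global step that keeps increasing across epochs.

```python
from pathlib import Path


def checkpoint_path(save_dir, epoch, step):
    """Build a filename that is unique per (epoch, step), so saves never collide."""
    return Path(save_dir) / f"checkpoint_epoch{epoch:03d}_step{step:08d}.pt"


def latest_checkpoint(save_dir):
    """Return the newest checkpoint in save_dir, or None if there is none.

    Zero-padded numbers make lexicographic order match (epoch, step) order.
    """
    candidates = sorted(Path(save_dir).glob("checkpoint_epoch*_step*.pt"))
    return candidates[-1] if candidates else None
```

Saving to `checkpoint_path(save_dir, epoch, step)` and resuming from `latest_checkpoint(save_dir)` at least makes it explicit which file you're loading.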