About resume checkpoint

kaiw7 commented 3 months ago

Hi, I noticed that there may be some issues when resuming from the latest checkpoint. In the training script, only the 'EMA' checkpoint is saved each time; the 'model' checkpoint is not saved. If the running job is interrupted, the most recent 'EMA' checkpoint is loaded to initialize model.state_dict. I am not sure this is correct, because generally the 'model' checkpoint should be loaded into model.state_dict. In addition, when the script is re-run to resume training, the 'EMA' is again initialized from a random model state_dict. I am wondering whether the 'EMA' should instead be initialized from the most recent 'EMA' checkpoint?

So, I would like to confirm whether there are any differences between training without interruption and resuming training after an interruption. Many thanks.

maxin-cn commented 3 months ago

Hi, the EMA is not initialized randomly; it is synchronized with the model's parameters. Please see here.
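To illustrate what "synchronizes" can mean here, below is a minimal, framework-free sketch. It assumes the synchronization is done as an EMA update with decay 0, which is one common pattern in diffusion training scripts; the linked code may differ. Plain dicts of floats stand in for parameter tensors, and `update_ema` and the parameter names are hypothetical.

```python
def update_ema(ema_params, model_params, decay=0.9999):
    """Move each EMA value toward the matching model value, in place."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value


# Hypothetical parameters: plain floats stand in for tensors.
model_params = {"w": 0.5, "b": -1.0}
ema_params = {"w": 3.0, "b": 7.0}  # arbitrary starting values

# With decay=0 the old EMA values are discarded entirely,
# so the EMA becomes an exact copy of the current model.
update_ema(ema_params, model_params, decay=0.0)
synced_copy = dict(ema_params)

# A later training step changes the model; the EMA only follows it partially.
model_params["w"] = 1.5
update_ema(ema_params, model_params, decay=0.9)
```

Note that under this assumption the EMA after a restart is only as good as whatever weights the model was loaded with, which is the concern raised above.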

kaiw7 commented 3 months ago

Sorry, I meant that the EMA does not load the saved EMA checkpoint.

resume_from_checkpoint is still a TODO feature and is not finished yet, so check it carefully if you want to use it.
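For anyone who wants to patch this locally, here is a minimal sketch of a checkpoint that keeps the raw model weights and the EMA weights separately, so that each can be restored into its own object on resume. This is not the repository's code: `save_checkpoint`, `load_checkpoint`, and the keys `"model"`, `"ema"`, `"opt"`, and `"step"` are hypothetical, plain dicts stand in for `state_dict()` outputs, and `pickle` stands in for whatever serializer the training script actually uses.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, model_state, ema_state, opt_state, step):
    """Write model, EMA, optimizer state, and step counter into one file."""
    checkpoint = {"model": model_state, "ema": ema_state, "opt": opt_state, "step": step}
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(checkpoint, f)
    # Rename only after the write finished, so an interrupted job
    # does not leave a truncated file under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back the dict written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical states: plain dicts stand in for state_dict() outputs.
ckpt_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(ckpt_dir, "0001000.pkl")
save_checkpoint(ckpt_path, {"w": 0.52}, {"w": 0.50}, {"lr": 1e-4}, step=1000)

# On resume, restore each piece into its own object instead of
# loading the EMA weights into the model and re-deriving the EMA from it.
restored = load_checkpoint(ckpt_path)
model_state = restored["model"]
ema_state = restored["ema"]
opt_state = restored["opt"]
start_step = restored["step"]
```

With both sets of weights and the optimizer state restored, a resumed run should start from the same state an uninterrupted run had at that step, apart from data order and random number generator state, which this sketch does not cover.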

Vchitect / Latte

About resume checkpoint #55