Open kaiw7 opened 3 months ago
Hi, I noticed that there may be some issues when resuming from the latest checkpoint. In the training script, only the 'EMA' checkpoint is saved each time; the 'model' checkpoint is not saved. If the running job is interrupted, the 'EMA' checkpoint from the most recent training steps is loaded to initialize model.state_dict. I am not sure this is correct, because generally the 'model' checkpoint should be loaded into model.state_dict. In addition, when the script is re-run to resume training, the 'EMA' is again initialized from a random model state_dict. I am wondering whether the 'EMA' should instead be initialized from the 'EMA' checkpoint of the most recent training steps.
So, I would like to confirm whether there are differences between a full training run (without interruption) and a run that is resumed after an interruption. Many thanks.
Hi, `ema` is not initialized randomly; it synchronizes the parameters of the model. Please see here.
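To illustrate the reply above, here is a minimal, framework-free sketch of an EMA tracker that copies the model's current parameters when it is constructed. The `ToyEMA` class, its `decay` argument, and the plain dict of floats standing in for a state dict are all hypothetical; this is not the repository's implementation.

```python
import copy


class ToyEMA:
    """Hypothetical EMA tracker; parameters are a plain dict of floats."""

    def __init__(self, model_params, decay=0.999):
        self.decay = decay
        # Synchronize with the model: the shadow copy starts as the model's
        # current parameters, not as a fresh random initialization.
        self.shadow = copy.deepcopy(model_params)

    def update(self, model_params):
        # shadow <- decay * shadow + (1 - decay) * param
        for name, value in model_params.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * value
            )


# Assumed values, standing in for parameters the model currently holds
# (for example, after loading a checkpoint).
model_params = {"w": 0.5, "b": -1.0}
ema = ToyEMA(model_params)
```

Under this assumption, if the model is first loaded from the saved EMA weights and the EMA is built afterwards, the EMA starts from those saved values. The raw (non-averaged) model weights are still not recovered, which is the concern raised in the original post.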
Sorry, I mean that the ema does not load the saved ema checkpoint.
`resume_from_checkpoint` is a TODO function, which is not perfect yet. So check it carefully if you want to use it.
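For anyone checking the resume path, the toy sketch below shows why the question matters. It compares an uninterrupted run with (A) a resume that restores both the model and the EMA state and (B) a resume that starts both from the saved EMA only. Everything here (`train_step`, `ema_update`, `run`, the checkpoint keys) is hypothetical and uses plain floats rather than the repository's code.

```python
import copy


def train_step(params, step):
    # Toy deterministic update standing in for a real optimizer step.
    return {k: v - 0.1 * (v - step) for k, v in params.items()}


def ema_update(shadow, params, decay=0.9):
    # shadow <- decay * shadow + (1 - decay) * param
    return {k: decay * shadow[k] + (1.0 - decay) * params[k] for k in shadow}


def run(params, shadow, start, stop):
    for step in range(start, stop):
        params = train_step(params, step)
        shadow = ema_update(shadow, params)
    return params, shadow


init = {"w": 0.0}

# Uninterrupted run: 10 steps.
full_model, full_ema = run(init, copy.deepcopy(init), 0, 10)

# Interrupted after 5 steps; the checkpoint stores BOTH states and the step.
model5, ema5 = run(init, copy.deepcopy(init), 0, 5)
ckpt = {"model": model5, "ema": ema5, "step": 5}

# Resume A: restore the model and the EMA from their own saved states.
res_model, res_ema = run(ckpt["model"], ckpt["ema"], ckpt["step"], 10)

# Resume B: only the EMA was saved, so the model and the EMA both start from it.
ema_only_model, ema_only_ema = run(
    ckpt["ema"], copy.deepcopy(ckpt["ema"]), ckpt["step"], 10
)

matches_full = res_model == full_model and res_ema == full_ema
ema_only_matches = ema_only_model == full_model
```

In this toy setting, resume A reproduces the uninterrupted run and resume B does not. In a real training job, the optimizer state, learning-rate schedule, and random-number state would usually need to be saved as well for an exact match.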