How can I resume the training with the latest checkpoint with wandb?

131250208 / TPlinker-joint-extraction


How can I resume the training with the latest checkpoint with wandb? #41

Closed jarork closed 3 years ago

jarork commented 3 years ago

I trained the model with wandb for over a day, but the run was cut off by a network interruption. When I started training again, it did not resume from the latest checkpoint; it restarted from scratch.

131250208 commented 3 years ago

If you have saved the last model state, set "model_state_dict_path" to its path and set "fr_scratch" to "False"; the model will then load that state before training starts. By default, model states are saved under "./wandb" or "./default_log_dir".
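As a rough illustration of the step above, the sketch below looks for the most recently modified saved state under the two default directories and builds the two config values to set. The helper names `find_latest_checkpoint` and `build_resume_overrides` are hypothetical, and the `*.pt` file pattern is an assumption about how the states are named; only the keys "model_state_dict_path" and "fr_scratch" and the two directories come from this thread.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_latest_checkpoint(search_dirs: Iterable[str], pattern: str = "*.pt") -> Optional[Path]:
    """Return the most recently modified file matching `pattern`, or None.

    The "*.pt" pattern is an assumption; adjust it to the actual file names.
    """
    candidates = []
    for d in search_dirs:
        root = Path(d)
        if root.is_dir():  # skip directories that do not exist
            candidates.extend(p for p in root.rglob(pattern) if p.is_file())
    if not candidates:
        return None
    # "Latest" here means newest modification time, not best score.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def build_resume_overrides(ckpt: Path) -> dict:
    """Build the two config values mentioned above (hypothetical helper)."""
    return {"model_state_dict_path": str(ckpt), "fr_scratch": False}


if __name__ == "__main__":
    latest = find_latest_checkpoint(["./wandb", "./default_log_dir"])
    if latest is None:
        print("No saved model state found; training would start from scratch.")
    else:
        print(build_resume_overrides(latest))
```

The printed values would still need to be copied into the training config by hand. Note that this restores only the model weights named by the path; whether optimizer or scheduler state is also restored is not covered in this thread.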

jarork commented 3 years ago

As far as I know, checkpoints are only saved when rel_f1 or ent_f1 improves, but it has been 10 hours since my last checkpoint was saved (no improvement during this 10-hour plateau). Is there a way to automatically save a checkpoint every x epochs without modifying the code? (I'd like to investigate whether any significant improvement shows up after a long plateau.)

Many thanks!

131250208 commented 3 years ago

@jarork Then you have to modify the checkpoint-saving code so that it saves a checkpoint regardless of whether there is any improvement.
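A minimal sketch of what such a change could look like, assuming the training loop tracks an epoch index and an "improved" flag. This is not the repository's actual saving code: `should_save_checkpoint`, `maybe_save_checkpoint`, `save_every`, `save_fn`, and the file name pattern are all hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def should_save_checkpoint(epoch: int, improved: bool, save_every: int) -> bool:
    """Save when the metric improved, or on every `save_every`-th epoch.

    `epoch` is zero-based; `save_every <= 0` disables periodic saving.
    """
    periodic = save_every > 0 and (epoch + 1) % save_every == 0
    return improved or periodic


def maybe_save_checkpoint(
    state: Any,
    ckpt_dir: str,
    epoch: int,
    improved: bool,
    save_every: int,
    save_fn: Callable[[Any, Path], None],
) -> Optional[Path]:
    """Write `state` with `save_fn` if a save is due; return the path or None."""
    if not should_save_checkpoint(epoch, improved, save_every):
        return None
    # Hypothetical file name; one file per epoch so older states are kept.
    path = Path(ckpt_dir) / f"model_state_dict_epoch_{epoch}.pt"
    path.parent.mkdir(parents=True, exist_ok=True)
    save_fn(state, path)
    return path
```

In a PyTorch training loop, one would presumably call this at the end of each epoch with `model.state_dict()` as `state` and `torch.save` as `save_fn`. Keeping one file per epoch uses more disk space than overwriting a single file, but it lets you compare states from before and after a plateau.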