travel-go closed 7 years ago
I ended up at model_epoch13.th7 and I want to continue training my model.
I assume the aborted process created some checkpoints? If you re-start the training with the same -savedir
(and command-line arguments), it will automatically resume from the last checkpoint.
CUDA_VISIBLE_DEVICES=0,3 fairseq train -sourcelang en -targetlang de -datadir data-bin/news_bpe_2014 -model fconv -nenclayer 15 -nlayer 15 -fconv_nhids 512,512,512,512,512,512,512,512,512,512,768,768,768,2048,2048 -fconv_nlmhids 512,512,512,512,512,512,512,512,512,512,768,768,768,2048,2048 -fconv_kwidths 3,3,3,3,3,3,3,3,3,3,3,3,3,1,1 -fconv_klmwidths 3,3,3,3,3,3,3,3,3,3,3,3,3,1,1 -dropout 0.2 -optim nag -lr 0.25 -clip 0.1 -momentum 0.99 -timeavg -bptt 0 -savedir trainings/final -validbleu -batchsize 48 -maxbatch 1200 &
This is my training command. Should I re-use this command as is, or do I need to add other parameters?
It should be fine to re-start it exactly like this. After startup, the program should immediately print something like "Found existing state, attempting to resume training".
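One way to confirm that the restart really resumed is to send the training output to a log file and search it for that message. The sketch below assumes the output was redirected to a file; the function name `saw_resume_message` and the log file name `train.log` are made up for illustration, and the message wording is only "something like" what the program prints, so the search pattern may need adjusting.

```shell
# Succeeds (exit status 0) if the given log file contains the resume
# message quoted in the reply. The wording is approximate, so only
# its first three words are matched.
saw_resume_message() {
    grep -q "Found existing state" "$1"
}

# Hypothetical usage, after restarting the same training command with
# its output redirected to train.log (an assumed file name):
#   if saw_resume_message train.log; then
#       echo "training resumed from a checkpoint"
#   else
#       echo "no resume message found yet"
#   fi
```

If the message never shows up, it is worth checking that -savedir points at the same directory as in the aborted run.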
model_best_opt.th7 model_epoch12.th7 model_epoch3.th7 model_epoch7.th7 state_epoch11.th7 state_epoch2.th7 state_epoch6.th7 state_last.th7 model_best.th7 model_epoch13.th7 model_epoch4.th7 model_epoch8.th7 state_epoch12.th7 state_epoch3.th7 state_epoch7.th7 model_epoch10.th7 model_epoch1.th7 model_epoch5.th7 model_epoch9.th7 state_epoch13.th7 state_epoch4.th7 state_epoch8.th7 model_epoch11.th7 model_epoch2.th7 model_epoch6.th7 state_epoch10.th7 state_epoch1.th7 state_epoch5.th7 state_epoch9.th7
These are the files saved from my training run. Thank you very much.
Wow, that's great! Thanks for your reply.
FYI, the crucial one for resuming is state_last.th7.
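Before restarting, it can help to check the save directory for that file and see which epoch was saved last. This is a minimal sketch that relies only on the file names listed in this thread (state_last.th7 and model_epochN.th7); the helper names `can_resume` and `latest_epoch` are hypothetical and not part of fairseq.

```shell
# Succeeds if the save directory contains state_last.th7, the file
# named above as the crucial one for resuming.
can_resume() {
    [ -f "$1/state_last.th7" ]
}

# Prints the highest N for which model_epochN.th7 exists in the
# directory; prints nothing if there are no such files.
latest_epoch() {
    for f in "$1"/model_epoch*.th7; do
        [ -e "$f" ] || continue      # glob matched nothing
        n=${f##*/model_epoch}        # strip directory and "model_epoch"
        n=${n%.th7}                  # strip the extension
        echo "$n"
    done | sort -n | tail -n 1
}

# Hypothetical usage with the -savedir from the command in this thread:
#   if can_resume trainings/final; then
#       echo "latest saved epoch: $(latest_epoch trainings/final)"
#   fi
```

The numeric sort matters here because a plain alphabetical listing would put model_epoch9.th7 after model_epoch13.th7.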
Hello, I am a newbie. While I was training the model, I accidentally closed the process. How can I continue from the previously trained model?