Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

How can I continue my predictor training when interrupted? #63

Closed hcdeng6 closed 4 years ago

hcdeng6 commented 4 years ago

Hi, I am training a predictor on a very large corpus, configured for 6 epochs in total. Each epoch takes more than 24 hours because of the corpus size. My machine does not seem able to handle such a heavy workload, and the program was interrupted twice during the 4th epoch. Restarting kiwi from scratch would throw away the epochs already completed, so I would like to know how to obtain a checkpoint or resume predictor training from the point where it was interrupted. Could you tell me what I should do? Thank you.

kepler commented 4 years ago

Hi @hcdeng6,

You should use the --resume flag and specify either --output-dir or --run-uuid to point to your partially trained model (https://unbabel.github.io/OpenKiwi/cli/train.html#training-save-load).
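As a rough illustration of the advice above, here is a minimal Python sketch that assembles such a resume command. Only `--resume`, `--output-dir`, and `--run-uuid` come from this thread. The `kiwi train --config <file>` invocation is an assumption about how the original run was started, the helper `build_resume_command` and both paths are hypothetical, and requiring exactly one of the two locators is a choice made for this sketch, not something the CLI is known to enforce.

```python
import shlex
from typing import List, Optional


def build_resume_command(
    config_path: str,
    output_dir: Optional[str] = None,
    run_uuid: Optional[str] = None,
) -> List[str]:
    """Build an argv list that resumes an interrupted training run.

    Hypothetical helper: it only assembles arguments and does not call kiwi.
    """
    # Sketch-level rule: point at the partial run in exactly one way.
    if (output_dir is None) == (run_uuid is None):
        raise ValueError("pass exactly one of output_dir or run_uuid")

    # Assumption: the run was originally started with `kiwi train --config <file>`.
    cmd = ["kiwi", "train", "--config", config_path, "--resume"]

    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    else:
        cmd += ["--run-uuid", run_uuid]
    return cmd


if __name__ == "__main__":
    # Hypothetical config file and output directory; substitute your own.
    cmd = build_resume_command(
        "experiments/train_predictor.yaml",
        output_dir="runs/predictor",
    )
    # Print a shell-quoted command line that can be copied into a terminal.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run from a script, for example from a wrapper that restarts training after a crash.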

captainvera commented 4 years ago

Hey @hcdeng6, I'm going to assume this issue has been solved.

Feel free to re-open it if you still have problems.