allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Question: finetuning pre-trained ELMo and checkpoint #147

Closed jerrygaoLondon closed 5 years ago

jerrygaoLondon commented 5 years ago

Hi, I'm attempting to fine-tune the pre-trained ELMo model on my medium-sized, domain-specific corpus (hundreds of millions of tokens) using the bin/restart.py script, with 'n_epochs' set to 3. With one or two GPUs it always takes more than 4 days, and due to an environment restriction, 4 days of continuous training is my maximum. This means I have to restart from the checkpoint whenever the process is interrupted or killed.

My question is: how does checkpointing in ELMo work if I have to restart training/fine-tuning on the same corpus? How can I tell when at least one epoch has completed? The current log does not show progress toward epoch completion. I'm concerned that if even one epoch cannot finish within 4 days, training will never complete no matter how many times it is restarted.

Many thanks for your insight.

matt-peters commented 5 years ago

Training will write out a checkpoint every 1250 batches:

https://github.com/allenai/bilm-tf/blob/master/bilm/training.py#L889
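That checkpoint cadence can be turned into a rough progress estimate. A minimal sketch, assuming the bilm-tf convention that one "epoch" covers roughly `n_train_tokens / (batch_size * unroll_steps * n_gpus)` batches; all numeric values below are illustrative assumptions, not measurements from this thread:

```python
# Rough estimate of how many batches make up one epoch, how many
# checkpoints that implies, and whether an epoch fits in a fixed
# wall-clock budget. All numeric values are illustrative assumptions.

def batches_per_epoch(n_train_tokens, batch_size, unroll_steps, n_gpus):
    """One bilm-tf 'epoch' processes roughly this many batches."""
    return n_train_tokens // (batch_size * unroll_steps * n_gpus)

def epoch_fits_in_budget(n_batches, batches_per_sec, budget_hours):
    """Check whether n_batches can complete within the wall-clock budget."""
    return n_batches / batches_per_sec <= budget_hours * 3600

CHECKPOINT_EVERY = 1250  # batches, per bilm/training.py

# Assumed setup: ~200M-token corpus, batch size 128, unroll 20, 2 GPUs.
n_batches = batches_per_epoch(200_000_000, 128, 20, 2)
n_checkpoints = n_batches // CHECKPOINT_EVERY

print(n_batches)      # batches in one epoch
print(n_checkpoints)  # checkpoints written per epoch
# With a throughput measured from your own logs, check the 4-day limit:
print(epoch_fits_in_budget(n_batches, batches_per_sec=0.2, budget_hours=96))
```

Timing a few hundred batches from the training log gives the `batches_per_sec` figure, which answers whether one epoch can finish inside the 4-day window.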

jerrygaoLondon commented 5 years ago

Thanks a lot for your reply. I now understand the code that saves state every 1250 batches. My question is actually about whether training will resume from the middle of an epoch.

The current log does not report progress during training, and it does not show where training resumes after restoring the last state (in particular, the last batch number and epoch number) from the checkpoint file.

matt-peters commented 5 years ago

Training will resume from the previous checkpoint if you use bin/restart.py. Training doesn't have any notion of resuming in the middle of an epoch; when you restart, it will train for however many epochs were specified.
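Since training only tracks a global batch counter, one way to infer where a restarted run stands is to read the step out of the latest checkpoint filename and convert it back to an epoch/batch position. A hypothetical helper, assuming the standard tf.train.Saver naming (`model.ckpt-<global_step>`) and an illustrative figure of ~39,000 batches per epoch:

```python
import re

def position_from_checkpoint(ckpt_name, batches_per_epoch):
    """Map a checkpoint name like 'model.ckpt-12500' to a
    (completed_epochs, batch_within_epoch) pair.

    Assumes the standard tf.train.Saver convention that the
    numeric suffix is the global step (total batches so far).
    """
    m = re.search(r"-(\d+)$", ckpt_name)
    if m is None:
        raise ValueError(f"no global step found in {ckpt_name!r}")
    global_step = int(m.group(1))
    return divmod(global_step, batches_per_epoch)

# Example with assumed numbers: 50,000 batches done, ~39,000 per epoch
# means 1 full epoch completed plus 11,000 batches into the second.
print(position_from_checkpoint("model.ckpt-50000", 39000))  # (1, 11000)
```

This only estimates progress from the checkpoint name; as noted above, restarting still trains for the full number of epochs specified regardless of this position.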