I want to store a checkpoint after every epoch and start training the next epoch from that stored checkpoint.
I want the training state to continue seamlessly across epochs. For example, when training epoch 2 from the checkpoint of epoch 1, the learning rate schedule, epoch number, and so on should be the same as if I had trained epochs 1 and 2 together in a single run (the vanilla training process).
My implementation uses the `--recover` argument. AllenNLP stores a checkpoint after every epoch, so for every epoch after the first I add `--recover` to the training command, hoping that the model's parameters and training state will be restored. However, this seems wrong: in my testing, training epoch 2 from the checkpoint of epoch 1 gives different results from training epochs 1 and 2 together.
I tried hard to read the AllenNLP documentation but found it difficult to figure out the problem. Does anyone have comments on my implementation, or another way to fulfill these requirements? Thanks a lot!
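For reference, the workflow I'm describing looks roughly like this (the config file name and serialization directory here are just placeholders):

```shell
# First run: train from scratch; checkpoints go into the serialization dir
allennlp train my_config.jsonnet -s ./output

# Subsequent runs: resume from the latest checkpoint in the same dir
allennlp train my_config.jsonnet -s ./output --recover
```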
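One possible culprit I've been wondering about (this is an assumption, not something I've confirmed in AllenNLP's source): if the random-number-generator state used for data shuffling and dropout is not saved in the checkpoint, the resumed run will draw a different random sequence than the continuous run, even when model and optimizer state are restored perfectly. A minimal pure-Python sketch of the effect, with a hypothetical `train_epochs` standing in for the trainer:

```python
import random

def train_epochs(n_epochs, rng_state=None):
    # Hypothetical "training": each epoch consumes one random draw,
    # standing in for data shuffling / dropout randomness.
    rng = random.Random()
    if rng_state is not None:
        rng.setstate(rng_state)  # resume the RNG where it left off
    else:
        rng.seed(0)              # fresh run: seed from scratch
    losses = [rng.random() for _ in range(n_epochs)]
    return losses, rng.getstate()

# Vanilla: two epochs in a single run.
vanilla, _ = train_epochs(2)

# Epoch 1, then epoch 2 resumed WITHOUT restoring RNG state:
# the second run reseeds from scratch and repeats epoch 1's draws.
e1, state = train_epochs(1)
e2_no_restore, _ = train_epochs(1)

# Epoch 2 resumed WITH the saved RNG state matches the vanilla run.
e2_restored, _ = train_epochs(1, rng_state=state)

assert e1 + e2_restored == vanilla
assert e1 + e2_no_restore != vanilla
```

If this is the cause, the two runs would agree only when the checkpoint also captures and restores every source of randomness the trainer touches.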