Open JerryZeyu opened 6 years ago
@JerryZeyu I see the same behavior. I suspect something in the optimizer state is not being fully saved, but I haven't figured out the exact reason.
@BangLiu @JerryZeyu I figured it out! It is caused by EMA. The EMA is initialized in QANet_main.py before the model is resumed. The initialization should be moved to after the model-resuming operation, e.g. after #76 in QANet_trainer.py.
Do you mean that the EMA should not be initialized in QANet_main.py at all, or just that self.ema = ema should be moved after #76 in QANet_trainer.py? I also find that the scheduler affects it, because the scheduler needs to step() once per epoch and shouldn't step after every iteration (see the sketch below). Thanks
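For reference, a minimal, self-contained sketch (not the repository's actual trainer code) of the per-epoch scheduler placement being described, using a standard torch.optim.lr_scheduler.StepLR and toy data:

```python
import torch
from torch.optim.lr_scheduler import StepLR

# Toy model and data, only to illustrate where scheduler.step() belongs.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=3, gamma=0.5)  # decay LR every 3 epochs

data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(5)]

for epoch in range(10):
    for x, y in data:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        # scheduler.step() does NOT belong here: calling it every batch
        # would advance the decay schedule once per batch instead of once per epoch.
    scheduler.step()  # advance the LR schedule once per epoch
    print(epoch, optimizer.param_groups[0]['lr'])
```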
@JerryZeyu 'ema.register' should be called after 'self._resume_checkpoint(resume)'
I tried your method, but there are still some problems. For example, if I resume from epoch 10 and continue training, the performance at epochs 11, 12, 13, 14, ... stays the same as at epoch 10 and does not improve. Can you tell me how to solve this? Thank you very much.
@JerryZeyu It works for me after just moving the EMA setup from main.py to trainer.py:

```python
if resume:
    self._resume_checkpoint(resume)
    self.model = self.model.to(self.device)
    for state in self.optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(self.device)

if self.use_ema:
    for name, param in self.model.named_parameters():
        if param.requires_grad:
            self.ema.register(name, param.data)
```
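For context on why the ordering matters: a typical EMA helper (sketched below; this is an assumption about how such a class usually works, not the repository's actual code) snapshots param.data when register is called, so registering before _resume_checkpoint would store the freshly initialized weights rather than the restored ones.

```python
class EMA:
    """Minimal exponential-moving-average helper (illustrative sketch only)."""

    def __init__(self, decay):
        self.decay = decay
        self.shadow = {}

    def register(self, name, val):
        # Snapshots the CURRENT parameter values. If this runs before the
        # checkpoint is restored, the shadow holds the untrained weights.
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        # Blend the new value into the running average and return it.
        new_average = self.decay * self.shadow[name] + (1.0 - self.decay) * x
        self.shadow[name] = new_average.clone()
        return new_average
```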
Why is it that when I resume a checkpoint to continue training, the loss is normal but the test result of the new epoch is very low, like that of the first epoch? For example, I train the model for 10 epochs, then resume the 10th checkpoint and continue with the 11th epoch. During the 11th epoch the loss is normal and low, but when the 11th epoch finishes, its test result is very low, like the 1st epoch's result. Can you tell me the reason? Thank you very much.
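One possible explanation (an assumption, not verified against this repository): if evaluation uses the EMA weights, then re-registering the EMA from the resumed parameters discards the averaged weights accumulated over the earlier epochs, and if register still runs before the checkpoint load, the shadow even starts from random initialization. A hedged sketch of saving and restoring the shadow dictionary alongside the model/optimizer state (key names like 'ema_shadow' are hypothetical):

```python
import torch

# Tiny stand-ins so the snippet runs on its own; in the real trainer these
# would be the QANet model, its optimizer, and the EMA object's shadow dict.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())
ema_shadow = {n: p.data.clone() for n, p in model.named_parameters() if p.requires_grad}

# --- saving: include the EMA shadow alongside the usual state ---
torch.save({
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'ema_shadow': ema_shadow,                 # hypothetical key name
}, 'checkpoint.pth.tar')

# --- resuming: restore the shadow instead of re-registering from scratch ---
ckpt = torch.load('checkpoint.pth.tar')
model.load_state_dict(ckpt['model_state'])
optimizer.load_state_dict(ckpt['optimizer_state'])
if 'ema_shadow' in ckpt:
    ema_shadow = ckpt['ema_shadow']           # keep the accumulated averages
else:
    # Older checkpoints without the shadow: fall back to the resumed weights.
    ema_shadow = {n: p.data.clone() for n, p in model.named_parameters() if p.requires_grad}
```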