BangLiu / QANet-PyTorch

Re-implement "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension"
MIT License

Resume checkpoint #4

Open JerryZeyu opened 6 years ago

JerryZeyu commented 6 years ago

When I resume a checkpoint to continue training, why is the loss normal but the test result of the new epoch very low, like the first epoch's? For example, I train the model for 10 epochs, then resume the 10th checkpoint and continue training the 11th epoch. During the 11th epoch the loss is normal and low, but when the 11th epoch ends, its test result is very low, like the 1st epoch's result. Can you tell me the reason? Thank you very much.

BangLiu commented 6 years ago

@JerryZeyu I also see this phenomenon. I think maybe something in the optimizer is not completely saved. However, I haven't figured out the reason.
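For reference, resuming relies on the checkpoint containing both the model and the optimizer state. A minimal sketch of what that save/restore can look like (the key names and helper functions here are illustrative, not the repo's actual ones):

    import torch

    # Hypothetical helpers: persist and restore everything needed to continue training.
    def save_checkpoint(path, epoch, model, optimizer):
        torch.save({
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        }, path)

    def resume_checkpoint(path, model, optimizer, device):
        ckpt = torch.load(path, map_location=device)
        model.load_state_dict(ckpt["model_state"])
        optimizer.load_state_dict(ckpt["optimizer_state"])
        return ckpt["epoch"] + 1  # next epoch to train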

zhangchen010295 commented 6 years ago

@BangLiu @JerryZeyu I figured it out! It is caused by the EMA. The EMA is initialized in QANet_main.py before the model is resumed; the initialization should be moved to after the resume operation, e.g. after #76 in QANet_trainer.py.
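For context, the EMA keeps a shadow copy of every registered parameter, so calling register on the freshly initialized (pre-resume) weights starts the averages from the wrong values. A minimal sketch of such an EMA class (roughly the usual shape of these implementations, not necessarily this repo's exact code):

    class EMA:
        """Exponential moving average of registered parameters (shadow copies)."""

        def __init__(self, decay):
            self.decay = decay
            self.shadow = {}

        def register(self, name, val):
            # The shadow starts from whatever the weights are at registration time,
            # which is why registration must happen after the checkpoint is restored.
            self.shadow[name] = val.clone()

        def __call__(self, name, x):
            # Blend the current value into the running average and return it.
            new_average = (1.0 - self.decay) * x + self.decay * self.shadow[name]
            self.shadow[name] = new_average.clone()
            return new_average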

JerryZeyu commented 6 years ago

Do you mean that the initialization of the EMA shouldn't be in QANet_main.py at all, or just that self.ema = ema should be moved to after #76 in QANet_trainer.py? I also find that the scheduler influences this: the scheduler needs to step() once after each epoch and shouldn't step after every batch. Thanks.
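On the scheduler point, a rough sketch of a loop where the scheduler steps once per epoch rather than per batch (all names here are illustrative; a warm-up style scheduler would instead step inside the batch loop):

    def train(model, optimizer, scheduler, train_loader, start_epoch, num_epochs, device):
        """Illustrative training loop: the scheduler steps once per epoch, not per batch."""
        model.train()
        for epoch in range(start_epoch, num_epochs):
            for batch in train_loader:
                loss = model(*[t.to(device) for t in batch])  # assume the model returns the loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                # a per-step scheduler (e.g. learning-rate warm-up) would step here instead
            scheduler.step()  # per-epoch schedulers (e.g. StepLR) step once here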

zhangchen010295 commented 6 years ago

@JerryZeyu `ema.register` should be called after `self._resume_checkpoint(resume)`.

JerryZeyu commented 6 years ago

I tried your method, but it still has some problems. For example, if I resume from epoch 10 and continue training, the performance of epochs 11, 12, 13, 14, ... stays the same as epoch 10 and doesn't improve. Can you tell me how to solve it? Thank you very much.

zhangchen010295 commented 6 years ago

@JerryZeyu It works for me by just moving the EMA registration from main.py to trainer.py:

    if resume:
        self._resume_checkpoint(resume)
        self.model = self.model.to(self.device)
        for state in self.optimizer.state.values():
            for k, v in state.items():
                if isinstance(v, torch.Tensor):
                    state[k] = v.to(self.device)

    # moved from main.py because the model may be resumed
    if self.use_ema:
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.ema.register(name, param.data)
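After registration, the shadow values are typically refreshed once per optimizer step during training and then used (instead of the raw weights) for evaluation, which is presumably why a shadow initialized from un-resumed weights drags the test score down. A rough sketch of that per-step update, assuming the EMA interface sketched above:

    # inside the per-batch training loop, right after optimizer.step()
    if self.use_ema:
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.ema(name, param.data)  # fold the updated weights into the shadow average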