continue: checkpoint resume error

geyutang commented 5 years ago

First, I want to check that I am resuming the model the right way. My steps are:

1. Save the trainer checkpoint.
2. Set args.resume to the path where the checkpoint was saved, i.e. runs/debug/checkpoints.pth.
3. Change the epoch setting in args.yaml to a larger number.
4. Run main.py.

After following the steps above, the error still occurs, even after downgrading PyTorch to 1.0.0 (my previous PyTorch version was 1.1.0).

Traceback (most recent call last):
  File "/home/xxxxxx/project/MAR/src/main.py", line 46, in <module>
    main()
  File "/home/xxxxxx/project/MAR/src/main.py", line 35, in main
    meters_trn = trainer.train_epoch(source_loader, target_loader, epoch)
  File "/home/xxxxxx/project/MAR/src/trainers.py", line 155, in train_epoch
    save_checkpoint(self, epoch, os.path.join(self.args.save_path, "checkpoints.pth"))
  File "/home/xxxxxx/project/MAR/src/utils.py", line 442, in save_checkpoint
    torch.save((trainer, epoch), save_path)
  File "/home/xxxxxx/anaconda3/envs/MAR/lib/python3.6/site-packages/torch/serialization.py", line 218, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/xxxxxx/anaconda3/envs/MAR/lib/python3.6/site-packages/torch/serialization.py", line 143, in _with_file_like
    return body(f)
  File "/home/xxxxxx/anaconda3/envs/MAR/lib/python3.6/site-packages/torch/serialization.py", line 218, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/xxxxxx/anaconda3/envs/MAR/lib/python3.6/site-packages/torch/serialization.py", line 291, in _save
    pickler.dump(obj)
TypeError: can't pickle _thread.lock objects

The checkpoint loads successfully, but saving the newly trained checkpoint fails.
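
For reference, here is a minimal sketch of how this kind of TypeError can arise. It assumes the pickled object holds a thread lock somewhere among its attributes; ToyTrainer and try_pickle are hypothetical names used only for illustration and are not taken from the MAR code, so this may or may not be what happens inside the real Trainer.

```python
import pickle
import threading


class ToyTrainer:
    """Hypothetical stand-in for a trainer object that ends up holding a lock."""

    def __init__(self):
        self.epoch = 0
        # Assumption: some attribute (e.g. a logger or a data-loading helper)
        # keeps a thread lock. Locks cannot be pickled.
        self.lock = threading.Lock()


def try_pickle(obj):
    """Return None if obj pickles cleanly, else the TypeError message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


error_message = try_pickle(ToyTrainer())
print(error_message)
```

Pickling a whole object serializes every attribute it holds, so a single unpicklable attribute is enough to make the save fail. Saving only tensors and plain Python values avoids this.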

I wonder whether I need to save the newer checkpoint under another name, e.g. checkpoint2.pth.
Another way to solve my problem would be to change how the model is saved: save only model.state_dict().

If I have to change the save method so that it only saves model.state_dict(), do you have any suggestions for this change? Do I only need to save model.state_dict() and the epoch in the checkpoint? Are there any other details I need to pay attention to?

Thanks for your attention and kind reply.

KovenYu commented 5 years ago

Well, actually I also don't quite understand the mechanism that torch.save() uses, so let's set this weird TypeError aside and simply save model.state_dict().

Yes there is one thing to pay attention to. This loss has three buffers and this has two buffers. So you should also save the buffer values. When loading the checkpoint, please load it AFTER the Trainer is initialized (because the initialization registers the buffers), e.g. insert the loading code here, and set the buffer values.
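
A minimal sketch of this save/load pattern, under stated assumptions: ToyLoss, save_checkpoint, and load_checkpoint below are hypothetical stand-ins, not the actual MAR classes or functions, and the buffers are assumed to be registered with register_buffer (in which case state_dict() includes them by default).

```python
import io

import torch
import torch.nn as nn


class ToyLoss(nn.Module):
    """Hypothetical loss module that keeps running state in a buffer."""

    def __init__(self, dim):
        super().__init__()
        # Buffers registered this way are part of state_dict() by default.
        self.register_buffer("center", torch.zeros(dim))


def save_checkpoint(model, loss_fn, epoch, f):
    # Save only tensors and plain values, never the trainer object itself.
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "loss": loss_fn.state_dict(),
        },
        f,
    )


def load_checkpoint(model, loss_fn, f):
    # Call this AFTER the modules are constructed, so the buffers exist.
    ckpt = torch.load(f, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    loss_fn.load_state_dict(ckpt["loss"])
    return ckpt["epoch"]


# Round trip through an in-memory file instead of a real path.
model, loss_fn = nn.Linear(4, 2), ToyLoss(4)
loss_fn.center.fill_(1.5)
buf = io.BytesIO()
save_checkpoint(model, loss_fn, 3, buf)
buf.seek(0)

new_model, new_loss = nn.Linear(4, 2), ToyLoss(4)
start_epoch = load_checkpoint(new_model, new_loss, buf) + 1
```

If the optimizer state also matters for resuming, optimizer.state_dict() can be stored in the same dictionary in the same way.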

Please feel free to let me know if you run into any further problems.

geyutang commented 5 years ago

Thanks for your kind reply. I will try this. This TypeError has bothered me for a long time.

KovenYu commented 5 years ago

You are welcome :)

KovenYu / MAR

continue: checkpoint resume error #20