Checkpoint resume error

geyutang commented 5 years ago

I tried to load the checkpoint and resume the trainer to continue training from the saved 20th epoch to the 30th epoch. The following error then showed up.

Traceback (most recent call last):
  File "/home/xxx/project/MAR/src/main.py", line 48, in <module>
    main()
  File "/home/xxx/project/MAR/src/main.py", line 37, in main
    meters_trn = trainer.train_epoch(source_loader, target_loader, epoch)
  File "/home/xxx/project/MAR/src/trainers.py", line 154, in train_epoch
    save_checkpoint(self, epoch, os.path.join(self.args.save_path, "checkpoints.pth"))
  File "/home/xxx/project/MAR/src/utils.py", line 446, in save_checkpoint
    torch.save((trainer, epoch), save_path)
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 224, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 149, in _with_file_like
    return body(f)
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 224, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 297, in _save
    pickler.dump(obj)
TypeError: can't pickle _thread.lock objects

Is this because you save the entire trainer object to 'checkpoint.pth', rather than saving model.state_dict() and the epoch? As far as I know, saving model.state_dict() is the typical way to resume a model.

KovenYu commented 5 years ago

@geyutang Yes, that was for convenience: saving the whole trainer is an easy hack that also captures all runtime temporary variables and buffers.

geyutang commented 5 years ago

Have you tried resuming the model from "checkpoint.pth" and restarting training? I got the above error (TypeError: can't pickle _thread.lock objects), and I could not find any way to solve it other than replacing your function that saves the trainer object with one that calls torch.save on model.state_dict(). Is it necessary to do this?
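For context, this TypeError generally means that some attribute reachable from the pickled object holds a thread lock. Which attribute of the trainer is responsible is not stated in this thread. One possible workaround that keeps whole-object saving is to exclude such members from the pickled state. The sketch below uses only the standard library; `Trainer`, `NaiveTrainer`, `_lock`, and `can_pickle` are hypothetical names for illustration, not code from the MAR repository.

```python
import pickle
import threading


class NaiveTrainer:
    """Hypothetical trainer that holds a lock and defines no pickling hooks."""

    def __init__(self):
        self.epoch = 20
        self._lock = threading.Lock()  # stands in for the unpicklable member


class Trainer:
    """Hypothetical trainer that leaves the lock out of its pickled state."""

    def __init__(self):
        self.epoch = 20
        self.weights = [0.1, 0.2]      # stands in for ordinary picklable state
        self._lock = threading.Lock()  # stands in for the unpicklable member

    def __getstate__(self):
        # Copy the instance dict and drop the member pickle cannot handle.
        state = self.__dict__.copy()
        state.pop("_lock", None)
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


def can_pickle(obj):
    """Return True if pickle can serialize obj, False if it raises TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


# Round-trip the trainer that defines the hooks.
restored = pickle.loads(pickle.dumps(Trainer()))
```

Here `can_pickle(NaiveTrainer())` is False while `can_pickle(Trainer())` is True. The same idea should carry over to `torch.save`, which pickles the object it is given, but finding the offending attribute in the real trainer still has to be done by hand.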

KovenYu commented 5 years ago

@geyutang Yes, it should work as long as you did not change your device setup (a few warnings may appear, though). I did not make my code device-agnostic, so if you used 4 GPUs for the first few epochs and saved the trainer object, you also need 4 GPUs to load it: torch.save(trainer) saves literally everything within the trainer object, including the device configuration. I guess you have changed the device configuration, which is why loading might be a problem.

But of course, saving only the weights with torch.save(model.state_dict(), save_path) would be more portable.
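A minimal sketch of that weights-only approach, assuming a standard PyTorch model and optimizer. The dictionary keys, the `load_checkpoint` helper, and the `nn.Linear` stand-in model are illustrative assumptions, and this `save_checkpoint` has a different signature from the one in the repository's utils.py.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(model, optimizer, epoch, save_path):
    # Store plain state dictionaries plus the epoch counter,
    # instead of pickling the whole trainer object.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        },
        save_path,
    )


def load_checkpoint(model, optimizer, save_path, device="cpu"):
    # map_location remaps the saved tensors onto the requested device,
    # so the checkpoint does not depend on the original GPU layout.
    ckpt = torch.load(save_path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"]


# Usage: save after epoch 20, then resume into freshly built objects.
model = nn.Linear(4, 2)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
path = os.path.join(tempfile.mkdtemp(), "checkpoints.pth")
save_checkpoint(model, optimizer, 20, path)

resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1, momentum=0.9)
start_epoch = load_checkpoint(resumed_model, resumed_optimizer, path) + 1
```

If the model is wrapped in nn.DataParallel, its state_dict keys carry a "module." prefix, so saving model.module.state_dict() instead keeps the checkpoint loadable into an unwrapped model. Runtime buffers that live on the trainer rather than on the model would have to be added to the dictionary explicitly.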

geyutang commented 5 years ago

ok, thanks!

KovenYu / MAR

Checkpoint resume error #19