Closed geyutang closed 5 years ago
@geyutang Yeah, that was for convenience: saving the whole trainer object is an easy hack for capturing all runtime temporary variables and buffers.
Have you tried resuming the model from "checkpoint.pth" and restarting training? I got this error (TypeError: can't pickle _thread.lock objects), and I could not find any way to solve it other than replacing your trainer-object save function with one that passes model.state_dict() to torch.save. Is it necessary to do this?
@geyutang Yes, it should be okay as long as you did not change your device (a few warnings might pop up, though). I did not make my code device-agnostic, so if you used 4 GPUs for the first few epochs and saved the trainer object, you also need 4 GPUs to load it: torch.save(trainer) literally saves everything within the trainer object, including the device config. I guess you changed the device config, which is why loading might be a problem.
But of course, saving only the weights, i.e. passing model.state_dict() to torch.save, would be more generalizable.
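For anyone landing here later, a minimal sketch of the weights-only approach mentioned above. This is not the repository's own save function: the helper names, the dictionary keys, and the toy `nn.Linear` model are all made up for illustration, and it assumes the model and optimizer are the only state you need to resume.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(model, optimizer, epoch, save_path):
    """Save only tensors and plain numbers, not the trainer object itself."""
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        save_path,
    )


def load_checkpoint(model, optimizer, load_path, device="cpu"):
    """Restore weights and optimizer state; map_location remaps saved tensors."""
    checkpoint = torch.load(load_path, map_location=device)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["epoch"]


# Toy stand-ins for the real network and optimizer (hypothetical).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
save_checkpoint(model, optimizer, epoch=20, save_path=path)

# A fresh model/optimizer pair picks up where the saved one stopped.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1, momentum=0.9)
start_epoch = load_checkpoint(resumed_model, resumed_optimizer, path) + 1
```

Because the file holds only tensors and numbers, there is no lock to pickle, and `map_location` lets a checkpoint written on one device layout be read on another. One caveat: if the model was wrapped in `nn.DataParallel` when saved, the keys in its `state_dict()` carry a `module.` prefix, so the model should be wrapped the same way before calling `load_state_dict`.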
ok, thanks!
I tried to load the checkpoint and resume the trainer to continue training from the saved 20th epoch up to the 30th epoch. Then the error shows up.
Traceback (most recent call last):
  File "/home/xxx/project/MAR/src/main.py", line 48, in <module>
    main()
  File "/home/xxx/project/MAR/src/main.py", line 37, in main
    meters_trn = trainer.train_epoch(source_loader, target_loader, epoch)
  File "/home/xxx/project/MAR/src/trainers.py", line 154, in train_epoch
    save_checkpoint(self, epoch, os.path.join(self.args.save_path, "checkpoints.pth"))
  File "/home/xxx/project/MAR/src/utils.py", line 446, in save_checkpoint
    torch.save((trainer, epoch), save_path)
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 224, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 149, in _with_file_like
    return body(f)
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 224, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/xxx/.local/lib/python3.6/site-packages/torch/serialization.py", line 297, in _save
    pickler.dump(obj)
TypeError: can't pickle _thread.lock objects
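The last frame shows that the failure comes from pickle itself: `torch.save((trainer, epoch), ...)` pickles the whole trainer, and something reachable from it holds a thread lock. The thread does not say which attribute that is, so here is a rough, standard-library-only sketch for narrowing it down. `ToyTrainer` and `find_unpicklable` are hypothetical names, and the check only looks at top-level attributes.

```python
import pickle
import threading


class ToyTrainer:
    """Hypothetical stand-in for a trainer that holds a lock somewhere."""

    def __init__(self):
        self.epoch = 20
        self.weights = [0.1, 0.2, 0.3]
        self.lock = threading.Lock()  # stands in for whatever holds the real lock


def find_unpicklable(obj):
    """Return names of top-level attributes that pickle refuses to serialize."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return bad


trainer = ToyTrainer()

# Pickling the whole object fails as soon as pickle reaches the lock.
try:
    pickle.dumps(trainer)
    whole_object_pickles = True
except TypeError:
    whole_object_pickles = False

offenders = find_unpicklable(trainer)
print(whole_object_pickles, offenders)  # prints: False ['lock']
```

Running the same helper on the real trainer object right before the save call should point at the attribute that needs to be dropped or rebuilt on resume. Newer Python versions word the message slightly differently, but it is still a `TypeError`.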
Is it that you save the entire trainer object to 'checkpoint.pth', rather than saving model.state_dict() and the epoch? I find that saving model.state_dict() is the typical way to resume a model.