Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Model trained by AWC style cannot be saved #103

Open kunyuan827 opened 2 years ago

kunyuan827 commented 2 years ago

A model trained with the neighbor_allreduce optimizer and the AWC communication style cannot be saved with torch.save(). See the error output below.

The cause may be the _make_hook function in optimizer.py.

==================
  File "XXX.py", line 227, in <module>
    torch.save(model, tmp_path)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 328, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 401, in _legacy_save
    pickler.dump(obj)
AttributeError: Can't pickle local object '_DistributedReduceOptimizer._make_hook.<locals>.hook'
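
For anyone trying to reproduce the failure mode without bluefog installed, here is a minimal sketch using only the standard library. It assumes the hook is a function defined inside another function, which is what the error message suggests; `make_hook`, `model_like`, and `try_pickle` are hypothetical names for illustration, not bluefog code. It shows that pickling an object that holds such a local function fails, while pickling only the plain data succeeds.

```python
import pickle


def make_hook(name):
    # Hypothetical stand-in for the pattern named in the error message:
    # a function defined inside another function (a "local object").
    def hook(*unused):
        return name

    return hook


def try_pickle(obj):
    # Return True if obj can be pickled, False if pickle rejects it.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


# Hypothetical "model": plain data plus a registered local hook.
model_like = {"weights": [0.1, 0.2], "hooks": [make_hook("weights")]}

# Pickling the whole object fails because of the local hook function.
whole_ok = try_pickle(model_like)

# Pickling only the data (analogous to saving just the parameters) works.
state_only = {k: v for k, v in model_like.items() if k != "hooks"}
state_ok = try_pickle(state_only)

print("whole object picklable:", whole_ok)
print("state only picklable:", state_ok)
```

If the same reasoning applies here, a possible workaround is saving only the parameters with torch.save(model.state_dict(), tmp_path) instead of the whole model object. This has not been verified against bluefog.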