BaguaSys / bagua

Bagua Speeds up PyTorch
https://tutorials-8ro.pages.dev/
MIT License
872 stars 83 forks

runtime error on pytorch 1.10 #337

saintazunya closed 2 years ago

saintazunya commented 2 years ago

Describe the bug A clear and concise description of what the bug is.

Environment

Reproducing

Please provide a minimal working example. This means the runnable code.

Please also write what exact commands are required to reproduce your results.

Just run the benchmark script from the Bagua examples (examples/benchmark/synthetic_benchmark.py).

Additional context

Traceback (most recent call last):
  File "/io/bagua/bagua/examples/benchmark/synthetic_benchmark.py", line 154, in <module>
    model = model.with_bagua([optimizer], algorithm)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 396, in with_bagua
    self._bagua_init_algorithm()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 441, in _bagua_init_algorithm
    self._bagua_broadcast_parameters()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 213, in _bagua_broadcast_parameters
    broadcast(state, src=0)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/communication.py", line 523, in broadcast
    comm.broadcast(tensor.to_bagua_tensor().bagua_backend_tensor(), src)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/tensor.py", line 79, in to_bagua_tensor
    new_tensor = torch.Tensor(cdata=self._cdata)
RuntimeError: Creating a new Tensor subclass Tensor but the raw Tensor object is already associated to a python object of type Parameter
Killing subprocess 764

NOBLES5E commented 2 years ago

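It seems that PyTorch 1.10 refuses to create a new tensor from a cdata pointer when the underlying tensor is already associated with another Python object (here, a Parameter).

This will be fixed in the next release.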
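
For readers hitting the same error, here is a minimal sketch of one way to get a plain `torch.Tensor` that shares storage with a `Parameter` without going through `torch.Tensor(cdata=...)`. It is not necessarily how Bagua fixed the problem. The helper name `plain_tensor_view` is hypothetical, and `fill_` only stands in for a collective such as broadcast that writes received values in place.

```python
import torch
from torch import nn


def plain_tensor_view(param: nn.Parameter) -> torch.Tensor:
    """Return a plain torch.Tensor sharing storage with `param`.

    Hypothetical helper for illustration only. `detach()` creates a new
    Python tensor object on the same storage, so no second Python wrapper
    is attached to the Parameter's existing raw tensor object.
    """
    return param.detach()


param = nn.Parameter(torch.zeros(4))
view = plain_tensor_view(param)

# Stand-in for an in-place collective (e.g. a broadcast from rank 0):
# writing into the view also updates the parameter's data.
view.fill_(1.0)

print(type(view) is torch.Tensor)           # plain Tensor, not Parameter
print(view.data_ptr() == param.data_ptr())  # same underlying memory
print(param.detach().tolist())
```

Because the view and the parameter share memory, any in-place write through the view is visible to the model. This is the property a parameter broadcast relies on.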