astramind-ai / BitMat

An efficient implementation of the method proposed in "The Era of 1-bit LLMs"
Apache License 2.0

does not work w/ MultiGPU #11

Open winglian opened 7 months ago

winglian commented 7 months ago

```
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
rank0:   return self._call_impl(*args, **kwargs)
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
rank0:   return forward_call(*args, **kwargs)
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1519, in forward
rank0:   inputs, kwargs = self._pre_forward(inputs, kwargs)
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1420, in _pre_forward
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 2051, in _sync_buffers
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 2055, in _sync_module_buffers
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 2077, in _default_broadcast_coalesced
rank0:   self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1992, in _distributed_broadcast_coalesced
rank0: RuntimeError: !tensors.empty() INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/torch/csrc/distributed/c10d/reducer.cpp":2089, please report a bug to PyTorch.
```

This only seems to work on a single GPU, and it converges much more slowly than a less optimized implementation of the BitLinear module I've tried.

[Screenshot attached: 2024-04-15, 2:28 PM]
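
The traceback points at DDP's buffer synchronization path (`_sync_buffers` → `_distributed_broadcast_coalesced`), so one thing that may be worth testing is disabling buffer broadcasting when wrapping the model. The sketch below is only a hypothesis, not a confirmed fix: `build_bitmat_model()` is a placeholder for however the BitMat-converted model is actually constructed, and the workaround assumes the model has no buffers that must stay synchronized across ranks.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with torchrun so LOCAL_RANK and the process-group env vars are set.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# build_bitmat_model() is a hypothetical placeholder for however the
# BitLinear-based model is built/converted; it is not part of the BitMat API.
model = build_bitmat_model().cuda()

# broadcast_buffers=False skips the _sync_buffers path where the
# "!tensors.empty()" assert fires, at the cost of not keeping module
# buffers identical across ranks.
ddp_model = DDP(model, device_ids=[local_rank], broadcast_buffers=False)
```

If the crash disappears with `broadcast_buffers=False`, that would narrow the problem down to a buffer the converted modules register that DDP cannot bucket and broadcast.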
mlinmg commented 7 months ago

As mentioned in issue #12, we are in the process of developing the kernel in pure CUDA due to the instability of the current implementation. This is indeed a peculiar result; it would be very helpful if you could provide more information or the reference implementation you are referring to. We don't use the kernel in training, so I suspect there might be an issue with the manual backpropagation implementation.
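
For comparison, a straight-through-estimator BitLinear in the spirit of the b1.58 training recipe usually looks roughly like the sketch below (absmean ternary weights, per-token absmax 8-bit activations, gradients passed through via the detach trick). This is not BitMat's code, just a minimal reference point under those assumptions for checking a manual backward pass against plain autograd.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # absmean quantization to {-1, 0, +1} (ternary / 1.58-bit weights)
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # per-token absmax quantization to 8 bits
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class BitLinearSTE(nn.Linear):
    """Reference BitLinear: quantized forward, full-precision gradients."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: the (quant - raw).detach() trick makes
        # the forward pass use quantized values while autograd sees identity.
        x_q = x + (activation_quant(x) - x).detach()
        w_q = w + (weight_quant(w) - w).detach()
        return F.linear(x_q, w_q, self.bias)
```

Comparing gradients from a block like this against the fused path on the same random inputs would quickly confirm or rule out the manual backpropagation as the source of the slower convergence.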