BlackSamorez / tensor_parallel

Automatically split your PyTorch models on multiple GPUs for training & inference
MIT License

RuntimeError: NCCL Error 3: internal error #121

Open smallmocha opened 1 year ago

smallmocha commented 1 year ago

[0] NCCL INFO cudaDriverVersion 11040
[0] NCCL INFO Bootstrap : Using eth0:10.84.253.70<0>
[0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

  File "/usr/local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/generation/utils.py", line 1496, in generate
    model_kwargs,
  File "/usr/local/lib/python3.7/site-packages/transformers/generation/utils.py", line 2528, in sample
    output_hidden_states=output_hidden_states,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensor_parallel/pretrained_model.py", line 76, in forward
    return self.wrapped_model(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensor_parallel/tensor_parallel.py", line 147, in forward
    self.sharding_manager.synchronize_weights(self.all_cuda)
  File "/usr/local/lib/python3.7/site-packages/tensor_parallel/sharding.py", line 77, in synchronize_weights
    gathered_shards = all_gather(list(self.flat_shards), all_cuda=all_cuda)
  File "/usr/local/lib/python3.7/site-packages/tensor_parallel/cross_device_ops.py", line 58, in all_gather
    return NCCLAllGatherFunction.apply(*tensors)
  File "/usr/local/lib/python3.7/site-packages/tensor_parallel/cross_device_ops.py", line 100, in forward
    nccl.all_gather(inputs, outputs)
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/nccl.py", line 104, in all_gather
    torch._C._nccl_all_gather(inputs, outputs, streams, comms)
RuntimeError: NCCL Error 3: internal error

[0] NCCL INFO Failed to open libibverbs.so[.1]
[0] NCCL INFO NET/Socket : Using [0]eth0:10.84.253.70<0>
[0] NCCL INFO Using network Socket
[1] NCCL INFO Using network Socket

[0] misc/nvmlwrap.cc:63 NCCL WARN Failed to open libnvidia-ml.so.1
[0] NCCL INFO misc/nvmlwrap.cc:179 -> 2
[1] NCCL INFO misc/nvmlwrap.cc:179 -> 2

[0] graph/xml.cc:634 NCCL WARN No NVML device handle. Skipping nvlink detection.

[1] graph/xml.cc:634 NCCL WARN No NVML device handle. Skipping nvlink detection.
[0] NCCL INFO misc/nvmlwrap.cc:179 -> 2
[1] NCCL INFO misc/nvmlwrap.cc:179 -> 2

[0] graph/xml.cc:634 NCCL WARN No NVML device handle. Skipping nvlink detection.

[1] graph/xml.cc:634 NCCL WARN No NVML device handle. Skipping nvlink detection.
[0] NCCL INFO graph/paths.cc:308 -> 2
[0] NCCL INFO graph/paths.cc:532 -> 2
[0] NCCL INFO init.cc:535 -> 2
[0] NCCL INFO init.cc:1089 -> 2
[0] NCCL INFO group.cc:64 -> 2 [Async thread]
[1] NCCL INFO graph/paths.cc:308 -> 2
[1] NCCL INFO graph/paths.cc:532 -> 2
[1] NCCL INFO init.cc:535 -> 2
[1] NCCL INFO init.cc:1089 -> 2
[1] NCCL INFO group.cc:64 -> 2 [Async thread]
[0] NCCL INFO group.cc:421 -> 3
[0] NCCL INFO group.cc:106 -> 3
[0] NCCL INFO init.cc:1226 -> 3
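The warnings above show NCCL failing to open libnvidia-ml.so.1 (NVML) and libibverbs.so before the all_gather errors out, which typically points to driver libraries not being visible in the environment (e.g. a container started without them mounted). A quick diagnostic sketch, not part of the original report (`check_nccl_deps` is a hypothetical helper name), to confirm whether those libraries are loadable where the failure happens:

```python
import ctypes

def check_nccl_deps(libs=("libnvidia-ml.so.1", "libibverbs.so.1")):
    """Try to dlopen the shared libraries NCCL warned about.

    Returns a dict mapping library name -> True if it loads, False otherwise.
    """
    results = {}
    for name in libs:
        try:
            ctypes.CDLL(name)  # same mechanism NCCL uses (dlopen) to find the lib
            results[name] = True
        except OSError:
            results[name] = False
    return results

if __name__ == "__main__":
    for lib, ok in check_nccl_deps().items():
        print(f"{lib}: {'found' if ok else 'MISSING'}")
```

If libnvidia-ml.so.1 reports MISSING inside the container but exists on the host, restarting the container with the NVIDIA runtime (or mounting the driver libraries) is a plausible fix to try before digging further into NCCL itself.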

GeneZC commented 6 months ago

Also facing this issue; any updates?