NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. #1240

Open amitagh opened 6 months ago

amitagh commented 6 months ago

Getting this error while pretraining Llama 2 on an A100 GPU, using NCCL version 2.19.3. Running on a single VM with a single A100 GPU.
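For reference, a minimal sketch of how the NCCL debug log below can be enabled; setting it inside the script (rather than on the command line) is just one option, and it must happen before the first collective call:

```python
import os

# Surface NCCL's initialization log, as suggested by the
# DistBackendError message ("run with NCCL_DEBUG=INFO for details").
# Must be set before torch.distributed / NCCL initialize.
os.environ["NCCL_DEBUG"] = "INFO"
```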

Spotllm:73025:73025 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
Spotllm:73025:73025 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
Spotllm:73025:73025 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda12.3
Spotllm:73025:73112 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
Spotllm:73025:73112 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
Spotllm:73025:73112 [0] NCCL INFO Using non-device net plugin version 0
Spotllm:73025:73112 [0] NCCL INFO Using network Socket
Spotllm:73025:73112 [0] NCCL INFO comm 0x2012b5e0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x151ada46fe52b960 - Init START
Spotllm:73025:73112 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/microsoft/ncv4/topo.xml
Spotllm:73025:73112 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
Spotllm:73025:73112 [0] NCCL INFO NCCL_GRAPH_FILE set by environment to /opt/microsoft/ncv4/graph.xml

Spotllm:73025:73112 [0] graph/search.cc:719 NCCL WARN XML Import Channel : dev 1 not found.
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:749 -> 2
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:756 -> 2
Spotllm:73025:73112 [0] NCCL INFO graph/search.cc:873 -> 2
Spotllm:73025:73112 [0] NCCL INFO init.cc:921 -> 2
Spotllm:73025:73112 [0] NCCL INFO init.cc:1396 -> 2
Spotllm:73025:73112 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
Spotllm:73025:73025 [0] NCCL INFO group.cc:418 -> 2
Spotllm:73025:73025 [0] NCCL INFO group.cc:95 -> 2
Traceback (most recent call last):
  File "run_clm_with_peft.py", line 937, in <module>
    main()
  File "run_clm_with_peft.py", line 899, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/azureuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/azureuser/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1933, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/azureuser/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1255, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/azureuser/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: XML Import Channel : dev 1 not found.
Spotllm:73025:73025 [0] NCCL INFO comm 0x2012b5e0 rank 0 nranks 1 cudaDev 0 busId 100000 - Abort COMPLETE
[2024-03-29 17:38:19,073] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 73025) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/azureuser/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/azureuser/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
run_clm_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-29_17:38:19
  host      : spotllm.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 73025)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

azureuser@Spotllm:/dev/shm/llm/pretrain$
Remote side unexpectedly closed network connection
sjeaugey commented 6 months ago

This is the problem:

Spotllm:73025:73112 [0] NCCL INFO NCCL_GRAPH_FILE set by environment to /opt/microsoft/ncv4/graph.xml

Your NCCL_GRAPH_FILE parameter (set as an environment variable, or in /etc/nccl.conf or ~/.nccl.conf) points to /opt/microsoft/ncv4/graph.xml. This parameter should never be set in production. You should unset it.
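A minimal sketch for checking where NCCL_GRAPH_FILE is coming from (the two file paths are NCCL's standard config locations, not specific to this machine):

```python
import os
from pathlib import Path

# 1. The process environment, which torchrun-launched workers inherit.
print("environment:", os.environ.get("NCCL_GRAPH_FILE"))

# 2. The two config files NCCL reads: system-wide, then per-user.
for conf in (Path("/etc/nccl.conf"), Path.home() / ".nccl.conf"):
    if conf.is_file():
        for line in conf.read_text().splitlines():
            if line.strip().startswith("NCCL_GRAPH_FILE"):
                print(f"{conf}: {line.strip()}")
```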

amitagh commented 6 months ago

Thanks @sjeaugey. I didn't set it; it was there as part of the default torch package installation. Shall I unset it and try again?

sjeaugey commented 6 months ago

If it is set in the environment, you should unset it. If it is in /etc/nccl.conf or ~/.nccl.conf, you should remove it.
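For the environment-variable case, a minimal sketch of dropping the override before NCCL initializes (placing it at the top of the training script is an assumption; any point before distributed init works):

```python
import os

# Remove the inherited override before torch.distributed / DeepSpeed
# create the NCCL communicator, which is when NCCL reads its env vars.
os.environ.pop("NCCL_GRAPH_FILE", None)  # no-op if the variable is absent
```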

amitagh commented 6 months ago

Thank you @sjeaugey, it is solved with your fix. I no longer see the NCCL error I was getting before. Thank you very much.