NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

ncclInternalError while fine-tuning using deepspeed #827

Open udhavsethi opened 1 year ago

udhavsethi commented 1 year ago

I am trying to run a training script using deepspeed on 8 32GB V100 GPUs.

For debugging, I enabled the following flags:

NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,GRAPH
NCCL_TOPO_DUMP_FILE=topo.xml
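
(For reference, a minimal sketch of how these variables can end up in the worker processes, assuming they are not already exported in the shell before the launcher runs: set them in Python before torch.distributed / DeepSpeed create the NCCL process group, since NCCL reads them at initialization.)

import os

# Hypothetical alternative to exporting the variables before torchrun:
# set them at the very top of train.py, before any distributed initialization.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")
os.environ.setdefault("NCCL_TOPO_DUMP_FILE", "topo.xml")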

I am running into the following errors:

Traceback (most recent call last):
  File "/root/chat-llm/stanford_alpaca/train.py", line 222, in <module>
    train()
  File "/root/chat-llm/stanford_alpaca/train.py", line 186, in train
    model = transformers.LlamaForCausalLM.from_pretrained(
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2498, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
    f(module, *args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 659, in __init__
    self.model = LlamaModel(config)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
    f(module, *args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 463, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 389, in wrapper
    self._post_init_method(module)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 782, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 81, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1551, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Call to recv from 10.233.121.250<45143> failed : Connection reset by peer
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 36604 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 36605 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 36601) of binary: /root/chat-llm/stanford_alpaca/venv/bin/python3.10
Traceback (most recent call last):
  File "/root/chat-llm/stanford_alpaca/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
-----------------------------------------------------
Failures:
[1]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 36602)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36602
[2]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 36603)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36603
[3]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 5 (local_rank: 5)
  exitcode  : -7 (pid: 36606)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36606
[4]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 6 (local_rank: 6)
  exitcode  : -7 (pid: 36607)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36607
[5]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 7 (local_rank: 7)
  exitcode  : -7 (pid: 36608)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36608
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-18_15:47:13
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 36601)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 36601
=====================================================

Here is my nvcc version:

$nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

and nccl version:

$python -c "import torch;print(torch.cuda.nccl.version())"
(2, 14, 3)

Here is the dumped xml file: topo.xml.txt
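
(If it helps, here is a small sketch that collects the relevant versions in one place; these are standard PyTorch attributes, so it only reports what the environment running train.py actually sees.)

import torch

# Report the versions visible to the Python environment that runs train.py.
print("torch       :", torch.__version__)
print("CUDA (torch):", torch.version.cuda)          # CUDA version PyTorch was built against
print("NCCL (torch):", torch.cuda.nccl.version())   # NCCL version used by PyTorch
print("GPU 0       :", torch.cuda.get_device_name(0))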

Please let me know if I can provide any other information to help identify the source of this issue. I would greatly appreciate any guidance on how to make this work.

sjeaugey commented 1 year ago

It doesn't look like NCCL_DEBUG=INFO was taken into account. We should see NCCL INFO or NCCL WARN messages before the error happens.
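
(One quick way to check this is to have every rank print the variable at startup under the same launcher; a minimal sketch, assuming the gloo backend is available so NCCL itself is not involved in the check:)

import os
import torch.distributed as dist

# Launch with the same torchrun command as train.py; each rank reports what it sees.
dist.init_process_group(backend="gloo")
rank = dist.get_rank()
print(f"rank {rank}: NCCL_DEBUG={os.environ.get('NCCL_DEBUG')!r} "
      f"NCCL_DEBUG_SUBSYS={os.environ.get('NCCL_DEBUG_SUBSYS')!r}", flush=True)
dist.destroy_process_group()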

udhavsethi commented 1 year ago

My bad. Please see the attached log for the full output: log.txt

sjeaugey commented 1 year ago

I could not find anything conclusive. It looks like some ranks are failing (i.e., crashing outside of NCCL), which then causes NCCL to fail to connect to those ranks and report an error. The error reported is ncclInternalError, which I believe is wrong; the code returning that error has changed in recent versions, and perhaps we should not return that error code, since this is not an internal error but a classic remote error where the other side stops responding.

Could you run with a more recent NCCL? Also, could you check how each rank exits? Some ranks error out in NCCL (because they could not connect to other ranks), but other ranks may be exiting differently for some other reason.
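
(To help separate launcher or network issues from the training script itself, something along these lines could be run under the same torchrun command; this is only a sketch of a standalone NCCL connectivity check, not part of train.py:)

import os
import torch
import torch.distributed as dist

# Minimal NCCL smoke test: each rank does one all_reduce and reports how it exits.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

t = torch.ones(1, device="cuda")
dist.all_reduce(t)   # result should equal the world size on every rank
dist.barrier()
print(f"rank {rank}: all_reduce result {t.item()} -> OK", flush=True)
dist.destroy_process_group()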