NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL timeout issue #1409

Open wd255 opened 1 month ago

wd255 commented 1 month ago

Hi, I'm having this issue: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078...) ran for 600026 milliseconds before timing out

The code I'm running is a VQGAN training script; parallelism is handled with accelerate. The issue happens at the end of each epoch. We believe it's a problem with the environment setup, as the same code runs on another machine without any problems.

Environment: Ubuntu 20

nvcc:

    NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2024 NVIDIA Corporation
    Built on Fri_Jun_14_16:34:21_PDT_2024
    Cuda compilation tools, release 12.6, V12.6.20
    Build cuda_12.6.r12.6/compiler.34431801_0

full log

    [rank7]:[E816 01:45:40.183009389 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
    [rank7]:[E816 01:45:40.184373931 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80078, last completed NCCL work: 80077.
    [rank3]:[E816 01:45:40.212150935 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
    [rank9]:[E816 01:45:40.212150385 ProcessGroupNCCL.cpp:607] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
    [rank9]:[E816 01:45:40.212958430 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 9] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
    [rank3]:[E816 01:45:40.212958690 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
    [rank8]:[E816 01:45:40.265372190 ProcessGroupNCCL.cpp:607] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
    [rank8]:[E816 01:45:40.266489823 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 8] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
    [rank1]:[E816 01:45:40.278487367 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
    [rank1]:[E816 01:45:40.279310302 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
    [rank5]:[E816 01:45:40.285038686 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
    [rank5]:[E816 01:45:40.286293758 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
    [rank2]:[E816 01:45:40.292277610 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
    [rank0]:[E816 01:45:40.292277580 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600090 milliseconds before timing out.
    [rank0]:[E816 01:45:40.293076425 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
    [rank2]:[E816 01:45:40.293076395 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
    [rank6]:[E816 01:45:40.296233705 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
    [rank6]:[E816 01:45:40.298295572 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
      0%|                                                                                                                                                 | 0/8008 [10:00<?, ?it/s]
    [rank7]: Traceback (most recent call last):
    [rank7]:   File "/home/duanwei/ML/magvit_imagenet.py", line 480, in <module>
    [rank7]:     for batch_idx, (real, _) in zip(pbar, dataloader):
    [rank7]:   File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/data_loader.py", line 447, in __iter__
    [rank7]:     synchronize_rng_states(self.rng_types, self.synchronized_generator)
    [rank7]:   File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/utils/random.py", line 132, in synchronize_rng_states
    [rank7]:     synchronize_rng_state(RNGType(rng_type), generator=generator)
    [rank7]:   File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/utils/random.py", line 127, in synchronize_rng_state
    [rank7]:     generator.set_state(rng_state)
    [rank7]: RuntimeError: Invalid mt19937 state
    [rank7]:[E816 01:45:41.635484056 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 7] Timeout at NCCL work: 80078, last enqueued NCCL work: 80078, last completed NCCL work: 80077.
    [rank7]:[E816 01:45:41.635510825 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
    [rank7]:[E816 01:45:41.635516125 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
    [rank7]:[E816 01:45:41.636462549 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
    Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8b9a40e0b2 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8b9a414af3 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8b9a416edc in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6)
    frame #5: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #6: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6)

    terminate called after throwing an instance of 'c10::DistBackendError'
      what():  [PG 0 (default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
    Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so)
    frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8b9a40e0b2 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8b9a414af3 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8b9a416edc in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6)
    frame #5: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #6: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6)

    Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so)
    frame #1: <unknown function> + 0xe3ec34 (0x7f8b9a096c34 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6)
    frame #3: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #4: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6)

    W0816 01:45:41.681000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340257 closing signal SIGTERM
    W0816 01:45:41.682000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340258 closing signal SIGTERM
    W0816 01:45:41.682000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340259 closing signal SIGTERM
    W0816 01:45:41.683000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340260 closing signal SIGTERM
    W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340261 closing signal SIGTERM
    W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340262 closing signal SIGTERM
    W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340263 closing signal SIGTERM
    W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340265 closing signal SIGTERM
    W0816 01:45:41.685000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340266 closing signal SIGTERM
    E0816 01:45:43.466000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 7 (pid: 340264) of binary: /root/miniconda3/envs/magvit/bin/python3.12
    Traceback (most recent call last):
      File "/root/miniconda3/envs/magvit/bin/accelerate", line 10, in <module>
        sys.exit(main())
                 ^^^^^^
      File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
        args.func(args)
      File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
        multi_gpu_launcher(args)
      File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
        distrib_run.run(args)
      File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
        elastic_launch(
      File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    =======================================================
    magvit_imagenet.py FAILED
    -------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    -------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2024-08-16_01:45:41
      host      : kansas3
      rank      : 7 (local_rank: 7)
      exitcode  : -6 (pid: 340264)
      error_file: <N/A>
      traceback : Signal 6 (SIGABRT) received by PID 340264
    =======================================================

Can anyone help me with it?

kiskra-nvidia commented 1 month ago

Most of the messages you included come from PyTorch; as far as I can see, none of them are generated by NCCL. PyTorch simply reports that a NCCL collective operation has timed out. That could happen for any number of reasons: the timeout being too short, an application error, a bug in NCCL, system misconfiguration, or hardware issues.

You may want to try running your app with the NCCL_DEBUG=WARN environment variable set -- that way if NCCL encounters an issue, it will print an error message. But, given that we don't see PyTorch reporting any errors from NCCL, I wouldn't count on getting anything that way.
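
For reference, a minimal sketch of one way to set this when launching through accelerate: export the variable in the shell before `accelerate launch`, or set it at the very top of the training script before the process group is created (the assumption being that NCCL reads the variable when its communicators are initialized; the `NCCL_DEBUG_SUBSYS` line is an optional extra, not something the comment above asks for):

    # Minimal sketch: set NCCL debug env vars before the Accelerator /
    # process group is created, so NCCL sees them at communicator init.
    import os

    os.environ.setdefault("NCCL_DEBUG", "WARN")        # print NCCL warnings/errors
    os.environ.setdefault("NCCL_DEBUG_SUBSYS", "ALL")  # optional: more verbose per-subsystem logging

    from accelerate import Accelerator  # imported after the env vars are set

    accelerator = Accelerator()
    # ... rest of the training script ...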

Does NCCL run on that system at all? E.g., have you been able to run some of the NCCL tests like all_reduce_perf to a successful completion?
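
As a rough Python-side sanity check (a minimal sketch using torch.distributed directly, not the nccl-tests all_reduce_perf binary referred to above; the script name and tensor size are illustrative):

    # nccl_smoke_test.py -- minimal NCCL all_reduce check via torch.distributed.
    # Launch with, e.g.:  torchrun --nproc_per_node=<num_gpus> nccl_smoke_test.py
    import os
    import torch
    import torch.distributed as dist

    def main():
        # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")

        # Each rank contributes a tensor of ones; after all_reduce every rank
        # should hold world_size in every element.
        x = torch.ones(1 << 20, device="cuda")
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        torch.cuda.synchronize()

        assert torch.allclose(x, torch.full_like(x, dist.get_world_size()))
        print(f"rank {dist.get_rank()}: all_reduce OK")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()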

wd255 commented 1 month ago

Thanks! I increased the timeout and will give it another try with NCCL_DEBUG=WARN. I'll also run the NCCL tests. It's a fresh machine where NCCL was newly installed, so maybe there's something wrong with the NCCL setup?
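
For context, a minimal sketch of how the timeout can be raised when using accelerate, via its `InitProcessGroupKwargs` handler (the exact value used here, and whether this matches the change actually made, are assumptions):

    # Minimal sketch: raise the collective timeout that PyTorch's watchdog
    # enforces (default 600000 ms) by passing a larger timeout to the
    # process group through accelerate.
    from datetime import timedelta

    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))  # 2 h is arbitrary
    accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])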