NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL hangs with NCCL_P2P_USE_CUDA_MEMCPY=1 in PyTorch #1509

Open adofirst2018 opened 2 weeks ago

adofirst2018 commented 2 weeks ago

Hi, NCCL version: v2.21.5

When I set NCCL_P2P_USE_CUDA_MEMCPY=1 and train a ResNet model using PyTorch on two GPUs in the same NUMA node, NCCL hangs and PyTorch crashes with a timeout. PyTorch error:

```
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=42516042, NumelOut=42516042, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 7, last enqueued NCCL work: 7, last completed NCCL work: 6.

[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=42516042, NumelOut=42516042, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.

Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7f4a66d98e89 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)

frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7f4a03482121 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)

frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4a034894e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)

frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7f4a0348a3ff in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)

frame #4: + 0xdc253 (0x7f4a668b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)

frame #5: + 0x94ac3 (0x7f4a727eaac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

frame #6: + 0x126850 (0x7f4a7287c850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'

```

I tried to debug and found the hang in p2pSendProxyProgress, with sub->transmitted=7 and sub->done=0. I think the problem is that the cudaMemcpyAsync has still not finished. Why would cudaMemcpyAsync not finish? I tried to write a demo but could not reproduce the problem there. @sjeaugey, could you give me some advice? Thanks.

Attachment: mainCIFAR10.txt
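For reference, the training loop has roughly the shape of the sketch below. This is a hypothetical minimal version, not the attached mainCIFAR10.txt; the model, batch size, and launch command are assumptions.

```python
# Hypothetical minimal sketch (NOT the attached mainCIFAR10.txt) of a two-GPU DDP run.
# Launch example (assumed): NCCL_P2P_USE_CUDA_MEMCPY=1 torchrun --nproc_per_node=2 repro.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # NCCL env vars must already be set at this point

    model = torchvision.models.resnet50(num_classes=10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(20):
        x = torch.randn(32, 3, 224, 224, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()  # gradient allreduce happens here -- where the watchdog reports the timeout
        opt.step()
        if dist.get_rank() == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```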

sjeaugey commented 2 weeks ago

> @sjeaugey, could you give me some advice? Thanks.

Do not set NCCL_P2P_USE_CUDA_MEMCPY=1. More generally, do not set environment variables not listed in the system configuration section unless you really know what you're doing.

adofirst2018 commented 2 weeks ago

> > @sjeaugey, could you give me some advice? Thanks.
>
> Do not set NCCL_P2P_USE_CUDA_MEMCPY=1. More generally, do not set environment variables not listed in the system configuration section unless you really know what you're doing.

I want to use cudaMemcpy for higher performance. How should I debug this issue?

sjeaugey commented 2 weeks ago

> I want to use cudaMemcpy for higher performance. How should I debug this issue?

Why do you think setting NCCL_P2P_USE_CUDA_MEMCPY=1 is going to increase performance?

adofirst2018 commented 2 weeks ago

> > I want to use cudaMemcpy for higher performance. How should I debug this issue?
>
> Why do you think setting NCCL_P2P_USE_CUDA_MEMCPY=1 is going to increase performance?

I tested NCCL on 8 GPUs (L20) with nccl-tests. If I set NCCL_P2P_USE_CUDA_MEMCPY=1, NCCL_SHM_USE_CUDA_MEMCPY=1, and NCCL_SHM_MEMCPY_MODE=1, I get higher bandwidth. Also, with those variables set, nccl-tests works fine, but PyTorch hangs in NCCL.
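(For reference, a sketch of how these variables are applied on the PyTorch side, assuming they are placed in the environment before the first NCCL communicator is created, since NCCL reads them at communicator init; exporting them in the launch shell works the same way:)

```python
import os

# Sketch: these assignments must run before torch.distributed.init_process_group(backend="nccl")
# and before the first collective, because NCCL reads them when it creates the communicator.
os.environ["NCCL_P2P_USE_CUDA_MEMCPY"] = "1"
os.environ["NCCL_SHM_USE_CUDA_MEMCPY"] = "1"
os.environ["NCCL_SHM_MEMCPY_MODE"] = "1"
```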

sjeaugey commented 2 weeks ago

In general, I would advise against setting NCCL_P2P_USE_CUDA_MEMCPY=1. It will increase the base latency very significantly and can cause hangs (as you experienced). Now, if you find that it makes a significant performance difference (not just 10%, more like 2x) and large sizes are what your application really relies on, then you can try setting it together with NCCL_CREATE_THREAD_CONTEXT=1. It is not 100% guaranteed to work, though.
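If someone does want to experiment with that combination, a minimal sketch follows, assuming (as above) that the variables are set before the first NCCL communicator is created; per the comment above, it is not guaranteed to avoid the hang.

```python
import os

# Combination suggested above -- experimental, not guaranteed to work.
os.environ["NCCL_P2P_USE_CUDA_MEMCPY"] = "1"    # route intra-node P2P copies through cudaMemcpyAsync
os.environ["NCCL_CREATE_THREAD_CONTEXT"] = "1"  # let NCCL's internal threads use their own CUDA context

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # the env vars above take effect when the NCCL comm is created
```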