Multi gpus problem - Githubissues

YisuiTT commented 1 month ago

This work is great, but when I run it on three GPUs with three prompts, I get the following error. How can I fix it?

Rank 1 is running.
Rank 0 is running.
Rank 2 is running.
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 2.27it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 1.92it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 1.99it/s]
0%| | 0/30 [00:00<?, ?it/s]
Found 34 attns
Found 22 convs
Found 34 attns
Found 22 convs
0%| | 0/30 [00:00<?, ?it/s]
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600149 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600152 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600152 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff49360ed87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff4947b66e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff4947b9c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff4947ba839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7ff4de4e0bf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7ff4e0071609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff4dfe3c353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600149 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f56b609bd87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f56b72436e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f56b7246c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f56b7247839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f5700f6dbf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f5702afe609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f57028c9353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600435 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, NumelOut=47185920, Timeout(ms)=600000) ran for 600435 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f80a5aadd87 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f80a6c556e6 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f80a6c58c3d in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f80a6c59839 in /.conda/envs/V_I_vc2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f80f097fbf4 in /.conda/envs/V_I_vc2/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f80f2510609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f80f22db353 in /lib/x86_64-linux-gnu/libc.so.6)
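
For anyone reading the log above, the watchdog lines can be summarized mechanically. The following is only an illustrative sketch and is not part of Video-Infinity: the helper name `summarize_timeouts` and the regular expression are assumptions written against the exact line format shown in this log, and other PyTorch versions may word these messages differently.

```python
import re

# Matches the watchdog timeout message in the format seen in this log.
# (Assumption: other PyTorch builds may format the message differently.)
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Return {rank: details} for every rank that reported a timeout."""
    by_rank = {}
    for m in WATCHDOG.finditer(log_text):
        by_rank[int(m["rank"])] = {
            "seq": int(m["seq"]),
            "op": m["op"],
            "numel_in": int(m["numel_in"]),
            "numel_out": int(m["numel_out"]),
            "timeout_ms": int(m["timeout_ms"]),
            "ran_ms": int(m["ran_ms"]),
        }
    return by_rank


# The three watchdog messages copied from the log in this issue.
sample = "\n".join(
    f"[Rank {rank}] Watchdog caught collective operation timeout: "
    f"WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=15728640, "
    f"NumelOut=47185920, Timeout(ms)=600000) ran for {ran} milliseconds "
    f"before timing out."
    for rank, ran in [(1, 600149), (0, 600152), (2, 600435)]
)

summary = summarize_timeouts(sample)
for rank, info in sorted(summary.items()):
    # Ratio of gathered output elements to per-rank input elements.
    ratio = info["numel_out"] // info["numel_in"]
    print(rank, info["op"], info["seq"], ratio, info["timeout_ms"] // 60000)
```

In this log every rank reports the same collective (SeqNum=1, ALLGATHER), the output element count is exactly 3 times the input (47185920 = 3 × 15728640), which is consistent with an all-gather across three ranks, and each rank waited the full 600000 ms (10 minutes). That pattern suggests the first collective never completed on any rank, rather than one rank failing partway through sampling.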

Yuanshi9815 commented 1 month ago

This seems to be an issue with NVIDIA GPU communication. May I ask a few questions?

1. Did this issue occur specifically when using 3 GPUs? Does it also happen with 2 or 4 GPUs?
2. It might be that the NCCL port is occupied. Could you try changing master_port in the configuration file config.json to see whether that resolves the issue?
3. Does this issue also occur with multiple GPUs and a single prompt?
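
One way to check the port suggestion before editing config.json is to test whether the port can still be bound. This is a minimal sketch using only the Python standard library; the helper names (`port_is_free`, `pick_free_port`, `with_usable_master_port`) are hypothetical, and it assumes `master_port` is a top-level key in the config, which may not match the actual layout of config.json.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose
        return s.getsockname()[1]


def with_usable_master_port(config):
    """Return a copy of config whose master_port can be bound right now.

    Assumption: config is a flat dict with a top-level "master_port" key.
    """
    updated = dict(config)
    port = updated.get("master_port")
    if port is None or not port_is_free(int(port)):
        updated["master_port"] = pick_free_port()
    return updated


if __name__ == "__main__":
    # Simulate an occupied port, then let the helper replace it.
    blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    blocker.bind(("127.0.0.1", 0))
    blocker.listen(1)
    occupied = blocker.getsockname()[1]
    print(with_usable_master_port({"master_port": occupied}))
    blocker.close()
```

The check is only a snapshot: another process could take the port between the check and the launch, so a port that looks free here does not guarantee a successful rendezvous.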

YisuiTT commented 1 month ago

Sorry for the late reply. Thank you for the suggestions, but unfortunately I have tried all of the above and still get NCCL errors.

Yuanshi9815 / Video-Infinity

Multi gpus problem #14