NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL stuck when using nccl-test. #1289

Open deepzzz123 opened 5 months ago

deepzzz123 commented 5 months ago

Hi, developers,

I'm running into a hang while using nccl-tests. Details: I have followed all the steps from https://github.com/NVIDIA/nccl, but when I run "./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4", the output stalls for a long time and nothing further is printed, as shown in the attached screenshot (stuck output).

The GPUs also sit at 100% utilization the whole time (screenshot: 100% GPU utilization).

However, when I run a single-GPU test with "./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1", the nccl-tests output is fine (screenshot: 1-GPU run OK).
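
For reference, the reproduction is roughly the following (a sketch assuming the standard NVIDIA/nccl-tests build; CUDA_HOME and NCCL_HOME are placeholder paths to adjust for your system):

# build the nccl-tests benchmarks
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr

# 4-GPU run: hangs with 100% GPU utilization
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

# 1-GPU run: completes normally
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1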

The environment info:

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.26.6
Libc version: glibc-2.35

Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
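
(For what it's worth, this summary looks like the output of PyTorch's environment collection script, which can be regenerated with the command below; that it was gathered this way is an assumption.)

python -m torch.utils.collect_env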

I also tried reinstalling the NVIDIA driver and CUDA, but the issue still exists. This problem has puzzled me for a long time; could anyone give me a hint or some advice on how to debug/solve it?

kiskra-nvidia commented 5 months ago

Please rerun it with the debug info enabled:

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

or maybe even:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

(the latter will generate more output, but hopefully still within reason).

Redirect the output to a file and attach the file to this bug.

Out of curiosity, is it also getting stuck with just 2 GPUs? That should cut the output file size in half...
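
For example, one way to capture everything in a file while still watching it on screen (a sketch; the log file name is arbitrary, and -g 2 limits the run to 2 GPUs):

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 2>&1 | tee nccl_debug_2gpu.log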

deepzzz123 commented 5 months ago

Thanks for your reply. I ran "NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4" and captured the debug info, but it's hard for me to figure out what it means; could you take a look and share your analysis? Log attached: test-4gpus.log

I also tried with 2 GPUs, but it hits the same issue as with 4 GPUs. The debug info: test-2gpus.log


kiskra-nvidia commented 5 months ago

Thank you for providing the log files. I don't see any fatal errors in them; it looks as if the hang happens right after the GPUs start communicating with each other. That makes me wonder whether you might be dealing with some node misconfiguration. Does peer-to-peer communication between the GPUs work on this node at all? The NCCL troubleshooting section shows examples of using p2pBandwidthLatencyTest and nvbandwidth to verify that communication between GPUs works as expected; I suggest that you try them.
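
For reference, a rough sketch of how those checks could be run (assuming the NVIDIA/cuda-samples and NVIDIA/nvbandwidth repositories; exact sample paths and build steps may differ between CUDA versions):

# peer-to-peer bandwidth/latency check from cuda-samples
# (sample location/build system varies by cuda-samples release)
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest

# nvbandwidth, built with CMake; running it with no arguments runs all test cases
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth && cmake . && make
./nvbandwidth

# also worth a look: the GPU interconnect topology as the driver sees it
nvidia-smi topo -m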