Open deepzzz123 opened 5 months ago
Please rerun it with the debug info enabled:
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
or maybe even:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
(the latter will generate more output, but hopefully still within reason).
Redirect the output to a file and attach the file to this bug.
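For example, something like the following should capture everything in a single file (the name nccl_debug.log is just a placeholder):
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 2>&1 | tee nccl_debug.log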
Out of curiosity, is it also getting stuck with just 2 GPUs? That should cut the output file size in half...
Thanks for your reply. I ran "NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4" and collected the debug info, but it's hard for me to figure out what it means. Could you take a look at it? test-4gpus.log
I also tried with 2 GPUs, but it hits the same issue as with 4 GPUs. The debug info: test-2gpus.log
Thank you for providing the log files. I don't see any fatal errors in them; it looks as if the hang happens right after the GPUs start communicating with each other. That makes me wonder whether you might be dealing with a node misconfiguration. Does peer-to-peer communication between the GPUs work on this node at all? The NCCL troubleshooting section shows examples of using p2pBandwidthLatencyTest and nvbandwidth to verify that communication between GPUs works as expected; I suggest that you try them.
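For reference, a rough sketch of how to build and run them; the exact repository layout and build steps may differ depending on the versions you check out:
# p2pBandwidthLatencyTest ships with the CUDA samples
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest && make
./p2pBandwidthLatencyTest
# nvbandwidth lives in its own repository and builds with CMake
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth && cmake . && make
./nvbandwidth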
Hi developers,
I'm hitting a hang while running nccl-tests. The details: I followed all the steps from https://github.com/NVIDIA/nccl, but when I run "./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4", the output gets stuck for a long time and nothing else is printed, as shown in the following image:
The GPUs also sit at 100% utilization:
However, when I run a single-GPU test with "./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1", the nccl-tests output is OK.
The environment info:
I also tried reinstalling the NVIDIA driver & CUDA, but the issue still exists. This problem has puzzled me for a long time; could anyone give me a hint or advice on how to debug/solve it?
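In case it's relevant, the GPU interconnect topology on the node can be dumped with:
nvidia-smi topo -m
I can attach that output as well if it helps.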