shh2000 opened this issue 6 months ago
Can you provide the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,ENV
for the run on env1 which failed?
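For reference, a minimal sketch (not from the thread) of one way to set the requested debug variables before torch.distributed initializes NCCL, assuming a torchrun-style launch; they can equally well be exported in the shell on every node before launching:

```python
# Sketch only: make sure the NCCL debug variables are set before the NCCL
# communicator is created. Values exported in the environment take priority.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "GRAPH,INIT,ENV")
# Optional: one log file per host/pid instead of interleaved stdout.
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl_debug.%h.%p.log")

import torch
import torch.distributed as dist

# Assumes the usual torchrun-provided RANK / WORLD_SIZE / MASTER_ADDR variables.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group(backend="nccl")
```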
Sure, I'll run with your env vars and provide the logs, thanks.
@sjeaugey here's the log of noderank1 (ranks 8-15): noderank1_log.txt
We've fixed a similar-looking bug in NCCL 2.21; can you try with the latest version?
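A quick way to confirm which NCCL version a given container's PyTorch build ships with (a minimal sketch, not from the thread):

```python
# Sketch only: report the NCCL version bundled with the current PyTorch build,
# to check whether it is already >= 2.21.
import torch

print("PyTorch:", torch.__version__)
print("NCCL   :", ".".join(str(x) for x in torch.cuda.nccl.version()))
```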
@kiskra-nvidia Thanks for the information. We may try ngctorch:2404 or some other way to upgrade to NCCL 2.21+.
By the way, are there any publicly disclosable details about the bug in NCCL 2.20? (Not for solving the problem itself, just out of curiosity about the technology involved.) From what I found, my guess is that 2.20 may have picked wrong MPI paths in some cases (depending on drivers + nnodes + topology)? Looking forward to your reply, thanks!
Actually, I'm not sure upgrading will help. That bug was a mix-up of the connect with the following barrier, and the barrier size was 8 bytes; here all your sizes are larger than 8.
The log you provided only shows one node. Could it be that your environment was not forwarded to the other nodes? That would also explain the crash, as the other node might have a different configuration, resulting in a mismatch and a discrepancy in the sizes we're trying to exchange.
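One quick way to check this hypothesis is to have every rank dump its NCCL-related environment, so a variable that was not forwarded to the second node shows up directly (a sketch, assuming a torchrun-style launch):

```python
# Sketch only: every rank prints its NCCL_* environment variables, tagged with
# its rank and hostname, to verify that all nodes see the same configuration.
import os
import socket

import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
nccl_env = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
print(f"[rank {rank} @ {socket.gethostname()}] {nccl_env}", flush=True)
dist.destroy_process_group()
```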
@sjeaugey Hi, my 3 nodes have the same bare-metal config (8x H100 + 4 activated HDR NICs (8 in total) + 2 CPUs + PCIe 5), with containers run from the same image (ngctorch2403 + Megatron-Core 0.6.0). If your guess is right, could my bug be reproduced by testing P2P between every pair of ranks (C(24,2) = 24*23/2 = 276 cases)? And if all 276 P2P communications are OK, could the bug still show up when a specific set of 5 ranks does an all-reduce? A rough sketch of what I mean is below.
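A rough sketch of such a test: P2P send/recv over every pair of the 24 ranks, followed by an all-reduce restricted to a 5-rank subgroup. This is illustrative only; it assumes a torchrun launch with WORLD_SIZE=24, and the subgroup ranks are an arbitrary example:

```python
# Sketch only: exercise NCCL P2P between every pair of ranks, then an
# all-reduce on a hypothetical 5-rank subgroup.
import itertools
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # torchrun sets LOCAL_RANK

# 1) P2P between every pair of ranks: C(24, 2) = 276 send/recv exchanges.
for src, dst in itertools.combinations(range(world), 2):
    if rank == src:
        dist.send(torch.ones(1 << 20, device="cuda"), dst=dst)
    elif rank == dst:
        buf = torch.empty(1 << 20, device="cuda")
        dist.recv(buf, src=src)
    dist.barrier()  # keep pairs lock-stepped so a failure points at one pair

# 2) All-reduce restricted to a specific 5-rank subgroup.
sub_ranks = [0, 3, 8, 15, 21]            # hypothetical choice of 5 ranks
group = dist.new_group(ranks=sub_ranks)  # must be called by every rank
if rank in sub_ranks:
    t = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(t, group=group)

dist.destroy_process_group()
```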
env
test code