NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

all_reduce_perf core dumped on 4 L20 #233

Closed songh11 closed 2 weeks ago

songh11 commented 1 month ago

Command: NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

Output:

[screenshot: all_reduce_perf output showing the crash]

Here is my env:

[screenshot: environment details]
songh11 commented 1 month ago

I found that NCCL v2.17.1 works, but v2.22.3 core dumps. I wonder what the reason for this is; I need your help.

sjeaugey commented 1 month ago

Can you analyze the coredump with:

gdb ./build/all_reduce_perf <core file>

Then inside gdb:

bt

That would help us a lot. Thanks!

songh11 commented 1 month ago

> Can you analyze the coredump with:
>
> gdb ./build/all_reduce_perf <core file>
>
> Then inside gdb:
>
> bt
>
> That would help us a lot. Thanks!

Thank you for your reply; this is my output:

Details

(gdb) set args -b 8 -e 128M -f 2 -g 4
(gdb) r
Starting program: /home/sh9/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
[New Thread 0x7fffe09aa000 (LWP 376734)]
# Rank 0 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 0 [0x65] NVIDIA L20
# Rank 1 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 1 [0x67] NVIDIA L20
# Rank 2 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 2 [0x69] NVIDIA L20
# Rank 3 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 3 [0x6b] NVIDIA L20
[New Thread 0x7fffcffff000 (LWP 376735)]
[New Thread 0x7fffcf7fe000 (LWP 376736)]
[New Thread 0x7fffce990000 (LWP 376737)]
[New Thread 0x7fffcdcee000 (LWP 376738)]
[New Thread 0x7fffcd04c000 (LWP 376739)]
[New Thread 0x7fffc1fff000 (LWP 376740)]
[New Thread 0x7fffc17fe000 (LWP 376741)]
[New Thread 0x7fffc0ffd000 (LWP 376742)]
[New Thread 0x7fff91fff000 (LWP 376743)]
[New Thread 0x7fff917fe000 (LWP 376744)]
[New Thread 0x7fff90ffd000 (LWP 376750)]
[Thread 0x7fffc1fff000 (LWP 376740) exited]

Thread 10 "all_reduce_perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc0ffd000 (LWP 376742)]
0x00007fffe97c4161 in addInterStep (system=system@entry=0x7ffefb432990, tx=tx@entry=3, ix=<optimized out>, t1=t1@entry=0, i1=i1@entry=1, t2=t2@entry=0, i2=0) at graph/paths.cc:186
186     for (int i=0; i<cpuNode->paths[t2][i2].count; i++) srcNode->paths[t2][i2].list[l++] = cpuNode->paths[t2][i2].list[i];
(gdb) bt
#0 0x00007fffe97c4161 in addInterStep (system=system@entry=0x7ffefb432990, tx=tx@entry=3, ix=<optimized out>, t1=t1@entry=0, i1=i1@entry=1, t2=t2@entry=0, i2=0) at graph/paths.cc:186
#1 0x00007fffe97c6e19 in ncclTopoComputePaths (system=0x7ffefb432990, comm=comm@entry=0x55555837a650) at graph/paths.cc:579
#2 0x00007fffe97893ef in initTransportsRank (comm=comm@entry=0x55555837a650, parent=0x0, timers=timers@entry=0x7fffc0ff67c0) at init.cc:811
#3 0x00007fffe978d5f9 in ncclCommInitRankFunc (job_=0x5555583b1890) at init.cc:1408
#4 0x00007fffe97824bc in ncclAsyncJobMain (arg=0x5555583b1890) at group.cc:68
#5 0x00007fffe970b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fffe92ff353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

sjeaugey commented 1 month ago

Would you be able to dump the node topology with NCCL_TOPO_DUMP_FILE=system.txt and post the system.txt here? That should allow us to reproduce the issue.

songh11 commented 1 month ago

> Would you be able to dump the node topology with NCCL_TOPO_DUMP_FILE=system.txt and post the system.txt here? That should allow us to reproduce the issue.

Here is the system.txt: system.txt

kiskra-nvidia commented 1 month ago

Thank you! With the information you provided, we were able to reproduce the problem locally. This is an issue we've encountered before and we hope to include a fix in the next NCCL release!

songh11 commented 1 month ago

> Thank you! With the information you provided, we were able to reproduce the problem locally. This is an issue we've encountered before and we hope to include a fix in the next NCCL release!

Thanks for your reply. Could you tell me what caused the problem?

kiskra-nvidia commented 1 month ago

It's due to a somewhat unusual topology of your system. You have GPUs without P2P capability attached under one NUMA node and then the NIC is under another NUMA node. Our graph searching code gets confused by it, but the fix is actually a trivial one-liner:

--- a/src/graph/paths.cc
+++ b/src/graph/paths.cc
@@ -162,7 +162,7 @@ static ncclResult_t getLocalCpu(struct ncclTopoSystem* system, int gpu, int* ret
   struct ncclTopoLinkList* paths = system->nodes[GPU].nodes[gpu].paths[CPU];
   for (int c=0; c<system->nodes[CPU].count; c++) {
     int hops = paths[c].count;
-    if (minHops == 0 || hops < minHops) {
+    if ((minHops == 0 || hops < minHops) && paths[c].type < PATH_NET) {
       localCpu = c;
       minHops = hops;
     }

Can you try the above and see if it fixes things for you?
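
For readers following along, here is a minimal, self-contained sketch of the selection logic that this one-liner changes. The enum values and the struct below are simplified stand-ins, not NCCL's actual definitions from src/graph/paths.cc; the only property relied on is that PATH_NET is the largest path type, so a CPU node that is only reachable across the network no longer gets picked as a GPU's "local" CPU.

```cpp
#include <cstdio>

// Hypothetical, simplified path types: larger means "further away".
// In real NCCL the only property that matters here is that PATH_NET is the largest.
enum PathType { PATH_LOC = 0, PATH_PHB = 4, PATH_SYS = 5, PATH_NET = 6 };

// Hop count and path type from one GPU to one CPU (NUMA) node.
struct PathToCpu { int count; PathType type; };

// Pick the closest CPU node that is NOT behind the network; return -1 if none qualifies.
int pickLocalCpu(const PathToCpu* paths, int numCpus) {
  int localCpu = -1, minHops = 0;
  for (int c = 0; c < numCpus; c++) {
    int hops = paths[c].count;
    // The one-line fix: also require the path type to stay below PATH_NET.
    if ((minHops == 0 || hops < minHops) && paths[c].type < PATH_NET) {
      localCpu = c;
      minHops = hops;
    }
  }
  return localCpu;
}

int main() {
  // CPU 0 models the fake NUMA node holding the NIC (reachable only over the network);
  // CPU 1 is the real local NUMA node, two hops away.
  PathToCpu paths[] = { {1, PATH_NET}, {2, PATH_PHB} };
  printf("local CPU index: %d\n", pickLocalCpu(paths, 2));  // prints 1 with the fix, 0 without it
  return 0;
}
```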

BTW, unusually, the numaid being reported in the topo file for the NUMA node with the NIC is -1. I was going to follow up with you on that, especially since your nvidia-smi topo -m output shows only one NUMA node. What does numactl -H show?

kiskra-nvidia commented 1 month ago

Could you also post the output of lspci -tv? Thanks!

songh11 commented 1 month ago

> Could you also post the output of lspci -tv? Thanks!

Thank you. Here is what numactl -H and lspci -tv show:

[screenshot: numactl -H and lspci -tv output]
songh11 commented 1 month ago

> It's due to a somewhat unusual topology of your system. You have GPUs without P2P capability attached under one NUMA node and then the NIC is under another NUMA node. Our graph searching code gets confused by it, but the fix is actually a trivial one-liner:
>
> --- a/src/graph/paths.cc
> +++ b/src/graph/paths.cc
> @@ -162,7 +162,7 @@ static ncclResult_t getLocalCpu(struct ncclTopoSystem* system, int gpu, int* ret
>    struct ncclTopoLinkList* paths = system->nodes[GPU].nodes[gpu].paths[CPU];
>    for (int c=0; c<system->nodes[CPU].count; c++) {
>      int hops = paths[c].count;
> -    if (minHops == 0 || hops < minHops) {
> +    if ((minHops == 0 || hops < minHops) && paths[c].type < PATH_NET) {
>        localCpu = c;
>        minHops = hops;
>      }
>
> Can you try the above and see if it fixes things for you?
>
> BTW, unusually, the numaid being reported in the topo file for the NUMA node with the NIC is -1. I was going to follow up with you on that, especially since your nvidia-smi topo -m output shows only one NUMA node. What does numactl -H show?

I tried it, and it looks like the crash just moved somewhere else.

Details

(base) sh9@iv-yd9wibvbpc5i3z3c67ow:~/workspace/nccl-tests$ gdb ./build/all_reduce_perf
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.2) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: .
Find the GDB manual and other documentation resources online at: .
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build/all_reduce_perf...
(gdb) set ar
architecture  args
(gdb) set args -b 8 -e 128M -f 2 -g 4
(gdb) r
Starting program: /home/sh9/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
[New Thread 0x7fffe09aa000 (LWP 961886)]
# Rank 0 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 0 [0x65] NVIDIA L20
# Rank 1 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 1 [0x67] NVIDIA L20
# Rank 2 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 2 [0x69] NVIDIA L20
# Rank 3 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 3 [0x6b] NVIDIA L20
[New Thread 0x7fffcffff000 (LWP 961887)]
[New Thread 0x7fffcf7fe000 (LWP 961888)]
[New Thread 0x7fffce990000 (LWP 961889)]
[New Thread 0x7fffcdcee000 (LWP 961890)]
[New Thread 0x7fffcd04c000 (LWP 961891)]
[New Thread 0x7fffc1fff000 (LWP 961897)]
[New Thread 0x7fffc17fe000 (LWP 961898)]
[New Thread 0x7fffc0ffd000 (LWP 961899)]
[New Thread 0x7fff91fff000 (LWP 961900)]
[New Thread 0x7fff917fe000 (LWP 961901)]
[New Thread 0x7fff90ffd000 (LWP 961902)]
[Thread 0x7fffc1fff000 (LWP 961897) exited]

Thread 10 "all_reduce_perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc0ffd000 (LWP 961899)]
ncclTopoCheckNet (system=system@entry=0x7ffef3432930, id1=421904, id2=413712, net=net@entry=0x7fffc0ff6084) at graph/paths.cc:462
462     if (path->type <= PATH_PXB && path->bw > netSpeed1) netSpeed1 = path->bw;
(gdb) bt
#0 ncclTopoCheckNet (system=system@entry=0x7ffef3432930, id1=421904, id2=413712, net=net@entry=0x7fffc0ff6084) at graph/paths.cc:462
#1 0x00007fffe97e1ed6 in shmCanConnect (graph=<optimized out>, info2=0x7ffef3432700, info1=0x7ffef3432768, topo=0x7ffef3432930, ret=0x7fffc0ff6114) at transport/shm.cc:58
#2 shmCanConnect (ret=0x7fffc0ff6114, topo=0x7ffef3432930, graph=<optimized out>, info1=0x7ffef3432768, info2=0x7ffef3432700) at transport/shm.cc:51
#3 0x00007fffe97c6f94 in ncclTopoComputePaths (system=0x7ffef3432930, comm=comm@entry=0x55555837a620) at graph/paths.cc:593
#4 0x00007fffe97893ef in initTransportsRank (comm=comm@entry=0x55555837a620, parent=0x0, timers=timers@entry=0x7fffc0ff67c0) at init.cc:811
#5 0x00007fffe978d5f9 in ncclCommInitRankFunc (job_=0x5555583b1860) at init.cc:1408
#6 0x00007fffe97824bc in ncclAsyncJobMain (arg=0x5555583b1860) at group.cc:68
#7 0x00007fffe970b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fffe92ff353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

kiskra-nvidia commented 1 month ago

Right, sorry about that! I saw that the code was crashing due to a known bug, but I hadn't verified that it wouldn't crash due to another bug further down...

@sjeaugey, I could use your help to figure out the root cause here. Basically, NCCL is crashing all over the place because of many paths being missing in the graph -- including the paths between GPUs and the (only) NET. That NET is also showing up in the topo dump (see https://github.com/NVIDIA/nccl-tests/issues/233#issuecomment-2232726311) as being attached to a fake CPU numa node (id -1). Is this because this is being run in a VM and an appropriate topo file should always be loaded when invoking NCCL? Or is the VM misconfigured? In particular, lspci -tv is not showing the root PCI bridge (see https://github.com/NVIDIA/nccl-tests/issues/233#issuecomment-2237852934) -- is that what's causing all the paths to be SYS in the output of nvidia-smi topo -m (see https://github.com/NVIDIA/nccl-tests/issues/233#issue-2410060243)? What's the right way to address it?

sjeaugey commented 1 month ago

_That's weird, I can't repro the issue with the topo file. @songh11 could you run outside of GDB with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH and provide the log?_

Edit: I can repro actually, my bad. No need to provide the log.

@kiskra-nvidia devices being attached to CPU -1 is indeed common in VMs, for virtual devices. It should not be a problem. I see we're also failing to figure out the PCI width (width="0"), but again, that should fall back to default values.

sjeaugey commented 1 month ago

One weird thing though: the NIC shows up as:

      <pci busid="0000:00:00.0" class="0x060000" vendor="0x8086" device="0x1237" subsystem_vendor="0x1af4" subsystem_device="0x1100" link_speed="" link_width="0">

Usually NICs have a PCI class of 0x02XXXX, not 0x060000. That may be causing trouble down the line.
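
To make log lines such as "KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1" (seen in the debug log posted below) easier to read, here is an illustrative sketch of a class-code lookup with a fallback. The table and values are hypothetical, not NCCL's real dictionary; the point is only that an unexpected PCI class silently falls through to a default classification.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical key/value table mapping PCI class codes to device kinds.
struct KvEntry { const char* key; int value; };
static const KvEntry kPciClassDict[] = {
  { "0x030000", 0 },  // display controller -> treat as GPU
  { "0x020000", 1 },  // network controller -> treat as NIC
};

// Exact-match lookup; unknown keys are reported and replaced by a fallback value.
int kvToInt(const char* str, const KvEntry* dict, int n, int fallback) {
  for (int i = 0; i < n; i++)
    if (strcmp(str, dict[i].key) == 0) return dict[i].value;
  printf("could not find value of '%s' in dictionary, falling back to %d\n", str, fallback);
  return fallback;
}

int main() {
  kvToInt("0x020000", kPciClassDict, 2, 1);  // a NIC with the usual 0x02xxxx class matches
  kvToInt("0x060000", kPciClassDict, 2, 1);  // a NIC reporting a bridge class falls back
  return 0;
}
```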

sjeaugey commented 1 month ago

Ok, it looks like all our problems come from the fact that the "unknown NUMA node", a.k.a. "-1", translates into "Node -1 / NUMA node -1", which causes the new code to consider that the NIC and GPU are on different nodes.

Using 0xffff instead of -1 should fix the problem:

diff --git a/src/graph/xml.cc b/src/graph/xml.cc
index c2c6a1c81..d6fd91dfe 100644
--- a/src/graph/xml.cc
+++ b/src/graph/xml.cc
@@ -678,13 +678,13 @@ ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml*
       }
     } else {
       // No information on /sys, attach GPU to unknown CPU
-      NCCLCHECK(xmlFindTagKv(xml, "cpu", &parent, "numaid", "-1"));
+      NCCLCHECK(xmlFindTagKv(xml, "cpu", &parent, "numaid", "0xffff"));
       if (parent == NULL) {
         struct ncclXmlNode* topNode;
         NCCLCHECK(xmlFindTag(xml, "system", &topNode));
         NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
         NCCLCHECK(xmlSetAttrLong(parent, "host_hash", getHostHash()));
-        NCCLCHECK(xmlSetAttr(parent, "numaid", "-1"));
+        NCCLCHECK(xmlSetAttr(parent, "numaid", "0xffff"));
         NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
       }
     }
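
As a side note on why the sentinel value matters: the sketch below only illustrates the parsing difference between the two attribute values, under the assumption stated above that a negative NUMA id propagates into later node comparisons while a large positive sentinel behaves like an ordinary, if fake, node number. It is not NCCL's actual XML code.

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
  // The old and new sentinel strings written into the topology XML for an "unknown" CPU node.
  const char* oldSentinel = "-1";
  const char* newSentinel = "0xffff";

  long oldId = strtol(oldSentinel, nullptr, 0);  // parses to -1
  long newId = strtol(newSentinel, nullptr, 0);  // parses to 65535

  printf("numaid \"%s\" -> %ld, numaid \"%s\" -> %ld\n", oldSentinel, oldId, newSentinel, newId);
  return 0;
}
```
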
Tison-Liu commented 1 month ago

@songh11 By the way, I use the same GPU as you, in a two-machine, four-card setup, but my all_reduce test only reaches about 22 GB/s. I'd like to compare against your test results; is this normal?

songh11 commented 1 month ago

> @songh11 By the way, I use the same GPU as you, in a two-machine, four-card setup, but my all_reduce test only reaches about 22 GB/s. I'd like to compare against your test results; is this normal?

I only used one machine with four L20s, and I can run on NCCL v2.17.1. My all_reduce test gets about 17 GB/s; I think there might be something wrong with my results.
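
For comparing numbers like 17 GB/s and 22 GB/s, it helps to know how all_reduce_perf reports bandwidth: nccl-tests computes an algorithm bandwidth (bytes/time) and scales it by 2*(n-1)/n to get the "busbw" column (see the project's PERFORMANCE.md). A small sketch with made-up timing values:

```cpp
#include <cstdio>

int main() {
  // Made-up inputs for illustration; only the formula mirrors what nccl-tests reports.
  double bytes   = 128.0 * 1024 * 1024;  // 128 MiB message, the largest size in the sweep above
  double seconds = 0.006;                // hypothetical measured time per iteration
  int nRanks     = 4;                    // four GPUs / ranks

  double algBw = bytes / 1e9 / seconds;                // algorithm bandwidth in GB/s
  double busBw = algBw * 2.0 * (nRanks - 1) / nRanks;  // all_reduce bus bandwidth in GB/s
  printf("algbw %.2f GB/s, busbw %.2f GB/s\n", algBw, busBw);
  return 0;
}
```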

songh11 commented 1 month ago

> _That's weird, I can't repro the issue with the topo file. @songh11 could you run outside of GDB with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH and provide the log?_
>
> Edit: I can repro actually, my bad. No need to provide the log.
>
> @kiskra-nvidia devices being attached to CPU -1 is indeed common in VMs, for virtual devices. It should not be a problem. I see we're also failing to figure out the PCI width (width="0"), but again, that should fall back to default values.

Thank you, and this is my log:

Details

# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 0 [0x65] NVIDIA L20
# Rank 1 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 1 [0x67] NVIDIA L20
# Rank 2 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 2 [0x69] NVIDIA L20
# Rank 3 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 3 [0x6b] NVIDIA L20
iv-yd9wibvbpc5i3z3c67ow:1539966:1539966 [0] NCCL INFO Bootstrap : Using eth0:10.41.72.35<0>
iv-yd9wibvbpc5i3z3c67ow:1539966:1539966 [0] NCCL INFO cudaDriverVersion 12020
iv-yd9wibvbpc5i3z3c67ow:1539966:1539966 [3] NCCL INFO NCCL version 2.22.3+cuda12.1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eth0:10.41.72.35<0>
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO ncclCommInitRank comm 0x55b0a9764440 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 65010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO ncclCommInitRank comm 0x55b0a97d4820 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 69010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO ncclCommInitRank comm 0x55b0a980bb40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 6b010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO ncclCommInitRank comm 0x55b0a979d4c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 67010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:66/0000:66:00.0/0000:67:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:64/0000:64:00.0/0000:65:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:68/0000:68:00.0/0000:69:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:6a/0000:6a:00.0/0000:6b:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:64/0000:64:00.0/0000:65:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:66/0000:66:00.0/0000:67:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:68/0000:68:00.0/0000:69:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:6a/0000:6a:00.0/0000:6b:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
Segmentation fault (core dumped)

songh11 commented 1 month ago

> Ok, it looks like all our problems come from the fact that the "unknown NUMA node", a.k.a. "-1", translates into "Node -1 / NUMA node -1", which causes the new code to consider that the NIC and GPU are on different nodes.
>
> Using 0xffff instead of -1 should fix the problem:
>
> diff --git a/src/graph/xml.cc b/src/graph/xml.cc
> index c2c6a1c81..d6fd91dfe 100644
> --- a/src/graph/xml.cc
> +++ b/src/graph/xml.cc
> @@ -678,13 +678,13 @@ ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml*
>        }
>      } else {
>        // No information on /sys, attach GPU to unknown CPU
> -      NCCLCHECK(xmlFindTagKv(xml, "cpu", &parent, "numaid", "-1"));
> +      NCCLCHECK(xmlFindTagKv(xml, "cpu", &parent, "numaid", "0xffff"));
>        if (parent == NULL) {
>          struct ncclXmlNode* topNode;
>          NCCLCHECK(xmlFindTag(xml, "system", &topNode));
>          NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
>          NCCLCHECK(xmlSetAttrLong(parent, "host_hash", getHostHash()));
> -        NCCLCHECK(xmlSetAttr(parent, "numaid", "-1"));
> +        NCCLCHECK(xmlSetAttr(parent, "numaid", "0xffff"));
>          NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
>        }
>      }

Thank you for your help, but it's weird that this one also gives an error.

sjeaugey commented 1 month ago

Do you mean that this change did not fix your problem? If that is the case, can you provide the log?

kiskra-nvidia commented 1 month ago

Can you also provide the NCCL_TOPO_DUMP_FILE with @sjeaugey's fix in place?

kiskra-nvidia commented 2 weeks ago

@songh11 Are you still there? We have an alternative patch for you to try if you are still facing this problem.

songh11 commented 2 weeks ago

> @songh11 Are you still there? We have an alternative patch for you to try if you are still facing this problem.

I'm sorry for replying so late; my machine has expired, so I can't test it now. You can close this issue and I'll try again later. Thanks a lot.