ljz756245026 opened this issue 3 years ago
This is coming from the topology detection in NCCL. You seem to be using a GPU with PCI ID 0000:1e:00.0; can you check whether `/sys/class/pci_bus/0000:1e` exists in your environment, and if so where it points to, and whether from there you can `cd` to `../../0000:1e:00.0`?

Or run this and see where it fails:

```shell
realpath /sys/class/pci_bus/0000:1e
cd `realpath /sys/class/pci_bus/0000:1e`
cd ../../0000:1e:00.0
```
I checked the file path. There is no such file there.
Could it be /sys is not mounted in your environment somehow?
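One quick way to check, from inside the environment where NCCL runs (illustrative commands; the exact output depends on your setup):

```shell
# sysfs should be mounted at /sys; this prints a line like
# "sysfs on /sys type sysfs (...)" when it is.
mount -t sysfs

# The PCI topology NCCL reads lives under /sys/class/pci_bus; on WSL2
# or in a restricted container this directory may be missing or incomplete.
ls /sys/class/pci_bus 2>/dev/null || echo "pci_bus not exposed"
```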
I have a similar issue where I am trying to run a PyTorch model in a WSL2-based NVIDIA PyTorch container: nvcr.io/nvidia/pytorch:21.10-py3. Unfortunately I failed to do so:

```
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```

I managed to get some logs; the error is probably related to these two NCCL warnings:

```
0a3ec01864bc:1:1 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v3 symbol.
0a3ec01864bc:1:29 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
```
My `/sys/class/pci_bus` directory does not have a `0000:01` entry. There are some others, but not that one specifically.

Do you have any idea what I should do next?
```
pytorch==1.8.0
cudatoolkit==11.1.1
python==3.8.5
NCCL version 2.7.8+cuda11.1
```
Yes, this is the error:

```
0a3ec01864bc:1:29 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
```
So you have a GPU which has a PCI ID of 0000:01, yet it is not visible in /sys/class/pci_bus. This is probably a problem with the configuration of your VM or container. Which VM/container system are you using?
Docker Desktop with WSL2, so it is probably related to #442?
Ah, indeed, WSL does not report topology in /sys. You need to try with a more recent version of NCCL which will ignore the PCI topology if not found.
I received the same error:

```
ASUS:3049:3049 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
ASUS:3049:3049 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ASUS:3049:3049 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ASUS:3049:3049 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
ASUS:3049:3049 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
ASUS:3049:3102 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
ASUS:3049:3102 [0] NCCL INFO graph/xml.cc:469 -> 2
ASUS:3049:3102 [0] NCCL INFO graph/xml.cc:660 -> 2
ASUS:3049:3102 [0] NCCL INFO graph/topo.cc:523 -> 2
ASUS:3049:3102 [0] NCCL INFO init.cc:581 -> 2
ASUS:3049:3102 [0] NCCL INFO init.cc:840 -> 2
ASUS:3049:3102 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
```
But I have already downloaded the latest version of NCCL, version 2.8.3, and it still fails. My environment is Ubuntu 18.04 on WSL2.

Please help me...
The latest version of NCCL is 2.16. You can always rebuild it inside WSL2 or download it from https://developer.nvidia.com/nccl.
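A sketch of how one might build a recent NCCL from source inside WSL2 (the CUDA path and install prefix here are assumptions; see the NCCL repository README for the authoritative steps). Note also that PyTorch statically links its own NCCL, so you may need to rebuild PyTorch with `USE_SYSTEM_NCCL=1` for it to pick up the new version.

```shell
# Sketch only: clone and build NCCL, assuming CUDA is installed
# under /usr/local/cuda.
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build CUDA_HOME=/usr/local/cuda
# Install into /usr/local (default prefix):
sudo make install
```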
Same issue here. I am using a remote Ubuntu 20.04 VM with 2 A100s.
```
bfclient02:14681:14681 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
bfclient02:14681:14681 [0] NCCL INFO NET/IB : No device found.
bfclient02:14681:14681 [0] NCCL INFO NET/Socket : Using [0]ens160:158.132.102.178<0>
bfclient02:14681:14681 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
bfclient02:14682:14682 [1] NCCL INFO Bootstrap : Using [0]ens160:158.132.102.178<0>
bfclient02:14682:14682 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
bfclient02:14682:14682 [1] NCCL INFO NET/IB : No device found.
bfclient02:14682:14682 [1] NCCL INFO NET/Socket : Using [0]ens160:158.132.102.178<0>
bfclient02:14682:14682 [1] NCCL INFO Using network Socket
bfclient02:14681:14750 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0d/../../0000:0d:00.0
bfclient02:14681:14750 [0] NCCL INFO graph/xml.cc:648 -> 2
bfclient02:14681:14750 [0] NCCL INFO graph/xml.cc:665 -> 2
bfclient02:14681:14750 [0] NCCL INFO graph/topo.cc:523 -> 2
bfclient02:14681:14750 [0] NCCL INFO init.cc:581 -> 2
bfclient02:14681:14750 [0] NCCL INFO init.cc:840 -> 2
bfclient02:14681:14750 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
bfclient02:14682:14752 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0d/../../0000:0d:00.0
bfclient02:14682:14752 [1] NCCL INFO graph/xml.cc:648 -> 2
bfclient02:14682:14752 [1] NCCL INFO graph/xml.cc:665 -> 2
bfclient02:14682:14752 [1] NCCL INFO graph/topo.cc:523 -> 2
bfclient02:14682:14752 [1] NCCL INFO init.cc:581 -> 2
bfclient02:14682:14752 [1] NCCL INFO init.cc:840 -> 2
bfclient02:14682:14752 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
```
I checked if the two paths exist, but they don't. Here's what I see:

```shell
$ realpath /sys/class/pci_bus/0000:0d
/sys/devices/pci0000:00/0000:00:16.2/pci_bus/0000:0d
$ cd ../../ && ls
0000:00:16.2:pcie001 config device firmware_node max_link_speed numa_node rescan secondary_bus_number uevent
0000:00:16.2:pcie004 consistent_dma_mask_bits dma_mask_bits irq max_link_width pci_bus reset subordinate_bus_number vendor
ari_enabled current_link_speed driver link modalias power reset_method subsystem
broken_parity_status current_link_width driver_override local_cpulist msi_bus power_state resource subsystem_device
class d3cold_allowed enable local_cpus msi_irqs remove revision subsystem_vendor
```
I'm guessing `0000:00:16.2:pcie001` and `0000:00:16.2:pcie004` are the A100s, but I am not too sure. Any clues?
@sjeaugey, you asked above, "Could it be /sys is not mounted in your environment somehow?" Could you walk me through how I can check for that? Thanks in advance.
Uh. Interesting. This is the first time I've seen that kind of "PCI ID": the classic `ABCD:EF:GH.I`, but with a `:xxxxxxx` suffix. Supporting that will be complicated, as we often switch back and forth between the string representation and the numeric 0xABCDEFGHI. At least we should not crash though, and should just attach the GPU to the CPU, as we do for NICs we can't locate.
The weird ID could be because the server I'm using uses Bitfusion to manage my communication to the GPUs.
We didn't crash. Just error'd out:

```
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```
But my current set up sounds like a dead end judging from what you said. Is there any way you can think of where I can overcome this issue?
> We didn't crash. Just error'd out:
Right, that's what I meant; ideally we should not error out. But given that GPUs are identified by their PCI ID converted to an `int64_t`, in your case both GPUs would have the same ID, so even if we got past the topology detection, the later code would not work. Supporting this case would require a significant rework of the topology code.
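A minimal sketch of the collision (not NCCL's actual code; the bit layout is illustrative): parsing the fixed "domain:bus:device.function" positions of the busId string ignores anything after the function digit, so both Bitfusion-style IDs map to the same integer.

```shell
# Parse "DDDD:BB:DD.F..." at fixed offsets into a single integer;
# any vendor suffix such as ":pcie001" is simply ignored.
busid_to_int() {
  local id=$1
  printf '%d\n' $(( (0x${id:0:4} << 24) | (0x${id:5:2} << 16) | (0x${id:8:2} << 8) | 0x${id:11:1} ))
}

a=$(busid_to_int "0000:00:16.2:pcie001")
b=$(busid_to_int "0000:00:16.2:pcie004")
[ "$a" = "$b" ] && echo "both GPUs collapse to the same ID"
```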
Yea, I understand if this isn't a supported case for y'all since this is probably an edge case. Thanks for the explanation though, I appreciate it.
Just found another GitHub post (you also commented there) referencing a maybe-related issue involving Bitfusion. I am using 2.7.8, just like the OP of that issue. Perhaps I could try a different version of NCCL? The only tool in my env that uses NCCL is torch, and there might be a way to build it against another version of NCCL.
I doubt a more recent version of NCCL would work better in your case, due to the PCI ID of your GPUs. The best would be to find a way to have GPUs shown with a pure numerical PCI ID.
Sigh. Okay. I will go apply for other servers at my lab. Thanks again for the pointers!
I also hit this:

```
SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO Bootstrap : Using [0]eth0:172.27.213.95<0>
SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
SHJS-PF4ZKYLL:92777:92777 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO NET/Socket : Using [0]eth0:172.27.213.95<0>
SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
SHJS-PF4ZKYLL:92777:92889 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO graph/xml.cc:469 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO graph/xml.cc:660 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO graph/topo.cc:523 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO init.cc:581 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO init.cc:840 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
```
You're using a GPU with PCI ID 0000:01 (check `nvidia-smi`), and NCCL cannot find its PCI topology in `/sys/class/..`. That's a problem. You should probably make sure NCCL can access that path.
I use WSL2. `nvidia-smi` shows info, but `lspci | grep -i nvidia` shows nothing. My NCCL version is 2.7.8.
```shell
lix@SHJS-PF4ZKYLL:/mnt/d$ ls /sys/class/pci_bus/
383f:00  38cc:00  497c:00  d13b:00
```

There is no `0000:01`.
NCCL 2.7.8 is very old. Not sure it supports WSL2.
When I try to run data parallel on a single machine with 2 GPUs, the following error happens.