NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL WARN Could not find real path of... #573

Open ljz756245026 opened 3 years ago

ljz756245026 commented 3 years ago

When I try to run data-parallel training on a single machine with 2 GPUs, the following error occurs.

NCCL version 2.7.8+cuda11.0

xxxxx:2573:2612 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:1e/../../0000:1e:00.0

xxxxx:2572:2610 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:1e/../../0000:1e:00.0
Traceback (most recent call last):
  File "main_ddp.py", line 285, in <module>
    main()
  File "main_ddp.py", line 123, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/liujz1/anaconda3/envs/myenvs3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/liujz1/anaconda3/envs/myenvs3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/liujz1/anaconda3/envs/myenvs3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/liujz1/anaconda3/envs/myenvs3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/liujz1/GraphSAGE/main_ddp.py", line 146, in main_worker
    world_size=args.world_size, rank=args.rank)
  File "/home/liujz1/anaconda3/envs/myenvs3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/liujz1/anaconda3/envs/myenvs3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8

exit status 1

There are two key messages:

1.

NCCL version 2.7.8+cuda11.0

xxxxx:2573:2612 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:1e/../../0000:1e:00.0

xxxxx:2572:2610 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:1e/../../0000:1e:00.0

2.

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
sjeaugey commented 3 years ago

This is coming from the topology detection in NCCL. You seem to be using a GPU with PCI ID 0000:1e:00.0; can you check whether /sys/class/pci_bus/0000:1e exists in your NCCL environment, and if so where it points to, and whether from there you can cd to ../../0000:1e:00.0?

Or run this and see where it fails:

realpath /sys/class/pci_bus/0000:1e
cd `realpath /sys/class/pci_bus/0000:1e`
cd ../../0000:1e:00.0
ljz756245026 commented 3 years ago

This is coming from the topology detection in NCCL. You seem to be using a GPU with PCI ID 0000:1e:00.0; can you check whether /sys/class/pci_bus/0000:1e exists in your NCCL environment, and if so where it points to, and whether from there you can cd to ../../0000:1e:00.0?

Or run this and see where it fails:

realpath /sys/class/pci_bus/0000:1e
cd `realpath /sys/class/pci_bus/0000:1e`
cd ../../0000:1e:00.0

I checked the file path. There is no such file there.

sjeaugey commented 3 years ago

Could it be /sys is not mounted in your environment somehow?
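For example (a rough check; substitute the bus ID nvidia-smi reports for your GPU):

findmnt /sys                # sysfs should be mounted here (or: mount | grep sysfs)
ls /sys/class/pci_bus/      # should contain an entry like 0000:1e for the GPU's bus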

tymons commented 3 years ago

I have a similar issue: I am trying to run a PyTorch model in a WSL2-based nvidia-pytorch container, nvcr.io/nvidia/pytorch:21.10-py3.

Unfortunately, it fails with:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I managed to get some logs; the error is probably related to these two NCCL warnings:

0a3ec01864bc:1:1 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v3 symbol.
0a3ec01864bc:1:29 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0

My /sys/class/pci_bus directory does not contain a 0000:01 entry. There are some others, but not that one specifically.


Do you have any idea what I should do next?

The libraries in my environment:

pytorch==1.8.0
cudatoolkit==11.1.1
python==3.8.5

NCCL version 2.7.8+cuda11.1
sjeaugey commented 3 years ago

Yes, this is the error:

0a3ec01864bc:1:29 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0

So you have a GPU which has a PCI ID of 0000:01, yet it is not visible in /sys/class/pci_bus. This is probably a problem with the configuration of your VM or container. Which VM/container system are you using?
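One way to narrow it down (a rough sketch; run it inside the same environment where NCCL runs) is to compare what the driver reports with what the container exposes:

nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader   # e.g. 00000000:01:00.0
ls /sys/class/pci_bus/                                    # should contain a matching 0000:01 entry
lspci | grep -i nvidia                                    # the GPU should appear on the same bus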

tymons commented 3 years ago

Docker Desktop with WSL2, so it is probably related to #442?

sjeaugey commented 3 years ago

Ah, indeed, WSL does not report topology in /sys. You need to try with a more recent version of NCCL which will ignore the PCI topology if not found.
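To see which NCCL version your PyTorch build actually bundles (a quick check; torch.cuda.nccl.version() is available in recent PyTorch releases):

python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"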

Lost-little-dinosaur commented 1 year ago

Ah, indeed, WSL does not report topology in /sys. You need to try with a more recent version of NCCL which will ignore the PCI topology if not found.

I received the same error:

ASUS:3049:3049 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
ASUS:3049:3049 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

ASUS:3049:3049 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ASUS:3049:3049 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
ASUS:3049:3049 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2

ASUS:3049:3102 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
ASUS:3049:3102 [0] NCCL INFO graph/xml.cc:469 -> 2
ASUS:3049:3102 [0] NCCL INFO graph/xml.cc:660 -> 2
ASUS:3049:3102 [0] NCCL INFO graph/topo.cc:523 -> 2
ASUS:3049:3102 [0] NCCL INFO init.cc:581 -> 2
ASUS:3049:3102 [0] NCCL INFO init.cc:840 -> 2
ASUS:3049:3102 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

But I have already downloaded the latest version of NCCL, which is 2.8.3, and it still fails. My environment is

Ubuntu18.04 on WSL2

Please help me...

sjeaugey commented 1 year ago

The latest version of NCCL is 2.16. You can always rebuild it inside WSL2 or download it from https://developer.nvidia.com/nccl.
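A rough outline of building it from source inside WSL2 (based on the repository README; adjust to your setup):

git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build        # optionally pass NVCC_GENCODE to target only your GPU architecture
# the libraries end up under ./build/lib; to install system-wide, build a package, e.g.:
make pkg.debian.build

Note that prebuilt PyTorch wheels usually bundle their own NCCL, so upgrading the system library alone may not change the version PyTorch reports; moving to a newer PyTorch build (or rebuilding PyTorch with USE_SYSTEM_NCCL=1) is typically needed as well.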

FifthEpoch commented 1 year ago

Same issue here. I am using a remote Ubuntu 20.04 VM with 2 A100s.

bfclient02:14681:14681 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
bfclient02:14681:14681 [0] NCCL INFO NET/IB : No device found.
bfclient02:14681:14681 [0] NCCL INFO NET/Socket : Using [0]ens160:158.132.102.178<0>
bfclient02:14681:14681 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
bfclient02:14682:14682 [1] NCCL INFO Bootstrap : Using [0]ens160:158.132.102.178<0>
bfclient02:14682:14682 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
bfclient02:14682:14682 [1] NCCL INFO NET/IB : No device found.
bfclient02:14682:14682 [1] NCCL INFO NET/Socket : Using [0]ens160:158.132.102.178<0>
bfclient02:14682:14682 [1] NCCL INFO Using network Socket

bfclient02:14681:14750 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0d/../../0000:0d:00.0
bfclient02:14681:14750 [0] NCCL INFO graph/xml.cc:648 -> 2
bfclient02:14681:14750 [0] NCCL INFO graph/xml.cc:665 -> 2
bfclient02:14681:14750 [0] NCCL INFO graph/topo.cc:523 -> 2
bfclient02:14681:14750 [0] NCCL INFO init.cc:581 -> 2
bfclient02:14681:14750 [0] NCCL INFO init.cc:840 -> 2
bfclient02:14681:14750 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

bfclient02:14682:14752 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0d/../../0000:0d:00.0
bfclient02:14682:14752 [1] NCCL INFO graph/xml.cc:648 -> 2
bfclient02:14682:14752 [1] NCCL INFO graph/xml.cc:665 -> 2
bfclient02:14682:14752 [1] NCCL INFO graph/topo.cc:523 -> 2
bfclient02:14682:14752 [1] NCCL INFO init.cc:581 -> 2
bfclient02:14682:14752 [1] NCCL INFO init.cc:840 -> 2
bfclient02:14682:14752 [1] NCCL INFO group.cc:73 -> 2 [Async thread]

I checked whether the two paths exist, but they don't. Here's what I see: realpath /sys/class/pci_bus/0000:0d returns /sys/devices/pci0000:00/0000:00:16.2/pci_bus/0000:0d

then cd ../../ && ls returns:

0000:00:16.2:pcie001  config                    device           firmware_node  max_link_speed  numa_node    rescan        secondary_bus_number    uevent
0000:00:16.2:pcie004  consistent_dma_mask_bits  dma_mask_bits    irq            max_link_width  pci_bus      reset         subordinate_bus_number  vendor
ari_enabled           current_link_speed        driver           link           modalias        power        reset_method  subsystem
broken_parity_status  current_link_width        driver_override  local_cpulist  msi_bus         power_state  resource      subsystem_device
class                 d3cold_allowed            enable           local_cpus     msi_irqs        remove       revision      subsystem_vendor

I'm guessing 0000:00:16.2:pcie001 and 0000:00:16.2:pcie004 are the A100s, but I am not too sure. Any clues?

@sjeaugey, you asked above, "Could it be /sys is not mounted in your environment somehow?" Could you walk me through how to check for that? Thanks in advance.

sjeaugey commented 1 year ago

Uh. Interesting. This is the first time I've seen that kind of "PCI ID": the classic "ABCD:EF:GH.I", but with a ":xxxxxxx" suffix. Supporting that will be complicated, as we often switch back and forth between the string representation and the numeric 0xABCDEFGHI. At least we should not crash though, and just attach the GPU to the CPU, as we do for NICs we can't locate.

FifthEpoch commented 1 year ago

The weird ID could be because the server I'm using relies on Bitfusion to manage communication with the GPUs.

We didn't crash. Just error'd out:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

But my current setup sounds like a dead end, judging from what you said. Is there any way you can think of to overcome this issue?

sjeaugey commented 1 year ago

We didn't crash. Just error'd out:

Right, that's what I meant. Ideally we should not error out. But given that GPUs are identified by their PCI ID converted to an int64_t, in your case both GPUs would have the same ID, so even if we got past the topology detection, the later code would not work. Supporting this case would require a significant rework of the topology code.

FifthEpoch commented 1 year ago

Yea, I understand if this isn't a supported case for y'all since this is probably an edge case. Thanks for the explanation though, I appreciate it.

Just found another GitHub post (you also commented there) referencing a possibly related issue involving Bitfusion. I am using 2.7.8, just like the OP of that post. Perhaps I could try using a different version of NCCL? The only tool using NCCL in my environment is torch, and there might be a way for me to build it against another version of NCCL.

sjeaugey commented 1 year ago

I doubt a more recent version of NCCL would work better in your case, due to the PCI ID of your GPUs. The best option would be to find a way to have the GPUs show up with a purely numerical PCI ID.

FifthEpoch commented 1 year ago

Sigh. Okay. I will go apply for other servers at my lab. Thanks again for the pointers!

lix19937 commented 7 months ago

I also hit this:

SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO Bootstrap : Using [0]eth0:172.27.213.95<0>
SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

SHJS-PF4ZKYLL:92777:92777 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO NET/Socket : Using [0]eth0:172.27.213.95<0>
SHJS-PF4ZKYLL:92777:92777 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

SHJS-PF4ZKYLL:92777:92889 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO graph/xml.cc:469 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO graph/xml.cc:660 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO graph/topo.cc:523 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO init.cc:581 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO init.cc:840 -> 2
SHJS-PF4ZKYLL:92777:92889 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
sjeaugey commented 7 months ago

You're using a GPU with PCI ID 0000:01 (check nvidia-smi) and NCCL cannot find its PCI topology in /sys/class/..

That's a problem. You should probably make sure NCCL can access that path.

lix19937 commented 7 months ago

I use WSL2. nvidia-smi shows info, but lspci | grep -i nvidia shows nothing. My NCCL version is 2.7.8.

lix19937 commented 7 months ago

lix@SHJS-PF4ZKYLL:/mnt/d$ ls /sys/class/pci_bus/

383f:00  38cc:00  497c:00  d13b:00

There is no 0000:01 entry.

sjeaugey commented 7 months ago

NCCL 2.7.8 is very old. Not sure it supports WSL2.