NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Unable to load libnccl-net.so : libnccl-net.so: cannot open shared object file: No such file or directory #162

Closed Yang-HangWA closed 5 years ago

Yang-HangWA commented 5 years ago

I'm trying to run Keras (with TensorFlow as the backend) with NCCL, and it built successfully. But when I run my application I keep getting this error: "Unable to load libnccl-net.so : libnccl-net.so: cannot open shared object file: No such file or directory". I use CUDA 9.0, yet I see the message "NCCL version 2.3.7+cuda8.0". I have set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/include:/home/yanghang/nccl/build/lib:/home/yanghang/nccl/build/lib:/usr/local/lib

I installed NCCL using exactly these commands:

$ git clone https://github.com/NVIDIA/nccl.git
$ cd nccl
$ sudo make install -j4

I have seen issue https://github.com/NVIDIA/nccl/issues/96, but I can't locate the problem. Can somebody help me?

AddyLaddy commented 5 years ago

That INFO message is normal behaviour and comes from the new external network plugin support we have added to the NCCL GitHub code base. It allows a vendor to create their own external network transport plugin for NCCL to use, for example https://github.com/aws/aws-ofi-nccl

After that message you should see another INFO message, NCCL INFO Using internal Network [IB|Socket], which signifies that NCCL has fallen back to using one of its internal network transports.

Yang-HangWA commented 5 years ago

@AddyLaddy Thanks for your reply. I run the program on a machine with two GPUs, so NCCL has just fallen back to one of its internal network transports and nothing is wrong? Even though I use CUDA 9.0?

AddyLaddy commented 5 years ago

Oh I see, if it's a single node job then the network transport is irrelevant. NCCL will use the node's internal systems to communicate between the GPUs (i.e. NVLink or PCI). CUDA 9.x and later should be fine too.

sjeaugey commented 5 years ago

INFO lines are for information only (and basically for us). Only WARN lines should be paid attention to. That said, maybe we could catch that special return code (ENOENT) and print a less concerning message.
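To act on this advice, a small filter can pull only the WARN lines out of a captured log. This is a minimal sketch that assumes each message carries the literal tag `NCCL WARN` or `NCCL INFO`, as in the INFO line quoted earlier in this thread; the function name and the sample log lines (including the WARN text) are made up for illustration and are not real NCCL output.

```python
def nccl_warnings(log_text):
    """Return only the lines of a captured NCCL log that carry a WARN tag."""
    return [line for line in log_text.splitlines() if "NCCL WARN" in line]


# Hypothetical log excerpt, for illustration only (not real NCCL output).
sample = "\n".join([
    "host:1:1 [0] NCCL INFO Using internal Network Socket",
    "host:1:1 [0] NCCL WARN example warning text",
])
print(nccl_warnings(sample))
```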

tahouse commented 5 years ago

@sjeaugey @AddyLaddy In my case, I am trying to test the libnccl-net.so plugin but NCCL still reports not finding it. The location has been added to my LD_LIBRARY_PATH. Are there additional flags I need to set to help NCCL find the plugin?

sjeaugey commented 5 years ago

Did you also make sure that all libraries the plugin relies on (e.g. the libfabric libraries) are in your LD_LIBRARY_PATH? The only thing we see is that dlopen() fails, but it could be that the library was found and dlopen() still could not load it.
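One way to check the plugin's dependencies is to run `ldd` on it and look for entries marked "not found". The sketch below assumes a Linux system with `ldd` on the PATH; the function names are illustrative, and the plugin path must be supplied by the reader.

```python
import subprocess


def missing_dependencies(ldd_output):
    """Return names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like "<tab>libfoo.so.1 => not found".
            missing.append(line.split("=>")[0].strip())
    return missing


def check_plugin(path):
    """Run ldd on the plugin at `path` and list its unresolved dependencies."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_dependencies(result.stdout)
```

Calling `check_plugin()` with the full path to your libnccl-net.so, in the same environment the job runs in, returns an empty list when every dependency resolves.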

tahouse commented 5 years ago

I have checked, and libfabric.so is also in my LD_LIBRARY_PATH. I will see if anything else is missing from the path. Interestingly, nccl-tests build and run fine, but using NCCL in TensorFlow/Horovod I get the "No plugin found" message for libnccl-net.so.

sjeaugey commented 5 years ago

Most probably it's an issue with mpirun not propagating the environment. First thing would be to make sure mpirun is launched with -x LD_LIBRARY_PATH.
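A quick way to confirm what each rank actually receives is to launch a tiny script under the same mpirun command line in place of the real job. This sketch assumes Open MPI, which sets `OMPI_COMM_WORLD_RANK` for each process; the function name is illustrative.

```python
import os
import socket


def describe_environment(env=None):
    """Summarise what this process sees; meant to be run once per MPI rank."""
    env = os.environ if env is None else env
    # OMPI_COMM_WORLD_RANK is set by Open MPI (assumption: Open MPI launcher).
    rank = env.get("OMPI_COMM_WORLD_RANK", "?")
    paths = env.get("LD_LIBRARY_PATH", "<unset>")
    return f"{socket.gethostname()} rank {rank}: LD_LIBRARY_PATH={paths}"


if __name__ == "__main__":
    print(describe_environment())
```

If any rank prints `<unset>` or a path that differs from the launching shell, the environment is not being propagated as expected.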

tahouse commented 5 years ago

Yes, I'm already launching mpirun with -x LD_LIBRARY_PATH, though I'm sure there must be something wrong with the paths (I have the file built from https://github.com/aws/aws-ofi-nccl and it passes nccl-tests). Is there any additional debug info I can get from NCCL specifically about why it couldn't load libnccl-net? I'm happy to apply a patch, as I'm building from source anyhow.

tahouse commented 5 years ago

FYI, this nccl-tests run passes on 2 hosts, but similar args with mpirun and Horovod do not load libnccl-net:

$HOME/anaconda3/bin/mpirun \
   -x FI_PROVIDER="efa" \
   -x FI_OFI_RXR_RX_COPY_UNEXP=1 -x FI_OFI_RXR_RX_COPY_OOO=1 \
   -x FI_EFA_MR_CACHE_ENABLE=1 -x FI_OFI_RXR_INLINE_MR_ENABLE=1 \
   -x LD_LIBRARY_PATH=$HOME/aws-ofi-nccl/install/lib/:$HOME/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
   -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 --hostfile ~/hosts -n 16 -N 8 \
   --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
   $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

sjeaugey commented 5 years ago

Unfortunately, I could not find a way to get a better message. Maybe try to print the content of dlerror() at https://github.com/NVIDIA/nccl/blob/master/src/init.cc#L106 ?
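Short of patching NCCL, the loader's error text can also be seen from Python, since `ctypes.CDLL` goes through the same dynamic loader on Linux and reports the failure as an `OSError`. This is a sketch rather than a substitute for printing dlerror() inside NCCL, and it only reflects NCCL's situation if it runs in the same environment as the failing job; the function name is illustrative.

```python
import ctypes


def try_load(library_name):
    """Attempt to load `library_name`; return (ok, message)."""
    try:
        ctypes.CDLL(library_name)
    except OSError as err:
        # On Linux the OSError text comes from the dynamic loader.
        return False, str(err)
    return True, "loaded"


if __name__ == "__main__":
    print(try_load("libnccl-net.so"))
```

On Linux, a message that names a different library than libnccl-net.so typically points to a missing dependency rather than a missing plugin.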

kwen2501 commented 5 years ago

Could you be running Horovod from within a container (and nccl-tests on bare metal)? For the container case, is it possible that not all paths in LD_LIBRARY_PATH are mapped into the container?
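To test the container hypothesis, the search path can be inspected from inside the environment where the job actually runs. The sketch below lists each LD_LIBRARY_PATH entry, notes whether the directory exists there, and notes whether it holds the plugin file; the function name and status strings are illustrative.

```python
import os


def inspect_library_path(library_path, filename="libnccl-net.so"):
    """Split a colon-separated search path and report per-directory status."""
    report = []
    for directory in library_path.split(":"):
        if not directory:
            continue  # skip empty entries
        if not os.path.isdir(directory):
            status = "missing directory"
        elif os.path.exists(os.path.join(directory, filename)):
            status = "contains " + filename
        else:
            status = "present"
        report.append((directory, status))
    return report


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, status in inspect_library_path(current):
        print(f"{status:>28}  {directory}")
```

Any entry reported as "missing directory" inside the container but present on the host is a path that was not mapped in.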

JerryDaHeLian commented 11 months ago

I have encountered the same problem. Can anyone help me?