Closed Yang-HangWA closed 5 years ago
That INFO message is normal behaviour and comes from the new external network plugin support we have added to the NCCL GitHub code base. It allows a vendor to create their own external network transport plugin for NCCL to use, e.g. https://github.com/aws/aws-ofi-nccl
After that message you see another INFO message, NCCL INFO Using internal Network [IB|Socket], which signifies that NCCL has fallen back to using one of its internal network transports.
@AddyLaddy Thanks for your reply. I am running the program on a machine with two GPUs, so NCCL has just fallen back to using one of its internal network transports and nothing is wrong? Even though I use CUDA 9.0?
Oh I see, if it's a single node job then the network transport is irrelevant. NCCL will use the node's internal systems to communicate between the GPUs (i.e. NVLink or PCI). CUDA 9.x and later should be fine too.
INFO lines are for information only (and basically for us). Only WARN lines should be paid attention to. That said, maybe we could catch that special return code (ENOENT) and print a less concerning message.
@sjeaugey @AddyLaddy In my case, I am trying to test the libnccl-net.so plugin but NCCL still reports not finding it. The location has been added to my LD_LIBRARY_PATH. Are there additional flags I need to set to help NCCL find the plugin?
Did you also make sure all libraries the plugin relies on (e.g. libfabric libraries) are in your LD_LIBRARY_PATH? The only thing we see is that dlopen() fails, but it could be that the library was found and dlopen() still could not load it.
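One way to see why the load fails, outside of NCCL, is to attempt the same dlopen() from a small script and print the loader's own error text. The sketch below is only an illustration and not part of NCCL: the helper name try_load is made up here, and it assumes Python's ctypes, whose OSError on Linux carries the dlerror() string. When the plugin file is found but one of its dependencies is not, that message typically names the missing dependency rather than the plugin.

```python
import ctypes


def try_load(library_name):
    """Try to dlopen a shared library and report the loader's error text.

    Returns (True, "") on success, or (False, message), where message is
    the text of the OSError raised by ctypes when the load fails.
    try_load is a hypothetical helper name used only for this sketch.
    """
    try:
        ctypes.CDLL(library_name)
    except OSError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    # libnccl-net.so is the plugin name discussed in this thread.
    ok, message = try_load("libnccl-net.so")
    print("loaded" if ok else "dlopen failed: " + message)
```

For the result to be meaningful, run it in the same environment as the failing job (same LD_LIBRARY_PATH, same launcher, same container if any).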
I have checked and libfabric.so is also in my LD_LIBRARY_PATH. I will see if anything else is missing from the path. Interestingly, nccl-tests builds and runs fine, but when using NCCL in TensorFlow/Horovod I get the "No plugin found" message for libnccl-net.so
Most probably it's an issue with mpirun not propagating the environment. First thing would be to make sure mpirun is launched with -x LD_LIBRARY_PATH.
Yes, I'm already launching mpirun with -x LD_LIBRARY_PATH, though I'm sure there must be something wrong with the paths (I have the file built from https://github.com/aws/aws-ofi-nccl and it passes nccl-tests). Is there any additional debug info I can get from NCCL specifically about why it couldn't load libnccl-net? I'm happy to apply a patch, as I'm building from source anyhow.
FYI, this nccl-test passes when run on 2 hosts but similar args with mpirun and Horovod do not load libnccl-net
$HOME/anaconda3/bin/mpirun \
-x FI_PROVIDER="efa" \
-x FI_OFI_RXR_RX_COPY_UNEXP=1 -x FI_OFI_RXR_RX_COPY_OOO=1 \
-x FI_EFA_MR_CACHE_ENABLE=1 -x FI_OFI_RXR_INLINE_MR_ENABLE=1 \
-x LD_LIBRARY_PATH=$HOME/aws-ofi-nccl/install/lib/:$HOME/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 --hostfile ~/hosts -n 16 -N 8 \
--mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
Unfortunately, I could not find a way to get a better message. Maybe try to print the content of dlerror() at https://github.com/NVIDIA/nccl/blob/master/src/init.cc#L106?
Could you be running Horovod from within a container (and nccl-tests on bare metal)? For the container case, is it possible that not all paths in LD_LIBRARY_PATH are mapped into the container?
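Both the mpirun and the container questions come down to what each process actually sees at run time. The following is a rough sketch for checking that, not an NCCL tool: check_library_path is a hypothetical helper that splits a colon-separated search path, reports which entries contain the plugin file, and reports which entries are not visible as directories from the current process (as would happen if a path were not mapped into a container).

```python
import os


def check_library_path(library_name, search_path):
    """Classify each entry of a colon-separated library search path.

    Returns (hits, missing_dirs): the directories that contain
    library_name, and the entries that are not directories visible
    from this process. Hypothetical helper for illustration only.
    """
    hits, missing_dirs = [], []
    for entry in search_path.split(":"):
        if not entry:
            continue  # empty entries (e.g. a trailing ':') are skipped here
        if not os.path.isdir(entry):
            missing_dirs.append(entry)
        elif os.path.isfile(os.path.join(entry, library_name)):
            hits.append(entry)
    return hits, missing_dirs


if __name__ == "__main__":
    hits, missing = check_library_path(
        "libnccl-net.so", os.environ.get("LD_LIBRARY_PATH", ""))
    print("found in:", hits)
    print("not visible here:", missing)
```

Launching this script with the same mpirun arguments as the Horovod job, in place of the training script, shows whether every rank sees the expected paths. An empty "found in" list would point to the environment rather than to NCCL itself.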
I have encountered the same problem. Can anyone help me?
I'm trying to run Keras (using TensorFlow as the backend) with NCCL, and it builds successfully. But when I try to run my application I keep getting the error: "Unable to load libnccl-net.so : libnccl-net.so: cannot open shared object file: No such file or directory". I use CUDA 9.0 and I see the info "NCCL version 2.3.7+cuda8.0", and I have set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/include:/home/yanghang/nccl/build/lib:/home/yanghang/nccl/build/lib:/usr/local/lib
I've installed NCCL using exactly these commands:
$ git clone https://github.com/NVIDIA/nccl.git
$ cd nccl
$ sudo make install -j4
I have seen the issue https://github.com/NVIDIA/nccl/issues/96, but I can't locate the problem. Can somebody help me?