Open a-c-dream opened 3 months ago
Hi,
Would you be able to try again with NCCL 2.21 (the latest NCCL release)?
My NCCL version is 2.19.3. NCCL was installed as a dependency of PyTorch; can you tell me how to upgrade it?
It should be as simple as:
git clone https://github.com/nvidia/nccl
cd nccl
make -j # will take a few minutes
export NCCL_HOME=$PWD/build
cd /path/to/nccl-tests
mpirun -x LD_LIBRARY_PATH=$NCCL_HOME/lib:$LD_LIBRARY_PATH <rest of the mpirun command line to run NCCL tests>
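Before launching, it may be worth confirming which NCCL version the fresh build actually produced. The sketch below is only an illustration: `nccl_header_version` is a hypothetical helper name, and it assumes the build places `nccl.h` under `$NCCL_HOME/include` with `NCCL_MAJOR`, `NCCL_MINOR` and `NCCL_PATCH` macros, as recent NCCL releases do.

```shell
# Hypothetical helper (not part of NCCL or nccl-tests).
# Prints the version recorded in <nccl_home>/include/nccl.h as MAJOR.MINOR.PATCH.
nccl_header_version() {
  hdr="$1/include/nccl.h"
  # Fail early if the header is not where we assume the build put it.
  [ -f "$hdr" ] || { echo "no nccl.h under $1/include" >&2; return 1; }
  # Pick the three version macros out of the header and join them with dots.
  awk '$1 == "#define" && $2 == "NCCL_MAJOR" { maj = $3 }
       $1 == "#define" && $2 == "NCCL_MINOR" { min = $3 }
       $1 == "#define" && $2 == "NCCL_PATCH" { pat = $3 }
       END { print maj "." min "." pat }' "$hdr"
}
```

Calling `nccl_header_version "$NCCL_HOME"` after the `export` above should print the version of the build that `LD_LIBRARY_PATH` points at. With `NCCL_DEBUG=INFO` set, NCCL also logs the version it loaded at run time, which can be compared against this value.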
In addition, I get the following error if I execute the command on server 232, and no error if I execute on server 233 (as shown in the first picture). Does that help solve the problem?
Looks like you're missing NCCL on one node. If you recompile it you should make sure it's visible from all nodes (e.g. it's on a shared NFS or copied over to every node at the same location).
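One way to check this is to look for the library at the same path on every node. The following is a minimal sketch under stated assumptions: `check_nccl_on_hosts` and `REMOTE_SH` are hypothetical names introduced here, and it assumes passwordless `ssh` to each host.

```shell
# Hypothetical helper (not part of NCCL or nccl-tests).
# REMOTE_SH is the command used to run something on a host; ssh by default.
REMOTE_SH=${REMOTE_SH:-ssh}

# Usage: check_nccl_on_hosts <libdir> <host>...
# Reports, per host, whether any libnccl.so* file exists in <libdir>.
# Returns non-zero if at least one host is missing the library.
check_nccl_on_hosts() {
  local libdir=$1
  shift
  local host status=0
  for host in "$@"; do
    if $REMOTE_SH "$host" "ls '$libdir'/libnccl.so* >/dev/null 2>&1"; then
      echo "$host: found"
    else
      echo "$host: MISSING"
      status=1
    fi
  done
  return $status
}
```

For the two servers in this thread, that would be something like `check_nccl_on_hosts "$NCCL_HOME/lib" 10.102.0.232 10.102.0.233`. A `MISSING` line means the directory passed via `LD_LIBRARY_PATH` does not contain NCCL on that node.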
As you can see, both servers have NCCL.
@sjeaugey I added -x LD_LIBRARY_PATH=$NCCL_HOME/lib:$LD_LIBRARY_PATH
and the error went away, but the test only runs on a single machine, not across both. Whichever server I launch it from, each process runs standalone instead of as part of one multi-node job. This is the command I used:
mpirun -np 2 -pernode \
--allow-run-as-root \
-host 10.102.0.233,10.102.0.232 \
-mca btl_tcp_if_include eno1 \
-x NCCL_SOCKET_IFNAME=eno1 \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH=/home/xxx/anaconda3/envs/new_pytorch/lib/python3.10/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH \
./build/all_reduce_perf NCCL_P2P_LEVEL=0 -b 8 -e 128M -f 2 -g 4 -c 0
The following error occurred when I used Open MPI to run nccl-tests across two nodes. Here are some log messages: