NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

./build/all_reduce_perf between nodes failed #180

Open sleepwalker2017 opened 9 months ago

sleepwalker2017 commented 9 months ago
mpirun --allow-run-as-root -np 4 -hostfile hostfile ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
[1700554343.440877] [10-192-80-132:7356 :0]            sock.c:323  UCX  ERROR   connect(fd=48, dest_addr=172.17.0.1:42887) failed: Connection refused
[1700554343.440876] [10-192-80-132:7355 :0]            sock.c:323  UCX  ERROR   connect(fd=48, dest_addr=172.17.0.1:40235) failed: Connection refused
[10-192-80-132.cls-gebgk6vq.ecp.shopeemobile.com:07356] pml_ucx.c:424  Error: ucp_ep_create(proc=3) failed: Destination is unreachable
[10-192-80-132.cls-gebgk6vq.ecp.shopeemobile.com:07356] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 3
[10-192-80-132.cls-gebgk6vq.ecp.shopeemobile.com:07355] pml_ucx.c:424  Error: ucp_ep_create(proc=2) failed: Destination is unreachable
[10-192-80-132.cls-gebgk6vq.ecp.shopeemobile.com:07355] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 2
[10-192-80-132:07356] *** An error occurred in MPI_Allgather
[10-192-80-132:07356] *** reported by process [1740308481,1]
[10-192-80-132:07356] *** on communicator MPI_COMM_WORLD
[10-192-80-132:07356] *** MPI_ERR_OTHER: known error not in list
[10-192-80-132:07356] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[10-192-80-132:07356] ***    and potentially your MPI job)
[1700554343.440970] [bms-airtrunk-d-g18v3-app-10-192-82-3:9188 :0]            sock.c:323  UCX  ERROR   connect(fd=48, dest_addr=172.17.0.1:63597) failed: Connection refused
[1700554343.440970] [bms-airtrunk-d-g18v3-app-10-192-82-3:9189 :0]            sock.c:323  UCX  ERROR   connect(fd=48, dest_addr=172.17.0.1:53503) failed: Connection refused
[bms-airtrunk-d-g18v3-app-10-192-82-3:09189] pml_ucx.c:424  Error: ucp_ep_create(proc=1) failed: Destination is unreachable
[bms-airtrunk-d-g18v3-app-10-192-82-3:09189] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 1
[bms-airtrunk-d-g18v3-app-10-192-82-3:09188] pml_ucx.c:424  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[bms-airtrunk-d-g18v3-app-10-192-82-3:09188] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 0
[10-192-80-132.cls-gebgk6vq.ecp.shopeemobile.com:07350] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[10-192-80-132.cls-gebgk6vq.ecp.shopeemobile.com:07350] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

When I run mpirun --allow-run-as-root -np 4 -hostfile hostfile echo "hello", it works fine.

How could this happen? Why would it try to connect to this address: dest_addr=172.17.0.1:42887? I'm confused by this. Any help would be appreciated.

Thank you!

mpirun --version
mpirun (Open MPI) 4.1.5rc2

AddyLaddy commented 9 months ago

hostname is not an MPI program, so running it rarely exposes problems in the MPI runtime. I often just download, compile, and run a simple MPI program instead, such as:

wget https://raw.githubusercontent.com/pmodels/mpich/main/examples/cpi.c
mpicc -o cpi cpi.c
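
Once built, the test program can be launched with the same mpirun options as the failing job. The following is only a sketch: it assumes the cpi binary sits at the same path on every host listed in hostfile (for example on a shared filesystem), and it reuses the rank count and hostfile name from the command at the top of this issue.

```shell
# Assumption: ./cpi exists at the same path on every host in hostfile.
NP=4   # same rank count as the failing all_reduce_perf run

if [ -x ./cpi ] && command -v mpirun >/dev/null 2>&1; then
  # Launch the plain MPI test across the nodes, without NCCL involved
  mpirun --allow-run-as-root -np "$NP" -hostfile hostfile ./cpi
else
  echo "build ./cpi and make sure mpirun is on PATH first" >&2
fi
```

If this run fails with the same UCX errors, the problem most likely lies in the MPI/UCX setup rather than in nccl-tests.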

Those errors look like they are coming from UCX. Try setting UCX_TLS=tcp, and possibly also set UCX_NET_DEVICES to the name of your Ethernet device.
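
The suggestion above could be applied roughly as follows. 172.17.0.1 is commonly the address of Docker's default docker0 bridge, which exists on each host separately and is not reachable from the other node. Assuming that is the case here, the sketch below finds which interface owns an address and then pins UCX to the real NIC. The helper name iface_for_ip, the interface name eth0, the /24 prefix, and the sample text are hypothetical, and 10.192.80.132 is only inferred from the hostname in the log.

```shell
# Print the interface that owns a given IPv4 address.
# Reads `ip -o -4 addr show` output on stdin:
# field 2 is the interface name, field 4 is ADDRESS/PREFIX.
iface_for_ip() {
  awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) print $2 }'
}

# On a real node: ip -o -4 addr show | iface_for_ip 172.17.0.1
# Hypothetical sample standing in for real `ip` output:
sample='2: eth0    inet 10.192.80.132/24 brd 10.192.80.255 scope global eth0
3: docker0    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0'

# Interface that UCX should not use (prints docker0 for this sample)
printf '%s\n' "$sample" | iface_for_ip 172.17.0.1

# Interface that carries the inter-node traffic (assumed address)
NET_IF=$(printf '%s\n' "$sample" | iface_for_ip 10.192.80.132)

# Restrict UCX to TCP over that interface; -x forwards the variables to all ranks.
# The leading echo only prints the command; remove it to launch.
echo mpirun --allow-run-as-root -np 4 -hostfile hostfile \
  -x UCX_TLS=tcp -x UCX_NET_DEVICES="$NET_IF" \
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```

Passing the variables with -x rather than a plain export matters because ranks started on the remote host do not otherwise inherit the local shell environment.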