NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

UCX fails when trying to run training across 2 nodes #140

Closed RamHPC closed 1 month ago

RamHPC commented 1 month ago

Environment:
Slurm - 23.11.5
OpenMPI - 5.0.3
PMIx - 5.0.2
Enroot - 3.4.1-1
UCX - 1.16.0

I am trying to run the image segmentation benchmark across 2 nodes and am running into UCX issues. When I run a simple CUDA application, I don't see any UCX errors. I followed https://github.com/NVIDIA/pyxis/wiki/Setup#slurmd-configuration to set up Slurm/Pyxis, and added echo "UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc" >> "${ENROOT_ENVIRON}" to the enroot hooks script.
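For readers unfamiliar with the setup, here is a minimal sketch of what such a hook might look like. The only line taken from this report is the echo itself; the file layout, the `set -eu` line, and the temp-file fallback are assumptions added so the sketch can be run outside enroot.

```shell
#!/bin/sh
# Hypothetical enroot hook sketch; only the echo line comes from this issue.
set -eu

# Hooks are handed ENROOT_ENVIRON, the container's environment file.
# Assumption: fall back to a temp file so the sketch also runs standalone.
: "${ENROOT_ENVIRON:=$(mktemp)}"

# Append the UCX transport list so processes inside the container inherit it.
echo "UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc" >> "${ENROOT_ENVIRON}"
```

Appending rather than overwriting keeps whatever earlier hooks already wrote to the environment file.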

Enabled UCX debug messages:
[1716594409.541417] [gpu1:1581053:0] ucp_worker.c:1783 UCX INFO ep_cfg[4]: tag(tcp/ib0 tcp/docker0)
[1716594409.541424] [gpu1:1581053:0] wireup.c:1192 UCX DEBUG ep 0x1550f5b9b280: am_lane 0 wireup_msg_lane 1 cm_lane keepalive_lane reachable_mds 0x1
[1716594409.541428] [gpu1:1581053:0] wireup.c:1215 UCX DEBUG ep 0x1550f5b9b280: lane[0]: 3:tcp/ib0.0 md[0] -> addr[1].md[0]/tcp/sysdev[255] rma_bw#0 am am_bw#0
[1716594409.541426] [gpu1:1581047:0] tcp_ep.c:259 UCX DEBUG tcp_ep 0x564a97356ff0: created on iface 0x564a96eaa120, fd -1
[1716594409.541430] [gpu1:1581047:0] tcp_cm.c:96 UCX DEBUG tcp_ep 0x564a97356ff0: CLOSED -> CONNECTING for the [172.17.0.1:59765]<->[172.17.0.1:47133]:0 connection [-:-]
[1716594409.541406] [gpu2:2200301:a] sock.c:399 UCX DEBUG [192.168.1.121:45797]<->[192.168.1.111:42620] is a connected pair
[1716594409.541424] [gpu2:2200301:a] tcp_ep.c:259 UCX DEBUG tcp_ep 0x55efd0971ed0: created on iface 0x55efd0cc0c50, fd 85
[1716594409.541428] [gpu2:2200301:a] tcp_cm.c:106 UCX DEBUG tcp_ep 0x55efd0971ed0: CLOSED -> RECV_MAGIC_NUMBER
[1716594409.541436] [gpu2:2200301:a] tcp_cm.c:821 UCX DEBUG tcp_iface 0x55efd0cc0c50: accepted connection from 192.168.1.111:42620 on 192.168.1.121:45797 to tcp_ep 0x55efd0971ed0 (fd 85)
[1716594409.541423] [gpu1:1581048:0] sock.c:323 UCX ERROR connect(fd=88, dest_addr=172.17.0.1:49637) failed: Connection refused
[1716594409.541432] [gpu1:1581053:0] wireup.c:1215 UCX DEBUG ep 0x1550f5b9b280: lane[1]: 0:tcp/docker0.0 md[0] -> addr[3].md[0]/tcp/sysdev[255] rma_bw#1 wireup
[1716594409.541436] [gpu1:1581053:0] tcp_ep.c:259 UCX DEBUG tcp_ep 0x55670b6d9250: created on iface 0x55670ba5af10, fd -1
[1716594409.541442] [gpu1:1581053:0] tcp_cm.c:96 UCX DEBUG tcp_ep 0x55670b6d9250: CLOSED -> CONNECTING for the [192.168.1.111:46925]<->[192.168.1.121:46037]:0 connection [-:-]
[1716594409.541463] [gpu2:2200300:a] sock.c:399 UCX DEBUG [192.168.1.121:48105]<->[192.168.1.111:53376] is a connected pair
[1716594409.541476] [gpu2:2200300:a] tcp_ep.c:259 UCX DEBUG tcp_ep 0x5649743c50c0: created on iface 0x5649747300b0, fd 94
[1716594409.541479] [gpu2:2200300:a] tcp_cm.c:106 UCX DEBUG tcp_ep 0x5649743c50c0: CLOSED -> RECV_MAGIC_NUMBER
[1716594409.541486] [gpu2:2200300:a] tcp_cm.c:821 UCX DEBUG tcp_iface 0x5649747300b0: accepted connection from 192.168.1.111:53376 on 192.168.1.121:48105 to tcp_ep 0x5649743c50c0 (fd 94)
[1716594409.541442] [gpu1:1581047:0] tcp_cm.c:96 UCX DEBUG tcp_ep 0x564a97356ff0: CONNECTING -> CONNECTING for the [172.17.0.1:59765]<->[172.17.0.1:47133]:0 connection [-:-]
[1716594409.541453] [gpu1:1581053:0] tcp_cm.c:96 UCX DEBUG tcp_ep 0x55670b6d9250: CONNECTING -> CONNECTING for the [192.168.1.111:46925]<->[192.168.1.121:46037]:0 connection [-:-]
[1716594409.541480] [gpu1:1581047:0] sock.c:323 UCX ERROR connect(fd=87, dest_addr=172.17.0.1:47133) failed: Connection refused
[gpu1:1581048] pml_ucx.c:419 Error: ucp_ep_create(proc=9) failed: Destination is unreachable
[gpu1:1581048] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 9
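To make the failing destinations easier to spot in a log like this, a small filter can pull the address out of each UCX ERROR line. This is only a sketch: the `log` variable and the `failed_hosts` name are hypothetical, and the three sample lines are copied from the output above. On a real run the same pipeline could be fed from the job's output file instead.

```shell
#!/bin/sh
# Hypothetical helper, not part of pyxis or UCX: list unique failing dest_addr hosts.
set -eu

# Sample lines copied verbatim from the debug output above (two errors, one non-error).
log='[1716594409.541423] [gpu1:1581048:0] sock.c:323 UCX ERROR connect(fd=88, dest_addr=172.17.0.1:49637) failed: Connection refused
[1716594409.541480] [gpu1:1581047:0] sock.c:323 UCX ERROR connect(fd=87, dest_addr=172.17.0.1:47133) failed: Connection refused
[1716594409.541442] [gpu1:1581053:0] tcp_cm.c:96 UCX DEBUG tcp_ep 0x55670b6d9250: CLOSED -> CONNECTING for the [192.168.1.111:46925]<->[192.168.1.121:46037]:0 connection [-:-]'

# Keep only UCX ERROR lines, extract the IP from dest_addr=IP:PORT, then de-duplicate.
failed_hosts=$(printf '%s\n' "$log" \
  | grep 'UCX ERROR' \
  | sed -n 's/.*dest_addr=\([0-9.]*\):[0-9]*.*/\1/p' \
  | sort -u)

echo "$failed_hosts"   # prints 172.17.0.1
```

In this log both refused connections target 172.17.0.1, and every attempt on that address has 172.17.0.1 on both ends, while the 192.168.1.x connections between gpu1 and gpu2 are accepted. The wireup lines also show a lane placed on tcp/docker0. Whether restricting the devices UCX uses (for example through its UCX_NET_DEVICES setting) would resolve the failure is an assumption that this thread does not confirm.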

Greatly appreciate any help/guidance on resolving this issue. Thank you!

RamHPC commented 1 month ago

This seems to be an issue with UCX itself and has nothing to do with the container. I am closing the issue here and will open one with UCX.