NVIDIA / CUDALibrarySamples

CUDA Library Samples

Hang at the cal_comm_create function while running an example with cusolverMp #183

Closed goodchong closed 2 months ago

goodchong commented 4 months ago

I am trying to run a sample with cuSOLVERMp, but the program hangs in the cal_comm_create function. Does anyone have any suggestions? Thank you. (Eight WeChat screenshots were attached to the original post.)

mrogowski commented 4 months ago

Do other MPI applications work properly?

Can you try running with higher debug levels, i.e., CAL_LOG_LEVEL=6 and UCC_LOG_LEVEL=DEBUG, and let us know if that gives more information about what is hanging?
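
A minimal sketch of how these two variables might be set for a single run. The binary name `./mp_getrf_getrs` and the rank count of 2 are assumptions borrowed from a run reported later in this thread, and `cal_debug.log` is a hypothetical file name; substitute your own. The launch is guarded, so the snippet only sets the variables when `mpirun` or the binary is absent.

```shell
# Raise CAL and UCC verbosity for this shell session.
export CAL_LOG_LEVEL=6
export UCC_LOG_LEVEL=DEBUG

# Hypothetical defaults; override SAMPLE_APP and SAMPLE_NP for your own build.
SAMPLE_APP="${SAMPLE_APP:-./mp_getrf_getrs}"
SAMPLE_NP="${SAMPLE_NP:-2}"

if command -v mpirun >/dev/null 2>&1 && [ -x "$SAMPLE_APP" ]; then
    # Capture stdout and stderr together so the last message before the hang is in one file.
    mpirun -n "$SAMPLE_NP" "$SAMPLE_APP" 2>&1 | tee cal_debug.log
else
    echo "mpirun or $SAMPLE_APP not found; variables are set but nothing was launched"
fi
```

Keeping both streams in one file makes it easier to see which rank printed the last message before the hang.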

ppandit95 commented 2 months ago

Hi Developers

I am also facing a similar issue in which the program hangs. After exporting the environment variables CAL_LOG_LEVEL=6 and UCC_LOG_LEVEL=DEBUG, I see the following output, and the program keeps running without finishing (the processes remain visible in nvidia-smi):

[2024-06-11 22:30:24][cal][76166][Api][cal_comm_create] allgather=0x402f80 nranks=2 rank=1 local_device=1 new_comm=0x7ffe0da4b3d8
[1718125224.335169] [node3:76166:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_tl_cuda.so
[1718125224.335938] [node3:76166:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_tl_nccl.so
aFTER Cal comm before cal mpi before cal com2
[2024-06-11 22:30:24][cal][76165][Api][cal_comm_create] allgather=0x402f80 nranks=2 rank=0 local_device=0 new_comm=0x7fff5847f658
[1718125224.337325] [node3:76165:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_tl_cuda.so
[1718125224.337561] [node3:76165:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_tl_nccl.so
[1718125224.339917] [node3:76166:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_mc_cuda.so
[1718125224.340145] [node3:76165:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_mc_cuda.so
[1718125224.340840] [node3:76166:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_ec_cuda.so
[1718125224.340969] [node3:76165:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_ec_cuda.so
[1718125224.343231] [node3:76165:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1718125224.343237] [node3:76165:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1718125224.343240] [node3:76165:0] ucc_proc_info.c:311 UCC DEBUG proc pid 76165, host node3, host_hash 12471863499651892538, sockid 0, numaid 255
[1718125224.343246] [node3:76165:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/11.8/share/ucc.conf
[1718125224.343232] [node3:76166:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1718125224.343238] [node3:76166:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1718125224.343240] [node3:76166:0] ucc_proc_info.c:311 UCC DEBUG proc pid 76166, host node3, host_hash 12471863499651892538, sockid 0, numaid 255
[1718125224.343246] [node3:76166:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/11.8/share/ucc.conf
[1718125224.343261] [node3:76166:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1718125224.343270] [node3:76166:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1718125224.343261] [node3:76165:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1718125224.343270] [node3:76165:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1718125224.343291] [node3:76165:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x362d210
[1718125224.343300] [node3:76165:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CAL_UCC_": initialized component "basic" score 10
[1718125224.343291] [node3:76166:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x1dbbfc0
[1718125224.343300] [node3:76166:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CAL_UCC_": initialized component "basic" score 10
[1718125224.343318] [node3:76165:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x361f950
[1718125224.343329] [node3:76165:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x3626ac0
[1718125224.343319] [node3:76166:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x1dab090
[1718125224.343332] [node3:76166:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x1db5800
[1718125224.343370] [node3:76165:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x27a96c0
[1718125224.343399] [node3:76165:0] ucc_context.c:242 UCC INFO required TL cuda is not part of the context
[1718125224.343401] [node3:76165:0] ucc_context.c:242 UCC INFO required TL nccl is not part of the context
[1718125224.343404] [node3:76165:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context
[1718125224.343405] [node3:76165:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context
[1718125224.343372] [node3:76166:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0xf386b0
[1718125224.343400] [node3:76166:0] ucc_context.c:242 UCC INFO required TL cuda is not part of the context
[1718125224.343404] [node3:76166:0] ucc_context.c:242 UCC INFO required TL nccl is not part of the context
[1718125224.343406] [node3:76166:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context
[1718125224.343407] [node3:76166:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context
[1718125224.352671] [node3:76165:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x2eee460
[1718125224.352683] [node3:76165:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x3634c50
[1718125224.352712] [node3:76166:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x167d550
[1718125224.352724] [node3:76166:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x1dc3950
[1718125224.352793] [node3:76165:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x366f640
[1718125224.352797] [node3:76165:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x366f640
[1718125224.352793] [node3:76166:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x1dfe340
[1718125224.352797] [node3:76166:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x1dfe340
[1718125224.352836] [node3:76165:0] tl_mlx5_ib.c:67 TL_MLX5 DEBUG no IB devices found
[1718125224.352841] [node3:76165:0] tl_mlx5_context.c:128 TL_MLX5 DEBUG failed to allocate ibv_context
[1718125224.352847] [node3:76165:0] tl_mlx5_context.c:286 TL_MLX5 DEBUG failed initialize tl context: 0x283dca0
[1718125224.352850] [node3:76165:0] ucc_context.c:812 UCC DEBUG ctx create epilog for mlx5 failed: Not found
[1718125224.352854] [node3:76165:0] tl_mlx5_context.c:68 TL_MLX5 DEBUG finalizing tl context: 0x283dca0

Also, the cuFFTMp code runs perfectly; I face this issue only when running the cuSOLVERMp code. Any help in this regard would be appreciated.

Many Thanks Pushkar

mrogowski commented 2 months ago

Please try running with UCC_TLS=^mlx5,sharp and let us know if you see any difference.
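
In UCC's selection syntax, a leading `^` excludes the listed transport layers instead of selecting them. A small sketch of setting the variable follows; the single quotes are only a precaution (an assumption about your shell, since some shells treat `^` specially under extended globbing) and are not something this thread requires.

```shell
# Exclude the mlx5 and sharp transport layers from UCC's selection.
export UCC_TLS='^mlx5,sharp'

# Print the value back to confirm it reached the environment unmodified.
printenv UCC_TLS
```

The variable must be exported in the shell that launches `mpirun` so that every rank inherits it.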

marsaev commented 2 months ago

@ppandit95

[1718125224.340840] [node3:76166:0]   ucc_component.c:55   UCC  DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_ec_cuda.so

This is a critical error, and it suggests that HPC-X is not loaded in the environment. Did you use

source hpcx-*init*.sh
hpcx_load

to load HPC-X to the environment?
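
One way to check for this symptom in a long debug log is to list the distinct UCC components that failed to load. The sketch below is only an illustration: it runs on four entries copied from the output quoted earlier in this thread, and the variable names are hypothetical. For a real run, replace the sample with your own captured output.

```shell
# Four entries copied from the debug output quoted in this thread.
sample_log=$(cat <<'EOF'
[1718125224.335169] [node3:76166:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_tl_cuda.so
[1718125224.335938] [node3:76166:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_tl_nccl.so
[1718125224.337325] [node3:76165:0] ucc_component.c:55 UCC DEBUG failed to load UCC component library: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/ucc/libucc_tl_cuda.so
[1718125224.343261] [node3:76166:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
EOF
)

# Keep only the load failures, strip everything up to the last '/',
# and collapse duplicates reported by different ranks.
failed_components=$(printf '%s\n' "$sample_log" \
    | grep 'failed to load UCC component library' \
    | sed 's|.*/||' \
    | sort -u)

printf '%s\n' "$failed_components"
```

If a component shows up in this list, running `ldd` on its full path and looking for `not found` entries may show which dependency is missing from the library path.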

ppandit95 commented 2 months ago

No, I haven't loaded it, @marsaev. Is it a different library from nvhpc?

ppandit95 commented 2 months ago

@mrogowski I tried the environment variable you suggested, but I still face the same issue: the debug output is the same as before, and the code just keeps running without producing any output.

ppandit95 commented 2 months ago

Hi @mrogowski and @marsaev

As you suggested, I loaded the HPC-X environment from the path given in the Makefile, but the cuSOLVERMp code still hangs, with the following output:

[2024-06-13 18:21:10][cal][18219][Api][cal_comm_create] allgather=0x402f80 nranks=2 rank=1 local_device=1 new_comm=0x7fff4dc2da78
[2024-06-13 18:21:10][cal][18218][Api][cal_comm_create] allgather=0x402f80 nranks=2 rank=0 local_device=0 new_comm=0x7ffe91076488
[1718283070.989846] [node3:18218:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1718283070.989855] [node3:18218:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1718283070.989860] [node3:18218:0] ucc_proc_info.c:311 UCC DEBUG proc pid 18218, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1718283070.989866] [node3:18218:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1718283070.989881] [node3:18218:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1718283070.989892] [node3:18218:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1718283070.989898] [node3:18218:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1718283070.989908] [node3:18218:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1718283070.990035] [node3:18219:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1718283070.990043] [node3:18219:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1718283070.990047] [node3:18219:0] ucc_proc_info.c:311 UCC DEBUG proc pid 18219, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1718283070.990053] [node3:18219:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1718283070.990067] [node3:18219:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1718283070.990076] [node3:18219:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1718283070.990080] [node3:18219:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1718283070.990088] [node3:18219:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1718283070.991183] [node3:18218:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized
[1718283070.991206] [node3:18218:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x1e566e0
[1718283070.991216] [node3:18218:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10
[1718283070.991231] [node3:18218:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x1b3ed80
[1718283070.991247] [node3:18218:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x1e43bc0
[1718283070.991255] [node3:18218:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x1e57330
[1718283070.991264] [node3:18218:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x1e4c6b0
[1718283070.991307] [node3:18218:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x1125540
[1718283070.991340] [node3:18218:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context
[1718283070.991343] [node3:18218:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context
[1718283070.991360] [node3:18219:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized
[1718283070.991381] [node3:18219:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x33088c0
[1718283070.991389] [node3:18219:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10
[1718283070.991404] [node3:18219:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x2ff1110
[1718283070.991420] [node3:18219:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x32f5da0
[1718283070.991431] [node3:18219:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x3309510
[1718283070.991439] [node3:18219:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x32fe890
[1718283070.991490] [node3:18219:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x25dd560
[1718283070.991525] [node3:18219:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context
[1718283070.991529] [node3:18219:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context
[1718283071.252420] [node3:18218:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x1e47200
[1718283071.252642] [node3:18218:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync
[1718283071.252938] [node3:18218:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x110c050
[1718283071.253371] [node3:18219:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x32f93e0
[1718283071.253515] [node3:18219:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync
[1718283071.253756] [node3:18219:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x25c4070
[1718283071.262706] [node3:18218:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0xf15760
[1718283071.262720] [node3:18218:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x1e77be0
[1718283071.262758] [node3:18219:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x23cd760
[1718283071.262771] [node3:18219:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x3329d20
[1718283071.262822] [node3:18219:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x3362f10
[1718283071.262828] [node3:18219:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x3362f10
[1718283071.262822] [node3:18218:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x1eb0c80
[1718283071.262828] [node3:18218:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x1eb0c80
[1718283071.262857] [node3:18218:0] tl_mlx5_ib.c:67 TL_MLX5 DEBUG no IB devices found
[1718283071.262864] [node3:18218:0] tl_mlx5_context.c:128 TL_MLX5 DEBUG failed to allocate ibv_context
[1718283071.262868] [node3:18218:0] tl_mlx5_context.c:286 TL_MLX5 DEBUG failed initialize tl context: 0x1156c40
[1718283071.262872] [node3:18218:0] ucc_context.c:812 UCC DEBUG ctx create epilog for mlx5 failed: Not found
[1718283071.262876] [node3:18218:0] tl_mlx5_context.c:68 TL_MLX5 DEBUG finalizing tl context: 0x1156c40

Any pointers in this regard would be useful.

Many Thanks Pushkar

marsaev commented 2 months ago

@ppandit95 Sorry for the delayed reply; glad you figured out how to load HPC-X.

Regarding your latest error: the environment output now looks good. Can you share how you compile and run the example? Also, can you share the output of nvidia-smi and nvidia-smi topo -m, so we can see which GPUs are used and how they are connected on the system?

ppandit95 commented 2 months ago

@marsaev Sorry for the late response. After loading the HPC-X environment, I get the following messages:

[2024-06-15 14:41:31][cal][34611][Api][cal_comm_create] allgather=0x402f80 nranks=2 rank=0 local_device=0 new_comm=0x7ffd9b8231e8
[2024-06-15 14:41:31][cal][34612][Api][cal_comm_create] allgather=0x402f80 nranks=2 rank=1 local_device=1 new_comm=0x7ffe4b582d28
[1718442691.942125] [node3:34611:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1718442691.942134] [node3:34611:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1718442691.942137] [node3:34611:0] ucc_proc_info.c:311 UCC DEBUG proc pid 34611, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1718442691.942143] [node3:34611:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1718442691.942159] [node3:34611:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1718442691.942171] [node3:34611:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1718442691.942177] [node3:34611:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1718442691.942186] [node3:34611:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1718442691.942125] [node3:34612:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1718442691.942134] [node3:34612:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1718442691.942138] [node3:34612:0] ucc_proc_info.c:311 UCC DEBUG proc pid 34612, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1718442691.942143] [node3:34612:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1718442691.942159] [node3:34612:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1718442691.942171] [node3:34612:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1718442691.942178] [node3:34612:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1718442691.942186] [node3:34612:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1718442691.943462] [node3:34611:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized
[1718442691.943486] [node3:34611:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x2a26300
[1718442691.943497] [node3:34611:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10
[1718442691.943470] [node3:34612:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized
[1718442691.943493] [node3:34612:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x22ed270
[1718442691.943502] [node3:34612:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10
[1718442691.943517] [node3:34612:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x1fd5c70
[1718442691.943513] [node3:34611:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x270ef40
[1718442691.943530] [node3:34611:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x2a087f0
[1718442691.943540] [node3:34611:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x2a27320
[1718442691.943549] [node3:34611:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x2a26540
[1718442691.943531] [node3:34612:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x22db420
[1718442691.943540] [node3:34612:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x22ee290
[1718442691.943549] [node3:34612:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x22ed4b0
[1718442691.943593] [node3:34612:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x15effe0
[1718442691.943629] [node3:34612:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context
[1718442691.943632] [node3:34612:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context
[1718442691.943594] [node3:34611:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x1d22ee0
[1718442691.943628] [node3:34611:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context
[1718442691.943632] [node3:34611:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context
[1718442692.217448] [node3:34611:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x2a1c290
[1718442692.217501] [node3:34612:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x22e3200
[1718442692.217557] [node3:34611:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync
[1718442692.217602] [node3:34612:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync
[1718442692.217878] [node3:34611:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x2a2b070
[1718442692.217881] [node3:34612:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x22f1f30
[1718442692.227832] [node3:34612:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x164dd10
[1718442692.227845] [node3:34612:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x230f150
[1718442692.227853] [node3:34611:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x1e3f100
[1718442692.227864] [node3:34611:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x2a483b0
[1718442692.227928] [node3:34612:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x2347810
[1718442692.227933] [node3:34612:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x2347810
[1718442692.227928] [node3:34611:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x2a80a00
[1718442692.227934] [node3:34611:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x2a80a00
[1718442692.227964] [node3:34611:0] tl_mlx5_ib.c:67 TL_MLX5 DEBUG no IB devices found
[1718442692.227970] [node3:34611:0] tl_mlx5_context.c:128 TL_MLX5 DEBUG failed to allocate ibv_context
[1718442692.227975] [node3:34611:0] tl_mlx5_context.c:286 TL_MLX5 DEBUG failed initialize tl context: 0x1d24bd0
[1718442692.227978] [node3:34611:0] ucc_context.c:812 UCC DEBUG ctx create epilog for mlx5 failed: Not found
[1718442692.227982] [node3:34611:0] tl_mlx5_context.c:68 TL_MLX5 DEBUG finalizing tl context: 0x1d24bd0

The procedure I followed to run the cuSOLVERMp examples is as follows:

  1. Loaded nvhpc-24.3:

module load hpc_sdk/nvhpc-24.3

  2. Set the environment variables that helped me run the cuFFTMp examples:

NVHPC_CUDA_VERSION=12.3
export NVHPC_COMM_LIBS_HOME=${NVHPC_ROOT}/comm_libs
export MPI_HOME=${NVHPC_ROOT}/comm_libs/${NVHPC_CUDA_VERSION}/hpcx/latest/ompi
export CUFFT_LIB=${NVHPC_ROOT}/math_libs/lib64
export CUFFT_INC=${NVHPC_ROOT}/math_libs/include/cufftmp
export NVSHMEM_LIB=${NVHPC_ROOT}/comm_libs/${NVHPC_CUDA_VERSION}/nvshmem/lib
export NVHPC_CUDA_HOME=${NVHPC_ROOT}/cuda
export NVSHMEM_INC=${NVHPC_ROOT}/comm_libs/${NVHPC_CUDA_VERSION}/nvshmem/include
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${MPI_HOME}/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${NVHPC_ROOT}/comm_libs/${NVHPC_CUDA_VERSION}/nccl/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUFFT_LIB}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${NVHPC_CUDA_HOME}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${NVSHMEM_LIB}

To link to the correct version of MPI:

export PATH=${NVHPC_ROOT}/comm_libs/${NVHPC_CUDA_VERSION}/hpcx/latest/ompi/bin:$PATH

  3. Loaded the HPC-X environment:

source /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/latest/hpcx-init-ompi.sh
hpcx_load

  4. Exported the library path for GDRCopy:

export LD_LIBRARY_PATH=/export/apps/libs/gdrcopy/lib/:${LD_LIBRARY_PATH}

  5. Exported the debug flags:

export CAL_LOG_LEVEL=6
export UCC_LOG_LEVEL=DEBUG
export UCC_TLS=^mlx5,sharp

  6. Compiled the cuSOLVERMp code with 'make'.

  7. Ran the example:

mpirun -n 2 -mca coll_hcoll_enable 0 ./mp_getrf_getrs

Any help in this regard would be appreciated.

Many Thanks Pushkar

ppandit95 commented 2 months ago

As you requested, @marsaev, the output of 'nvidia-smi' is:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:01:00.0 Off |                    0 |
| N/A   28C    P0             61W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:41:00.0 Off |                    0 |
| N/A   27C    P0             60W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off |   00000000:81:00.0 Off |                    0 |
| N/A   28C    P0             62W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off |   00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0             57W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Moreover, the output of 'nvidia-smi topo -m' is:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     NV4     24-31,88-95     3               N/A
GPU1    NV4      X      NV4     NV4     8-15,72-79      1               N/A
GPU2    NV4     NV4      X      NV4     56-63,120-127   7               N/A
GPU3    NV4     NV4     NV4      X      40-47,104-111   5               N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
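
To read a topology matrix like this at a glance, one option is to count, for each GPU, how many peers it reaches over NVLink. The sketch below is illustrative only: it parses the matrix as posted in this comment, and the variable names are hypothetical. On a live system the same awk program could be fed from `nvidia-smi topo -m`.

```shell
# Topology matrix as posted in this comment.
topo=$(cat <<'EOF'
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     NV4     24-31,88-95     3               N/A
GPU1    NV4      X      NV4     NV4     8-15,72-79      1               N/A
GPU2    NV4     NV4      X      NV4     56-63,120-127   7               N/A
GPU3    NV4     NV4     NV4      X      40-47,104-111   5               N/A
EOF
)

# For each GPU row (the header starts with spaces, so it is skipped),
# count the cells of the form NV<number>, i.e. NVLink connections.
summary=$(printf '%s\n' "$topo" | awk '
    /^GPU[0-9]+/ {
        links = 0
        for (i = 2; i <= NF; i++) if ($i ~ /^NV[0-9]+$/) links++
        print $1 ": " links " NVLink peers"
    }')

printf '%s\n' "$summary"
```

For the matrix above, every GPU reports three NVLink peers, meaning all four GPUs are directly connected to one another.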

goodchong commented 2 months ago

@ppandit95

I finally got the program to run with:

mpirun -n 2 -mca btl ^openib -mca pml ^ucx -mca coll ^hcoll ./program

Well... it starts to move.
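
In Open MPI's MCA syntax, a leading `^` excludes the named components, so this command disables the openib BTL, the UCX PML, and the hcoll collective component. The sketch below assembles the same command from its parts; `./program` is the placeholder name used above, the variable names are hypothetical, and the launch is guarded so nothing runs where `mpirun` or the binary is absent.

```shell
# Components to exclude, exactly as in the command above.
MCA_FLAGS="-mca btl ^openib -mca pml ^ucx -mca coll ^hcoll"

# Placeholder binary and rank count; override for your own build.
PROGRAM="${PROGRAM:-./program}"
RANKS="${RANKS:-2}"

# Show the full command line before launching it.
CMD="mpirun -n $RANKS $MCA_FLAGS $PROGRAM"
echo "$CMD"

if command -v mpirun >/dev/null 2>&1 && [ -x "$PROGRAM" ]; then
    # $MCA_FLAGS is left unquoted on purpose so it splits into separate arguments.
    mpirun -n "$RANKS" $MCA_FLAGS "$PROGRAM"
fi
```

Removing one exclusion at a time from `MCA_FLAGS` is one way to find out which component is involved in the hang.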

ppandit95 commented 2 months ago

@goodchong Thanks a lot for the suggestion. My program also started to work properly, but I am still wondering what was going wrong.

ppandit95 commented 2 months ago

Unfortunately, when I tried to run with n=3, I encountered the following output:

pushkar@node3:~/CUDALibrarySamples/cuSOLVERMp$ export UCC_TLS=^mlx5,sharp
pushkar@node3:~/CUDALibrarySamples/cuSOLVERMp$ export CAL_LOG_LEVEL=6
pushkar@node3:~/CUDALibrarySamples/cuSOLVERMp$ export UCC_LOG_LEVEL=DEBUG
pushkar@node3:~/CUDALibrarySamples/cuSOLVERMp$ mpirun -n 4 -mca btl ^openib -mca pml ^ucx -mca coll ^hcoll ./mp_potrf_potrs
Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=3 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=3 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=3 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=3 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
[2024-06-23 12:51:10][cal][116664][Api][cal_comm_create] allgather=0x402fc0 nranks=4 rank=1 local_device=1 new_comm=0x7ffe3ab8d2b8
[2024-06-23 12:51:10][cal][116663][Api][cal_comm_create] allgather=0x402fc0 nranks=4 rank=0 local_device=0 new_comm=0x7ffc8cee3858
[2024-06-23 12:51:10][cal][116666][Api][cal_comm_create] allgather=0x402fc0 nranks=4 rank=3 local_device=3 new_comm=0x7ffdf99954d8
[2024-06-23 12:51:10][cal][116665][Api][cal_comm_create] allgather=0x402fc0 nranks=4 rank=2 local_device=2 new_comm=0x7ffd36b9ef08
[1719127270.283940] [node3:116664:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1719127270.283950] [node3:116664:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1719127270.283956] [node3:116664:0] ucc_proc_info.c:311 UCC DEBUG proc pid 116664, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1719127270.283962] [node3:116664:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1719127270.283977] [node3:116664:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1719127270.283989] [node3:116664:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1719127270.283996] [node3:116664:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1719127270.283931] [node3:116665:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1719127270.283942] [node3:116665:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1719127270.283945] [node3:116665:0] ucc_proc_info.c:311 UCC DEBUG proc pid 116665, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1719127270.283952] [node3:116665:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1719127270.283968] [node3:116665:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1719127270.283981] [node3:116665:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1719127270.283988] [node3:116665:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1719127270.283998] [node3:116665:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1719127270.283960] [node3:116663:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1719127270.283971] [node3:116663:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1719127270.283974] [node3:116663:0] ucc_proc_info.c:311 UCC DEBUG proc pid 116663, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1719127270.283982] [node3:116663:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1719127270.283999] [node3:116663:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1719127270.284011] [node3:116663:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1719127270.284018] [node3:116663:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1719127270.284028] [node3:116663:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1719127270.283952] [node3:116666:0] ucc_proc_info.c:223 UCC DEBUG libnuma.so: cannot open shared object file: No such file or directory
[1719127270.283964] [node3:116666:0] ucc_proc_info.c:306 UCC DEBUG failed to get bound numa id
[1719127270.283967] [node3:116666:0] ucc_proc_info.c:311 UCC DEBUG proc pid 116666, host node3, host_hash 474198595611230941, sockid 0, numaid 255
[1719127270.283973] [node3:116666:0] ucc_constructor.c:186 UCC INFO version: 1.3.1, loaded from: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/comm_libs/12.3/hpcx/hpcx-2.17.1/ucc/lib/libucc.so.1, cfg file: /export/apps/nvidia/hpc_sdk/Linux_x86_64/24.3/math_libs/share/ucc.conf
[1719127270.283987] [node3:116666:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1719127270.284000] [node3:116666:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1719127270.284007] [node3:116666:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1719127270.284017] [node3:116666:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1719127270.284006] [node3:116664:0] ucc_ec.c:60 UCC DEBUG ec cpu ec initialized
[1719127270.285332] [node3:116664:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized
[1719127270.285360] [node3:116664:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x38debd0
[1719127270.285350] [node3:116665:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized
[1719127270.285377] [node3:116665:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x2b17a10
[1719127270.285372] [node3:116664:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10
[1719127270.285389] [node3:116664:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x35c7f50
[1719127270.285406] [node3:116664:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x38c18d0
[1719127270.285363] [node3:116663:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized
[1719127270.285393] [node3:116663:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x2148310
[1719127270.285405] [node3:116663:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10
[1719127270.285390] [node3:116665:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10
[1719127270.285408] [node3:116665:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x2801050
[1719127270.285426] [node3:116665:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x2b064d0
[1719127270.285436] [node3:116665:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x2b17740
[1719127270.285446] [node3:116665:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x2b0d740
[1719127270.285416] [node3:116664:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x38de900
[1719127270.285426] [node3:116664:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x38d4900
[1719127270.285424] [node3:116663:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x1e31350
[1719127270.285443] [node3:116663:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x2134580
[1719127270.285453] [node3:116663:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x2148040
[1719127270.285463] [node3:116663:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x213e040
[1719127270.285473] [node3:116664:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x2b0d180
[1719127270.285493] [node3:116665:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x1d46240
[1719127270.285512] [node3:116663:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized
lib object: 0x1376240 [1719127270.285514] [node3:116664:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context [1719127270.285518] [node3:116664:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context [1719127270.285536] [node3:116665:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context [1719127270.285540] [node3:116665:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context [1719127270.285556] [node3:116663:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context [1719127270.285559] [node3:116663:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context [1719127270.285380] [node3:116666:0] ucc_ec.c:60 UCC DEBUG ec cuda ec initialized [1719127270.285407] [node3:116666:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x2931c00 [1719127270.285419] [node3:116666:0] ucc_lib.c:152 UCC DEBUG lib_prefix "CALUCC": initialized component "basic" score 10 [1719127270.285436] [node3:116666:0] tl_cuda_lib.c:35 TL_CUDA DEBUG initialized lib object: 0x261adc0 [1719127270.285453] [node3:116666:0] tl_mlx5_lib.c:19 TL_MLX5 DEBUG initialized lib object: 0x2913ae0 [1719127270.285462] [node3:116666:0] tl_nccl_lib.c:16 TL_NCCL DEBUG initialized lib object: 0x2931930 [1719127270.285472] [node3:116666:0] tl_self_lib.c:20 TL_SELF DEBUG initialized lib object: 0x2927930 [1719127270.285524] [node3:116666:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x1b60200 [1719127270.285567] [node3:116666:0] ucc_context.c:242 UCC INFO required TL sharp is not part of the context [1719127270.285569] [node3:116666:0] ucc_context.c:242 UCC INFO required TL hcoll is not part of the context [1719127270.624804] [node3:116665:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x2b08440 [1719127270.627997] [node3:116665:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.628020] [node3:116665:0] ucc_context.c:407 UCC DEBUG failed to create tl 
context for mlx5 [1719127270.628034] [node3:116665:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync [1719127270.633558] [node3:116665:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x2b29330 [1719127270.650259] [node3:116663:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x2138c60 [1719127270.650356] [node3:116664:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x38cf520 [1719127270.650363] [node3:116666:0] tl_cuda_context.c:71 TL_CUDA DEBUG initialized tl context: 0x29224c0 [1719127270.652892] [node3:116666:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.652913] [node3:116666:0] ucc_context.c:407 UCC DEBUG failed to create tl context for mlx5 [1719127270.652927] [node3:116666:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync [1719127270.653279] [node3:116663:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.653305] [node3:116663:0] ucc_context.c:407 UCC DEBUG failed to create tl context for mlx5 [1719127270.653326] [node3:116663:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync [1719127270.653325] [node3:116664:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.653346] [node3:116664:0] ucc_context.c:407 UCC DEBUG failed to create tl context for mlx5 [1719127270.653361] [node3:116664:0] tl_nccl_context.c:182 TL_NCCL DEBUG using memops completion sync [1719127270.653402] [node3:116666:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x2943520 [1719127270.653898] [node3:116663:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x2159c30 [1719127270.653925] [node3:116664:0] tl_nccl_context.c:205 TL_NCCL DEBUG initialized tl context: 0x38b23d0 [1719127270.717007] [node3:116665:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x2156840 [1719127270.717034] [node3:116665:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, skipping 
[1719127270.717040] [node3:116665:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x2b451d0 [1719127270.724986] [node3:116666:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x1f705b0 [1719127270.725012] [node3:116666:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, skipping [1719127270.725019] [node3:116666:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x295f3e0 [1719127270.735773] [node3:116664:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x2f1d600 [1719127270.735812] [node3:116664:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, skipping [1719127270.735819] [node3:116664:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x390c320 [1719127270.737625] [node3:116663:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x17869f0 [1719127270.737683] [node3:116663:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, skipping [1719127270.737695] [node3:116663:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x2175a00 [1719127270.737984] [node3:116664:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x3946700 [1719127270.737992] [node3:116664:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x3946700 [1719127270.737997] [node3:116664:0] ucc_context.c:833 UCC DEBUG created ucc context 0x38df140 for lib CALUCC [1719127270.737984] [node3:116665:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x1e61610 [1719127270.737994] [node3:116665:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x1e61610 [1719127270.737999] [node3:116665:0] ucc_context.c:833 UCC DEBUG created ucc context 0x2b17f80 for lib CALUCC [1719127270.737986] [node3:116666:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x299a370 [1719127270.737996] [node3:116666:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x299a370 [1719127270.738001] [node3:116666:0] ucc_context.c:833 UCC DEBUG created ucc context 0x2932170 for lib 
CALUCC [1719127270.738011] [node3:116663:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x1492700 [1719127270.738022] [node3:116663:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x1492700 [1719127270.738027] [node3:116663:0] ucc_context.c:833 UCC DEBUG created ucc context 0x2148880 for lib CALUCC [1719127270.739985] [node3:116664:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.739995] [node3:116664:0] ucc_context.c:407 UCC DEBUG failed to create tl context for mlx5 [1719127270.740066] [node3:116666:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.740076] [node3:116666:0] ucc_context.c:407 UCC DEBUG failed to create tl context for mlx5 [1719127270.740138] [node3:116665:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.740148] [node3:116665:0] ucc_context.c:407 UCC DEBUG failed to create tl context for mlx5 [1719127270.740927] [node3:116663:0] tl_mlx5_context.c:47 TL_MLX5 DEBUG failed to create rcache [1719127270.740937] [node3:116663:0] ucc_context.c:407 UCC DEBUG failed to create tl context for mlx5 [1719127270.754869] [node3:116666:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x1b9e840 [1719127270.754877] [node3:116666:0] cl_basic_context.c:39 CL_BASIC DEBUG TL cuda context is not available, skipping [1719127270.754880] [node3:116666:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, skipping [1719127270.754881] [node3:116666:0] cl_basic_context.c:39 CL_BASIC DEBUG TL nccl context is not available, skipping [1719127270.754883] [node3:116666:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x29a5890 [1719127270.758228] [node3:116664:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x2b259c0 [1719127270.758236] [node3:116664:0] cl_basic_context.c:39 CL_BASIC DEBUG TL cuda context is not available, skipping [1719127270.758238] [node3:116664:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, 
skipping [1719127270.758240] [node3:116664:0] cl_basic_context.c:39 CL_BASIC DEBUG TL nccl context is not available, skipping [1719127270.758241] [node3:116664:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x3951c20 [1719127270.760412] [node3:116663:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x13b16c0 [1719127270.760420] [node3:116663:0] cl_basic_context.c:39 CL_BASIC DEBUG TL cuda context is not available, skipping [1719127270.760423] [node3:116663:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, skipping [1719127270.760424] [node3:116663:0] cl_basic_context.c:39 CL_BASIC DEBUG TL nccl context is not available, skipping [1719127270.760426] [node3:116663:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x21bb730 [1719127270.761479] [node3:116665:0] tl_ucp_context.c:276 TL_UCP DEBUG initialized tl context: 0x1d84840 [1719127270.761488] [node3:116665:0] cl_basic_context.c:39 CL_BASIC DEBUG TL cuda context is not available, skipping [1719127270.761490] [node3:116665:0] cl_basic_context.c:39 CL_BASIC DEBUG TL mlx5 context is not available, skipping [1719127270.761492] [node3:116665:0] cl_basic_context.c:39 CL_BASIC DEBUG TL nccl context is not available, skipping [1719127270.761493] [node3:116665:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x2b8b030 [1719127270.761529] [node3:116665:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x2bbf3d0 [1719127270.761532] [node3:116665:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x2bbf3d0 [1719127270.761534] [node3:116665:0] ucc_context.c:833 UCC DEBUG created ucc context 0x1e61b20 for lib CALUCC [1719127270.761526] [node3:116663:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x21efac0 [1719127270.761530] [node3:116663:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x21efac0 [1719127270.761531] [node3:116663:0] ucc_context.c:833 UCC DEBUG created ucc context 0x1492c10 for lib CALUCC [1719127270.761547] 
[node3:116663:0] ucc_team.c:370 UCC DEBUG team 0x2148e50 rank 0, ctx_rank 0, map_type 3 [1719127270.761527] [node3:116666:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x29d9bc0 [1719127270.761533] [node3:116666:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x29d9bc0 [1719127270.761535] [node3:116666:0] ucc_context.c:833 UCC DEBUG created ucc context 0x299a880 for lib CALUCC [1719127270.761549] [node3:116666:0] ucc_team.c:370 UCC DEBUG team 0x2932740 rank 3, ctx_rank 3, map_type 3 [1719127270.761529] [node3:116664:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x3986130 [1719127270.761535] [node3:116664:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x3986130 [1719127270.761537] [node3:116664:0] ucc_context.c:833 UCC DEBUG created ucc context 0x3946bf0 for lib CALUCC [1719127270.761550] [node3:116664:0] ucc_team.c:370 UCC DEBUG team 0x38df670 rank 1, ctx_rank 1, map_type 3 [1719127270.761551] [node3:116665:0] ucc_team.c:370 UCC DEBUG team 0x2b18550 rank 2, ctx_rank 2, map_type 3 [1719127270.762806] [node3:116665:0] tl_cuda_team.c:109 TL_CUDA DEBUG posted tl team: 0x2bc0390 [1719127270.762813] [node3:116665:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x1d5d6f0 [1719127270.762823] [node3:116666:0] tl_cuda_team.c:109 TL_CUDA DEBUG posted tl team: 0x29dab80 [1719127270.762829] [node3:116666:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x1b776f0 [1719127270.762837] [node3:116664:0] tl_cuda_team.c:109 TL_CUDA DEBUG posted tl team: 0x39870d0 [1719127270.762843] [node3:116664:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x29857b0 [1719127270.762905] [node3:116663:0] tl_cuda_team.c:109 TL_CUDA DEBUG posted tl team: 0x21f0a80 [1719127270.762912] [node3:116663:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x138e560 [1719127270.763954] [node3:116666:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x29058a0 [0x7f75e4000000..0x7f75e5000000] size:16777216 [1719127270.764232] [node3:116665:0] 
tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x2aeb900 [0x7f75e4000000..0x7f75e5000000] size:16777216 [1719127270.765624] [node3:116666:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x2904fd0 [0x7fc75e000000..0x7fc75f000000] size:16777216 [1719127270.766050] [node3:116665:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x2aeb0a0 [0x7fc75e000000..0x7fc75f000000] size:16777216 [1719127270.767922] [node3:116666:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x2a084b0 [0x7f1472000000..0x7f1473000000] size:16777216 [1719127270.767934] [node3:116666:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:c1:00.0 (3) to dev 0000:01:00.0 (0): 4 direct links [1719127270.767936] [node3:116666:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:c1:00.0 (3) to dev 0000:41:00.0 (1): 4 direct links [1719127270.767939] [node3:116666:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:c1:00.0 (3) to dev 0000:81:00.0 (2): 4 direct links [1719127270.767941] [node3:116666:0] tl_cuda_team_topo.c:445 TL_CUDA DEBUG dev 0000:c1:00.0 (3) to dev 0000:c1:00.0 (3): same device [1719127270.767943] [node3:116666:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 0: 3 send to 0 [1719127270.767945] [node3:116666:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 1: 3 send to 2 [1719127270.768606] [node3:116664:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x38b29b0 [0x7f75e4000000..0x7f75e5000000] size:16777216 [1719127270.769741] [node3:116663:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x211bc70 [0x7fc75e000000..0x7fc75f000000] size:16777216 [1719127270.770155] [node3:116664:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x38b2160 [0x7f1472000000..0x7f1473000000] size:16777216 [1719127270.771536] [node3:116664:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x39b3d10 [0x7f7082000000..0x7f7083000000] size:16777216 [1719127270.771548] 
[node3:116664:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:41:00.0 (1) to dev 0000:01:00.0 (0): 4 direct links [1719127270.771551] [node3:116664:0] tl_cuda_team_topo.c:445 TL_CUDA DEBUG dev 0000:41:00.0 (1) to dev 0000:41:00.0 (1): same device [1719127270.771553] [node3:116664:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:41:00.0 (1) to dev 0000:81:00.0 (2): 4 direct links [1719127270.771555] [node3:116664:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:41:00.0 (1) to dev 0000:c1:00.0 (3): 4 direct links [1719127270.771558] [node3:116664:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 0: 1 send to 2 [1719127270.771561] [node3:116664:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 1: 1 send to 0 [1719127270.771955] [node3:116663:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x211b420 [0x7f1472000000..0x7f1473000000] size:16777216 [1719127270.772923] [node3:116663:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x221dfd0 [0x7f7082000000..0x7f7083000000] size:16777216 [1719127270.772935] [node3:116663:0] tl_cuda_team_topo.c:445 TL_CUDA DEBUG dev 0000:01:00.0 (0) to dev 0000:01:00.0 (0): same device [1719127270.772938] [node3:116663:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:01:00.0 (0) to dev 0000:41:00.0 (1): 4 direct links [1719127270.772940] [node3:116663:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:01:00.0 (0) to dev 0000:81:00.0 (2): 4 direct links [1719127270.772942] [node3:116663:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:01:00.0 (0) to dev 0000:c1:00.0 (3): 4 direct links [1719127270.772945] [node3:116663:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 0: 0 send to 1 [1719127270.772948] [node3:116663:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 1: 0 send to 3 [1719127270.773531] [node3:116665:0] tl_cuda_cache.c:277 UCC DEBUG ipc-cache: tl_cuda cache new region:0x2bed8e0 [0x7f7082000000..0x7f7083000000] size:16777216 [1719127270.773544] [node3:116665:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG 
dev 0000:81:00.0 (2) to dev 0000:01:00.0 (0): 4 direct links [1719127270.773546] [node3:116665:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:81:00.0 (2) to dev 0000:41:00.0 (1): 4 direct links [1719127270.773548] [node3:116665:0] tl_cuda_team_topo.c:445 TL_CUDA DEBUG dev 0000:81:00.0 (2) to dev 0000:81:00.0 (2): same device [1719127270.773550] [node3:116665:0] tl_cuda_team_topo.c:451 TL_CUDA DEBUG dev 0000:81:00.0 (2) to dev 0000:c1:00.0 (3): 4 direct links [1719127270.773553] [node3:116665:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 0: 2 send to 3 [1719127270.773556] [node3:116665:0] tl_cuda_team_topo.c:483 TL_CUDA DEBUG ring 1: 2 send to 1 [1719127270.774489] [node3:116664:0] tl_cuda_team.c:314 TL_CUDA DEBUG initialized tl team: 0x39870d0 [1719127270.774528] [node3:116663:0] tl_cuda_team.c:314 TL_CUDA DEBUG initialized tl team: 0x21f0a80 [1719127270.774569] [node3:116665:0] tl_cuda_team.c:314 TL_CUDA DEBUG initialized tl team: 0x2bc0390 [1719127270.774586] [node3:116666:0] tl_cuda_team.c:314 TL_CUDA DEBUG initialized tl team: 0x29dab80 [1719127273.519363] [node3:116664:0] tl_nccl_team.c:177 TL_NCCL DEBUG initialized tl team: 0x2b245c0 [1719127273.519408] [node3:116664:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.519372] [node3:116663:0] tl_nccl_team.c:177 TL_NCCL DEBUG initialized tl team: 0x138e9f0 [1719127273.519422] [node3:116663:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.519478] [node3:116663:0] tl_shm_team.c:158 TL_SHM DEBUG using perf params: generic [1719127273.519383] [node3:116665:0] tl_nccl_team.c:177 TL_NCCL DEBUG initialized tl team: 0x1d5d1f0 [1719127273.519434] [node3:116665:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.519408] [node3:116666:0] tl_nccl_team.c:177 TL_NCCL DEBUG initialized tl team: 0x1b771f0 [1719127273.519475] [node3:116666:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.519598] 
[node3:116665:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.519604] [node3:116665:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x2a8f070 [1719127273.519606] [node3:116665:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x2a8f070 [1719127273.519609] [node3:116665:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl cuda team [1719127273.519610] [node3:116665:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl nccl team [1719127273.519612] [node3:116665:0] cl_basic_team.c:126 CL_BASIC DEBUG failed to create tl self team: (-1) [1719127273.519613] [node3:116665:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.519615] [node3:116665:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl ucp team [1719127273.519597] [node3:116664:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.519603] [node3:116664:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x3856100 [1719127273.519605] [node3:116664:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x3856100 [1719127273.519607] [node3:116664:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl cuda team [1719127273.519609] [node3:116664:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl nccl team [1719127273.519611] [node3:116664:0] cl_basic_team.c:126 CL_BASIC DEBUG failed to create tl self team: (-1) [1719127273.519613] [node3:116664:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.519614] [node3:116664:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl ucp team [1719127273.519598] [node3:116663:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.519604] [node3:116663:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x20bf3e0 [1719127273.519606] [node3:116663:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x20bf3e0 [1719127273.519608] [node3:116663:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl cuda team [1719127273.519610] [node3:116663:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl nccl team 
[1719127273.519612] [node3:116663:0] cl_basic_team.c:126 CL_BASIC DEBUG failed to create tl self team: (-1) [1719127273.519613] [node3:116663:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.519615] [node3:116663:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl ucp team [1719127273.519599] [node3:116666:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.519606] [node3:116666:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x28a8f60 [1719127273.519609] [node3:116666:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x28a8f60 [1719127273.519613] [node3:116666:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl cuda team [1719127273.519615] [node3:116666:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl nccl team [1719127273.519618] [node3:116666:0] cl_basic_team.c:126 CL_BASIC DEBUG failed to create tl self team: (-1) [1719127273.519621] [node3:116666:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.519626] [node3:116666:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl ucp team [1719127273.519682] [node3:116664:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.519686] [node3:116664:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.519687] [node3:116664:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed [1719127273.519683] [node3:116663:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.519688] [node3:116663:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.519689] [node3:116663:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed [1719127273.519691] [node3:116665:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.519695] [node3:116665:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.519697] [node3:116665:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support 
for memory type cuda-managed [1719127273.519716] [node3:116666:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.519721] [node3:116666:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.519724] [node3:116666:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed [1719127273.519766] [node3:116663:0] ucc_team.c:472 UCC INFO ===== COLL_SCORE_MAP (team_id 32768, size 4) ===== [1719127273.519779] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Allgather: [1719127273.519779] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 [1719127273.519779] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..4095}:TL_CUDA:10 {4K..inf}:TL_CUDA:10 [1719127273.519779] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..4095}:TL_NCCL:10 {4K..inf}:TL_NCCL:10 [1719127273.519789] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Allgatherv: [1719127273.519789] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.519789] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..16383}:TL_CUDA:10 {16K..1048575}:TL_CUDA:10 {1M..inf}:TL_CUDA:10 [1719127273.519789] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519806] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Allreduce: [1719127273.519806] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..4095}:TL_SHM:10 {4K..8K}:TL_SHM:10 {8193..inf}:TL_UCP:10 [1719127273.519806] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..4095}:TL_NCCL:10 {4K..inf}:TL_NCCL:10 [1719127273.519806] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..4095}:TL_NCCL:10 {4K..inf}:TL_NCCL:10 [1719127273.519829] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Alltoall: [1719127273.519829] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..515}:TL_UCP:10 {516..inf}:TL_UCP:10 
[1719127273.519829] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_CUDA:10 [1719127273.519829] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519841] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Alltoallv: [1719127273.519841] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.519841] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_CUDA:10 [1719127273.519841] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519850] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Barrier: [1719127273.519850] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_SHM:10 [1719127273.519850] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_NCCL:10 [1719127273.519850] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519859] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Bcast: [1719127273.519859] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..8K}:TL_SHM:10 {8193..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10 [1719127273.519859] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..32767}:TL_NCCL:10 {32K..inf}:TL_NCCL:10 [1719127273.519859] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..32767}:TL_NCCL:10 {32K..inf}:TL_NCCL:10 [1719127273.519870] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Fanin: [1719127273.519870] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_SHM:10 [1719127273.519870] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.519870] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.519878] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Fanout: [1719127273.519878] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_SHM:10 [1719127273.519878] 
[node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.519878] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.519886] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Gather: [1719127273.519886] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.519886] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_NCCL:10 [1719127273.519886] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519894] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Gatherv: [1719127273.519894] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.519894] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_NCCL:10 [1719127273.519894] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519902] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Reduce: [1719127273.519902] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..8K}:TL_SHM:10 {8193..inf}:TL_UCP:10 [1719127273.519902] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_NCCL:10 [1719127273.519902] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519911] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Reduce_scatter: [1719127273.519911] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.519911] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_CUDA:10 [1719127273.519911] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519920] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Reduce_scatterv: [1719127273.519920] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.519920] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_CUDA:10 
[1719127273.519920] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.519927] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Scatter: [1719127273.519927] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_NCCL:10 [1719127273.519927] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519933] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Scatterv: [1719127273.519933] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.519933] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_NCCL:10 [1719127273.519933] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_NCCL:10 [1719127273.519939] [node3:116663:0] ucc_team.c:474 UCC INFO ================================================ [2024-06-23 12:51:13][cal][116664][Api][cal_comm_get_rank] comm=[node3 [0]:1] rank=0x4106790 [2024-06-23 12:51:13][cal][116664][Api][cal_comm_get_rank] comm=[node3 [0]:1] rank=0x4103cc0 [2024-06-23 12:51:13][cal][116664][Api][cal_comm_get_rank] comm=[node3 [0]:1] rank=0x7ffe3ab8d070 [2024-06-23 12:51:13][cal][116664][Api][cal_comm_get_tls] comm=[node3 [0]:1] tls=0x7ffe3ab8ce60 [2024-06-23 12:51:13][cal][116664][Api][cal_comm_get_rank] comm=[node3 [0]:1] rank=0x7ffe3ab8cd34 [2024-06-23 12:51:13][cal][116664][Api][cal_comm_split] comm=[node3 [0]:1] color=1 key=1 new_comm=0x7ffe3ab8d120 [2024-06-23 12:51:13][cal][116664][Trace][cal_comm_split] UCC allgather in-place [1719127273.553674] [node3:116664:0] ec_cpu.c:73 cpu ec DEBUG executor init, eee: 0x4108980 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_rank] comm=[node3 [0]:2] rank=0x333fb80 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_rank] comm=[node3 [0]:2] rank=0x333d0b0 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_rank] comm=[node3 [0]:2] rank=0x7ffd36b9ecc0 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_tls] comm=[node3 [0]:2] 
tls=0x7ffd36b9eab0 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_rank] comm=[node3 [0]:2] rank=0x7ffd36b9e984 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_split] comm=[node3 [0]:2] color=1 key=0 new_comm=0x7ffd36b9ed70 [2024-06-23 12:51:13][cal][116665][Trace][cal_comm_split] UCC allgather in-place [1719127273.554691] [node3:116665:0] ec_cpu.c:73 cpu ec DEBUG executor init, eee: 0x3341d80 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_rank] comm=[node3 [0]:0] rank=0x2970290 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_rank] comm=[node3 [0]:0] rank=0x296d6c0 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_rank] comm=[node3 [0]:0] rank=0x7ffc8cee3610 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_tls] comm=[node3 [0]:0] tls=0x7ffc8cee3400 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_rank] comm=[node3 [0]:0] rank=0x7ffc8cee32d4 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_split] comm=[node3 [0]:0] color=1 key=0 new_comm=0x7ffc8cee36c0 [2024-06-23 12:51:13][cal][116663][Trace][cal_comm_split] UCC allgather in-place [1719127273.557674] [node3:116663:0] ec_cpu.c:73 cpu ec DEBUG executor init, eee: 0x29724c0 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_rank] comm=[node3 [0]:3] rank=0x3159cb0 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_rank] comm=[node3 [0]:3] rank=0x31578c0 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_rank] comm=[node3 [0]:3] rank=0x7ffdf9995290 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_tls] comm=[node3 [0]:3] tls=0x7ffdf9995080 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_rank] comm=[node3 [0]:3] rank=0x7ffdf9994f54 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_split] comm=[node3 [0]:3] color=1 key=1 new_comm=0x7ffdf9995340 [2024-06-23 12:51:13][cal][116666][Trace][cal_comm_split] UCC allgather in-place [1719127273.558843] [node3:116666:0] ec_cpu.c:73 cpu ec DEBUG executor init, eee: 0x315bf40 [1719127273.559540] [node3:116665:0] ec_cpu.c:186 cpu ec DEBUG 
executor finalize, eee: 0x3341d80 [1719127273.559564] [node3:116665:0] ucc_team.c:370 UCC DEBUG team 0x33f1890 rank 1, ctx_rank 2, map_type 3 [1719127273.559572] [node3:116665:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.559575] [node3:116665:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x23feb40 [1719127273.560041] [node3:116663:0] ec_cpu.c:186 cpu ec DEBUG executor finalize, eee: 0x29724c0 [1719127273.560065] [node3:116663:0] ucc_team.c:370 UCC DEBUG team 0x2a21fe0 rank 0, ctx_rank 0, map_type 3 [1719127273.560073] [node3:116663:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.560076] [node3:116663:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x1a2ed70 [1719127273.560041] [node3:116664:0] ec_cpu.c:186 cpu ec DEBUG executor finalize, eee: 0x4108980 [1719127273.560064] [node3:116664:0] ucc_team.c:370 UCC DEBUG team 0x41b85c0 rank 2, ctx_rank 1, map_type 3 [1719127273.560071] [node3:116664:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.560074] [node3:116664:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x31c5980 [1719127273.560044] [node3:116666:0] ec_cpu.c:186 cpu ec DEBUG executor finalize, eee: 0x315bf40 [1719127273.560067] [node3:116666:0] ucc_team.c:370 UCC DEBUG team 0x320ba60 rank 3, ctx_rank 3, map_type 3 [1719127273.560075] [node3:116666:0] ucc_tl.c:294 TL_SELF DEBUG team size 4 is too big, max supported 1 [1719127273.560077] [node3:116666:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x22188b0 [1719127273.560081] [node3:116663:0] tl_shm_team.c:158 TL_SHM DEBUG using perf params: generic [1719127273.561044] [node3:116663:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.561050] [node3:116663:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x2a62230 [1719127273.561052] [node3:116663:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x2a62230 [1719127273.561054] [node3:116663:0] cl_basic_team.c:126 CL_BASIC 
DEBUG failed to create tl self team: (-1) [1719127273.561056] [node3:116663:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.561057] [node3:116663:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl ucp team [1719127273.561059] [node3:116663:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.561061] [node3:116663:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.561062] [node3:116663:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed [1719127273.561045] [node3:116664:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.561051] [node3:116664:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x41f86f0 [1719127273.561054] [node3:116664:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x41f86f0 [1719127273.561056] [node3:116664:0] cl_basic_team.c:126 CL_BASIC DEBUG failed to create tl self team: (-1) [1719127273.561057] [node3:116664:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.561059] [node3:116664:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl ucp team [1719127273.561063] [node3:116664:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.561064] [node3:116664:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.561066] [node3:116664:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed [1719127273.561045] [node3:116665:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.561052] [node3:116665:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x3431ae0 [1719127273.561054] [node3:116665:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x3431ae0 [1719127273.561056] [node3:116665:0] cl_basic_team.c:126 CL_BASIC DEBUG failed to create tl self team: (-1) [1719127273.561058] [node3:116665:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.561059] [node3:116665:0] cl_basic_team.c:122 CL_BASIC DEBUG 
initialized tl ucp team [1719127273.561062] [node3:116665:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.561063] [node3:116665:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.561064] [node3:116665:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed [1719127273.561046] [node3:116666:0] tl_ucp_team.c:83 UCC DEBUG section not found [1719127273.561052] [node3:116666:0] tl_ucp_team.c:101 TL_UCP DEBUG posted tl team: 0x324bcb0 [1719127273.561054] [node3:116666:0] tl_ucp_team.c:200 TL_UCP DEBUG initialized tl team: 0x324bcb0 [1719127273.561056] [node3:116666:0] cl_basic_team.c:126 CL_BASIC DEBUG failed to create tl self team: (-1) [1719127273.561058] [node3:116666:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl shm team [1719127273.561059] [node3:116666:0] cl_basic_team.c:122 CL_BASIC DEBUG initialized tl ucp team [1719127273.561061] [node3:116666:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host [1719127273.561063] [node3:116666:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda [1719127273.561064] [node3:116666:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed [1719127273.561106] [node3:116663:0] ucc_team.c:472 UCC INFO ===== COLL_SCORE_MAP (team_id 32769, size 4) ===== [1719127273.561115] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Allgather: [1719127273.561115] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 [1719127273.561115] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 [1719127273.561115] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 [1719127273.561124] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Allgatherv: [1719127273.561124] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.561124] 
[node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561124] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561134] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Allreduce: [1719127273.561134] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..4095}:TL_SHM:10 {4K..8K}:TL_SHM:10 {8193..inf}:TL_UCP:10 [1719127273.561134] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 [1719127273.561134] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 [1719127273.561144] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Alltoall: [1719127273.561144] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..515}:TL_UCP:10 {516..inf}:TL_UCP:10 [1719127273.561144] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561144] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_rank] comm=[node3 [0]:3] rank=0x7ffdf9995290 [2024-06-23 12:51:13][cal][116666][Api][cal_stream_sync] comm=[node3 [0]:3] stream=0x2a86320 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_rank] comm=[node3 [0]:3] rank=0x7ffdf99943d0 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_rank] comm=[node3 [0]:2] rank=0x7ffd36b9ecc0 [2024-06-23 12:51:13][cal][116665][Api][cal_stream_sync] comm=[node3 [0]:2] stream=0x2a90c20 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_rank] comm=[node3 [0]:2] rank=0x7ffd36b9de00 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_get_rank] comm=[node3 [0]:2] rank=0x7ffd36b9da24 [2024-06-23 12:51:13][cal][116665][Api][cal_comm_split] comm=[node3 [0]:2] color=1 key=1 new_comm=0x7ffd36b9de58 [1719127273.561152] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Alltoallv: [1719127273.561152] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: 
{0..inf}:TL_UCP:10 [1719127273.561152] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561152] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561161] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Barrier: [1719127273.561161] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_SHM:10 [1719127273.561161] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561161] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561170] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Bcast: [1719127273.561170] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..8K}:TL_SHM:10 {8193..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10 [1719127273.561170] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10 [1719127273.561170] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..32767}:TL_UCP:10 {32K..inf}:TL_UCP:10 [1719127273.561178] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Fanin: [1719127273.561178] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_SHM:10 [1719127273.561178] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561178] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561186] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Fanout: [1719127273.561186] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_SHM:10 [1719127273.561186] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561186] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561195] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Gather: [1719127273.561195] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.561195] 
[node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561195] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561202] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Gatherv: [1719127273.561202] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.561202] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561202] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_get_rank] comm=[node3 [0]:3] rank=0x7ffdf9993ff4 [2024-06-23 12:51:13][cal][116666][Api][cal_comm_split] comm=[node3 [0]:3] color=1 key=0 new_comm=0x7ffdf9994428 [2024-06-23 12:51:13][cal][116666][Trace][cal_comm_split] UCC allgather in-place [1719127273.561186] [node3:116666:0] ec_cpu.c:73 cpu ec DEBUG executor init, eee: 0x315bf40 [2024-06-23 12:51:13][cal][116665][Trace][cal_comm_split] UCC allgather in-place [1719127273.561187] [node3:116665:0] ec_cpu.c:73 cpu ec DEBUG executor init, eee: 0x3341d80 [1719127273.561210] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Reduce: [1719127273.561210] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..8K}:TL_SHM:10 {8193..inf}:TL_UCP:10 [1719127273.561210] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561210] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561218] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Reduce_scatter: [1719127273.561218] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.561218] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561218] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561226] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Reduce_scatterv: [1719127273.561226] 
[node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.561226] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561226] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561234] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Scatterv: [1719127273.561234] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Host: {0..inf}:TL_UCP:10 [1719127273.561234] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO Cuda: {0..inf}:TL_UCP:10 [1719127273.561234] [node3:116663:0] ucc_coll_score_map.c:201 UCC INFO CudaManaged: {0..inf}:TL_UCP:10 [1719127273.561241] [node3:116663:0] ucc_team.c:474 UCC INFO ================================================ [2024-06-23 12:51:13][cal][116663][Api][cal_send] comm=[node3 [0->0(1)]:0] count=60 type=CUDA_R_64F, data=0x21b98a0 dst_rank=1 tag=0 stream=0x229c520 [2024-06-23 12:51:13][cal][116663][Trace][cal_send] ucc_transport::send() 0 -> 1, 480 bytes, tag: 0 [1719127273.562446] [node3:116663:0] ucc_coll_score_map.c:142 UCC DEBUG coll Bcast is not supported for TL_SHM, fallback TL_UCP [2024-06-23 12:51:13][cal][116664][Api][cal_recv] comm=[node3 [0->0(1)]:2] count=60 type=CUDA_R_64F data=0x394fd90 src_rank=0 tag=0 stream=0x3987030 [2024-06-23 12:51:13][cal][116664][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 480 bytes, tag: 0 [1719127273.562443] [node3:116664:0] ucc_coll_score_map.c:142 UCC DEBUG coll Bcast is not supported for TL_SHM, fallback TL_UCP [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_rank] comm=[node3 [0]:0] rank=0x7ffc8cee3610 [2024-06-23 12:51:13][cal][116663][Api][cal_send] comm=[node3 [0->0(1)]:0] count=12 type=CUDA_R_64F, data=0x2a63750 dst_rank=1 tag=0 stream=0x229c520 [2024-06-23 12:51:13][cal][116663][Trace][cal_send] ucc_transport::send() 0 -> 1, 96 bytes, tag: 0 [1719127273.562516] [node3:116663:0] ucc_coll_score_map.c:142 UCC DEBUG coll Bcast is not supported for TL_SHM, fallback TL_UCP 
[2024-06-23 12:51:13][cal][116663][Api][cal_stream_sync] comm=[node3 [0]:0] stream=0x229c520 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_rank] comm=[node3 [0]:0] rank=0x7ffc8cee2750 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_get_rank] comm=[node3 [0]:0] rank=0x7ffc8cee2374 [2024-06-23 12:51:13][cal][116663][Api][cal_comm_split] comm=[node3 [0]:0] color=1 key=1 new_comm=0x7ffc8cee27a8 [2024-06-23 12:51:13][cal][116663][Trace][cal_comm_split] UCC allgather in-place [1719127273.562561] [node3:116663:0] ec_cpu.c:73 cpu ec DEBUG executor init, eee: 0x29724c0

Any help in this regard would be appreciated.

Many thanks, Pushkar

mrogowski commented 2 months ago

@ppandit95 sorry for the delay in getting back to you. I still do not see anything obviously wrong in your logs. Messages like `UCC DEBUG coll Bcast is not supported for TL_SHM, fallback TL_UCP` are not critical. I do not think the issue is in cuSOLVERMp because we routinely test a similar configuration (4xA100). It could be an issue in the software stack (perhaps UCC) or in the system configuration.

  1. Do you see a hang, or does the application terminate?
  2. Do you see the same issue with every example when run with 4 processes?
  3. Can you try running inside a Docker container? For example, one of the images at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc/tags. If you have the same issue inside a container and it is something we can reproduce, we can solve this quickly.

ppandit95 commented 2 months ago

@mrogowski, first of all, thanks for looking into the output log. The program hangs as soon as it is run with n=3 or n=4, and this happens with every example at n=3 and n=4. As suggested, I will also try running inside a Docker container.

mrogowski commented 2 months ago

Oh, I just realized something. Please look at the source code of our examples - those are hardcoded to run with 2 processes! That's most likely why it doesn't work for you with 3 and 4 processes. You will have to modify the source code and recompile it.

See https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSOLVERMp/mp_potrf_potrs.c#L145 and later https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSOLVERMp/mp_potrf_potrs.c#L173.

p and q determine the number of row and column devices, so p*q has to be equal to the number of MPI ranks.
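The constraint above can be checked before any communicator or grid is created. The following is only a minimal sketch, not code from the samples: `choose_grid` and `grid_matches_ranks` are hypothetical helper names, and picking the most square factorization is just one reasonable assumption about how to derive p and q from the rank count.

```c
#include <stddef.h>

/* Hypothetical helper (not part of CUDALibrarySamples): choose a p x q
 * process grid with p * q == nranks, as close to square as possible
 * and with p <= q. Returns 0 on success, -1 on invalid input. */
int choose_grid(int nranks, int *p, int *q)
{
    if (nranks <= 0 || p == NULL || q == NULL) {
        return -1;
    }

    int best = 1;
    /* Keep the largest divisor d with d * d <= nranks. */
    for (int d = 1; d <= nranks / d; ++d) {
        if (nranks % d == 0) {
            best = d;
        }
    }

    *p = best;
    *q = nranks / best;
    return 0;
}

/* Hypothetical check: returns 1 if the grid covers exactly nranks
 * processes, 0 otherwise. A grid sized for 2 ranks fails this check
 * when the job is launched with 3 or 4 ranks. */
int grid_matches_ranks(int p, int q, int nranks)
{
    return p > 0 && q > 0 && p * q == nranks;
}
```

In an MPI program, `nranks` would come from `MPI_Comm_size`, and failing `grid_matches_ranks` early with an error message is easier to diagnose than a hang later in a collective call. Note that a prime rank count such as 3 can only give a 1 x 3 (or 3 x 1) grid.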

ppandit95 commented 2 months ago

Oh, great! @mrogowski, thanks a lot for that. The code now runs well with 4 GPUs too.