NVIDIA / CUDALibrarySamples

CUDA Library Samples

cusolverMP failed when cal_comm_create with "UCC ERROR failed to allocate 281024393381230 bytes for addr storage" #191

Open goodchong opened 2 months ago

goodchong commented 2 months ago

Dear all, why does UCC try to allocate 281024393381230 bytes? That is an astronomical figure. What I'm trying to solve is a fairly small problem, just for testing cuSOLVERMp. I'm using the nvcr.io/nvidia/nvhpc:24.5-devel-cuda12.4-ubuntu22.04 Docker image, with CAL_LOG_LEVEL=6 UCC_LOG_LEVEL=DEBUG UCX_LOG_LEVEL=debug:

[1718971545.272051] [9d1d1b67da66:2472 :0] topo.c:861 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1718971545.272057] [9d1d1b67da66:2472 :0] topo.c:818 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1718971545.272137] [9d1d1b67da66:2472 :0] topo.c:861 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1718971545.272143] [9d1d1b67da66:2472 :0] topo.c:818 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1718971545.272406] [9d1d1b67da66:2472 :0] async.c:232 UCX DEBUG added async handler 0x55e507df2f70 [id=85 ref 1] uct_rdmacm_cm_event_handler() to hash
[1718971545.272420] [9d1d1b67da66:2472 :0] async.c:494 UCX DEBUG listening to async event fd 85 events 0x1 mode thread_spinlock
[1718971545.272426] [9d1d1b67da66:2472 :0] rdmacm_cm.c:982 UCX DEBUG created rdmacm_cm 0x55e507df35f0 with event_channel 0x55e507df8d70 (fd=85)
[1718971545.272439] [9d1d1b67da66:2472 :0] tcp_sockcm.c:225 UCX DEBUG created tcp_sockcm 0x55e507df3b60
[1718971545.272446] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 272
[1718971545.272453] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 104
[1718971545.272457] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1718971545.272464] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1718971545.272470] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1718971545.272477] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 65625
[1718971545.272483] [9d1d1b67da66:2472 :0] mpool_set.c:130 UCX DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 65536, mpools num 3
[1718971545.272509] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool tl_ucp_req_mp: align 64, maxelems 4294967295, elemsize 600
[1718971545.272516] [9d1d1b67da66:2472 :0] tl_ucp_context.c:277 TL_UCP DEBUG initialized tl context: 0x55e509764740
[1718971545.272523] [9d1d1b67da66:2472 :0] cl_basic_context.c:39 CL_BASIC DEBUG TL cuda context is not available, skipping
[1718971545.272529] [9d1d1b67da66:2472 :0] cl_basic_context.c:39 CL_BASIC DEBUG TL nccl context is not available, skipping
[1718971545.272537] [9d1d1b67da66:2472 :0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x55e507df8dd0
[1718971545.272737] [9d1d1b67da66:2472 :0] ucc_context.c:518 UCC ERROR failed to allocate 281024393381230 bytes for addr storage
[1718971545.272744] [9d1d1b67da66:2472 :0] ucc_context.c:726 UCC ERROR failed to exchange addresses during context creation
[1718971545.272750] [9d1d1b67da66:2472 :0] cl_basic_context.c:57 CL_BASIC DEBUG finalizing cl context: 0x55e507df8dd0
[2024-06-21 12:05:45][cal][2472][Error][cal_comm_create] Error #-4 in /home/jenkins/agent/workspace/libcal/helpers/master/L0_MergeRequest/build/src/ucc_context.h:123

[1718971545.272920] [9d1d1b67da66:2472 :0] mpool.c:194 UCX DEBUG mpool stub_tasks destroyed
[1718971545.272929] [9d1d1b67da66:2472 :0] tl_cuda_lib.c:41 TL_CUDA DEBUG finalizing lib object: 0x55e507703630
[1718971545.272935] [9d1d1b67da66:2472 :0] tl_mlx5_lib.c:25 TL_MLX5 DEBUG finalizing lib object: 0x55e5021c68e0
[1718971545.272942] [9d1d1b67da66:2472 :0] tl_nccl_lib.c:22 TL_NCCL DEBUG finalizing lib object: 0x55e5076f8fa0
[1718971545.272949] [9d1d1b67da66:2472 :0] tl_self_lib.c:26 TL_SELF DEBUG finalizing lib object: 0x55e507704480
[1718971545.272956] [9d1d1b67da66:2472 :0] tl_sharp_lib.c:26 TL_SHARP DEBUG finalizing lib object: 0x55e50976cf00
[1718971545.272964] [9d1d1b67da66:2472 :0] tl_shm_lib.c:30 TL_SHM DEBUG finalizing lib object: 0x55e5076f8590
[1718971545.272972] [9d1d1b67da66:2472 :0] tl_ucp_lib.c:83 TL_UCP DEBUG finalizing lib object: 0x55e507706bf0
[1718971545.272978] [9d1d1b67da66:2472 :0] cl_basic_lib.c:26 CL_BASIC DEBUG finalizing lib object: 0x55e507707340
[2024-06-21 12:05:45][cal][2472][Error][cal_comm_create] CAL Error #6 in /home/jenkins/agent/workspace/libcal/helpers/master/L0_MergeRequest/build/src/ucc_context.h:123, ucc_context_create
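
For scale, the size in the UCC ERROR line can be put in perspective with a little arithmetic. The sketch below uses only the number copied from the log; the variable names are illustrative. A request this close to 2**48 bytes is unlikely to come from the problem size and looks more like a length computed from uninitialized or mismatched data, though that is an inference rather than something the log confirms.

```python
# Size copied verbatim from the "failed to allocate ... bytes" log line.
requested = 281024393381230

# Express it in TiB (2**40 bytes) to compare against realistic host memory.
tib = requested / 2**40
print(f"{tib:.1f} TiB requested")

# Show the raw bit pattern; garbage lengths often look like this in hex.
print(hex(requested))

# The value sits just below 2**48.
print(requested < 2**48)
```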

mrogowski commented 2 months ago

Hello @goodchong, and sorry for the late reply. Could you provide steps to reproduce this issue? Have you changed the allgather function that is passed to cal?
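
The allgather question is relevant because the error is raised while addresses are exchanged during context creation, and that exchange relies on the callback supplied by the application. The sketch below is a pure-Python simulation, not the CAL or UCC API: the two-step exchange (gather lengths, then size the storage as the largest length times the number of ranks), the function names, and the sample lengths are all assumptions made for illustration. It shows how a callback that leaves the receive buffer unfilled could produce an absurd allocation size.

```python
import struct

def allgather(send_bufs):
    """Reference behaviour: every rank receives all contributions in rank order."""
    assert len({len(b) for b in send_bufs}) == 1, "contributions must be equal-sized"
    gathered = b"".join(send_bufs)
    return [gathered for _ in send_bufs]

def broken_allgather(send_bufs):
    """Hypothetical faulty callback: the receive buffer holds garbage, not peer data."""
    size = len(send_bufs[0])
    garbage = b"\xff" * (size * len(send_bufs))
    return [garbage for _ in send_bufs]

def storage_size(gather, addr_lens):
    """Assumed sizing rule: gather 8-byte lengths, then allocate max_len * nranks."""
    send = [struct.pack("<Q", n) for n in addr_lens]
    recv = gather(send)[0]  # view from rank 0
    lens = struct.unpack(f"<{len(addr_lens)}Q", recv)
    return max(lens) * len(addr_lens)

# Made-up per-rank address lengths for a 4-rank job.
addr_lens = [152, 152, 160, 152]
print(storage_size(allgather, addr_lens))         # small, sensible size
print(storage_size(broken_allgather, addr_lens))  # nonsensical size
```

If the callback in the reproducer differs from the one shipped with the samples, comparing the gathered buffer on each rank against what every rank sent is a quick way to rule this out.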