Dear all,
Why does UCC try to allocate 281024393381230 bytes? That is an astronomical figure.
What I'm trying to solve is a fairly small problem, just for the purpose of testing cuSOLVERMp.
I'm using the nvcr.io/nvidia/nvhpc:24.5-devel-cuda12.4-ubuntu22.04 Docker image,
with CAL_LOG_LEVEL=6 UCC_LOG_LEVEL=DEBUG UCX_LOG_LEVEL=debug.
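To put the figure in perspective, here is a quick sanity check of the byte count from the UCC error message. It is only arithmetic on the number reported in the log; nothing here touches UCC or CAL. The value sits just under 2^48 bytes, i.e. roughly 256 TiB, which no test problem of this size could need.

```python
# Byte count copied from the "failed to allocate ... bytes for addr storage" error.
requested = 281024393381230

# Convert to TiB (1 TiB = 2**40 bytes) to see the scale.
tib = requested / 2**40

print(f"{requested} bytes = {tib:.1f} TiB")
print(f"hex: {requested:#x}")
print(f"bits needed: {requested.bit_length()}")
```

A size that needs 48 bits looks much more like a stray or uninitialized value than a real buffer size, though that is a guess and not something the log confirms.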
Hello @goodchong, and sorry for the late reply. Could you provide steps to reproduce this issue? Have you changed the allgather function that is passed to cal?
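A possible reason the allgather function matters, sketched as a toy model. The assumption (not confirmed anywhere in this thread) is that the address storage is sized from lengths each rank receives through the user-supplied allgather, roughly "largest length times number of ranks". Under that assumption, a callback that reports completion before the peers' data has arrived leaves stale bytes in the receive buffer, and those bytes get read as a length. All names and values below are hypothetical; this is plain Python, not the UCC or CAL API.

```python
import struct

SLOT = 8  # assumed layout: each rank contributes one 64-bit length

def gather_lengths(own_lengths, recv_init, complete):
    """Return each rank's view of the receive buffer after the exchange.

    own_lengths: address length contributed by each rank
    recv_init:   8 bytes that every slot held before the call
    complete:    True if the callback really delivers every peer's data
    """
    nranks = len(own_lengths)
    views = []
    for rank in range(nranks):
        buf = bytearray(recv_init * nranks)
        for peer in range(nranks):
            # A faulty callback only ever fills in the caller's own slot.
            if complete or peer == rank:
                struct.pack_into("<Q", buf, peer * SLOT, own_lengths[peer])
        views.append(struct.unpack(f"<{nranks}Q", buf))
    return views

def storage_bytes(view):
    # Assumed sizing rule: largest advertised length times number of ranks.
    return max(view) * len(view)

lengths = [312, 312]                       # made-up address lengths
stale = struct.pack("<Q", 0x7F3A12345678)  # made-up leftover memory contents

good = gather_lengths(lengths, stale, complete=True)
bad = gather_lengths(lengths, stale, complete=False)

print("correct allgather:", [storage_bytes(v) for v in good])
print("faulty allgather: ", [storage_bytes(v) for v in bad])
```

With a correct exchange every rank computes a few hundred bytes; with the faulty one the computed size jumps to hundreds of terabytes, the same order of magnitude as in the error above. If the callback was modified, it is worth checking that the byte count, the communicator, and the completion test all match what the caller expects.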
Log output with CAL_LOG_LEVEL=6 UCC_LOG_LEVEL=DEBUG UCX_LOG_LEVEL=debug:
[1718971545.272051] [9d1d1b67da66:2472 :0] topo.c:861 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1718971545.272057] [9d1d1b67da66:2472 :0] topo.c:818 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1718971545.272137] [9d1d1b67da66:2472 :0] topo.c:861 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1718971545.272143] [9d1d1b67da66:2472 :0] topo.c:818 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1718971545.272406] [9d1d1b67da66:2472 :0] async.c:232 UCX DEBUG added async handler 0x55e507df2f70 [id=85 ref 1] uct_rdmacm_cm_event_handler() to hash
[1718971545.272420] [9d1d1b67da66:2472 :0] async.c:494 UCX DEBUG listening to async event fd 85 events 0x1 mode thread_spinlock
[1718971545.272426] [9d1d1b67da66:2472 :0] rdmacm_cm.c:982 UCX DEBUG created rdmacm_cm 0x55e507df35f0 with event_channel 0x55e507df8d70 (fd=85)
[1718971545.272439] [9d1d1b67da66:2472 :0] tcp_sockcm.c:225 UCX DEBUG created tcp_sockcm 0x55e507df3b60
[1718971545.272446] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 272
[1718971545.272453] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 104
[1718971545.272457] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1718971545.272464] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1718971545.272470] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1718971545.272477] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 65625
[1718971545.272483] [9d1d1b67da66:2472 :0] mpool_set.c:130 UCX DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 65536, mpools num 3
[1718971545.272509] [9d1d1b67da66:2472 :0] mpool.c:138 UCX DEBUG mpool tl_ucp_req_mp: align 64, maxelems 4294967295, elemsize 600
[1718971545.272516] [9d1d1b67da66:2472 :0] tl_ucp_context.c:277 TL_UCP DEBUG initialized tl context: 0x55e509764740
[1718971545.272523] [9d1d1b67da66:2472 :0] cl_basic_context.c:39 CL_BASIC DEBUG TL cuda context is not available, skipping
[1718971545.272529] [9d1d1b67da66:2472 :0] cl_basic_context.c:39 CL_BASIC DEBUG TL nccl context is not available, skipping
[1718971545.272537] [9d1d1b67da66:2472 :0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x55e507df8dd0
[1718971545.272737] [9d1d1b67da66:2472 :0] ucc_context.c:518 UCC ERROR failed to allocate 281024393381230 bytes for addr storage
[1718971545.272744] [9d1d1b67da66:2472 :0] ucc_context.c:726 UCC ERROR failed to exchange addresses during context creation
[1718971545.272750] [9d1d1b67da66:2472 :0] cl_basic_context.c:57 CL_BASIC DEBUG finalizing cl context: 0x55e507df8dd0
[2024-06-21 12:05:45][cal][2472][Error][cal_comm_create] Error #-4 in /home/jenkins/agent/workspace/libcal/helpers/master/L0_MergeRequest/build/src/ucc_context.h:123
[1718971545.272920] [9d1d1b67da66:2472 :0] mpool.c:194 UCX DEBUG mpool stub_tasks destroyed
[1718971545.272929] [9d1d1b67da66:2472 :0] tl_cuda_lib.c:41 TL_CUDA DEBUG finalizing lib object: 0x55e507703630
[1718971545.272935] [9d1d1b67da66:2472 :0] tl_mlx5_lib.c:25 TL_MLX5 DEBUG finalizing lib object: 0x55e5021c68e0
[1718971545.272942] [9d1d1b67da66:2472 :0] tl_nccl_lib.c:22 TL_NCCL DEBUG finalizing lib object: 0x55e5076f8fa0
[1718971545.272949] [9d1d1b67da66:2472 :0] tl_self_lib.c:26 TL_SELF DEBUG finalizing lib object: 0x55e507704480
[1718971545.272956] [9d1d1b67da66:2472 :0] tl_sharp_lib.c:26 TL_SHARP DEBUG finalizing lib object: 0x55e50976cf00
[1718971545.272964] [9d1d1b67da66:2472 :0] tl_shm_lib.c:30 TL_SHM DEBUG finalizing lib object: 0x55e5076f8590
[1718971545.272972] [9d1d1b67da66:2472 :0] tl_ucp_lib.c:83 TL_UCP DEBUG finalizing lib object: 0x55e507706bf0
[1718971545.272978] [9d1d1b67da66:2472 :0] cl_basic_lib.c:26 CL_BASIC DEBUG finalizing lib object: 0x55e507707340
[2024-06-21 12:05:45][cal][2472][Error][cal_comm_create] CAL Error #6 in /home/jenkins/agent/workspace/libcal/helpers/master/L0_MergeRequest/build/src/ucc_context.h:123, ucc_context_create