Open wwj-2017-1117 opened 6 months ago
The above messages are not indicative of a failure. If you are experiencing a problem, please post a more detailed description, including a complete NCCL_DEBUG=INFO output.
I want to turn on IB SHARP. This is the command I use to run the program:
mpirun -np 2 --allow-run-as-root --bind-to socket -x LD_LIBRARY_PATH=/nfs/ -x NCCL_UCX_TLS=rc_x,cuda_copy -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n -x NCCL_COLLNET_ENABLE=1 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x NCCL_SOCKET_IFNAME=enp86s0f0 -x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 --host 10.21.0.1,10.21.0.2 /root/nccl-tests/build/all_reduce_perf -b 1G -e 4G -f 2 -g 8
This is the output of the preceding command:
#
# Using devices
# Rank 0 Group 0 Pid 919753 on 10-21-0-1 device 0 [0x18] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 919753 on 10-21-0-1 device 1 [0x2a] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 919753 on 10-21-0-1 device 2 [0x3a] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 919753 on 10-21-0-1 device 3 [0x5d] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 919753 on 10-21-0-1 device 4 [0x9a] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 919753 on 10-21-0-1 device 5 [0xab] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 919753 on 10-21-0-1 device 6 [0xba] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 919753 on 10-21-0-1 device 7 [0xdb] NVIDIA H100 80GB HBM3
# Rank 8 Group 0 Pid 1762909 on 10-21-0-2 device 0 [0x18] NVIDIA H100 80GB HBM3
# Rank 9 Group 0 Pid 1762909 on 10-21-0-2 device 1 [0x2a] NVIDIA H100 80GB HBM3
# Rank 10 Group 0 Pid 1762909 on 10-21-0-2 device 2 [0x3a] NVIDIA H100 80GB HBM3
# Rank 11 Group 0 Pid 1762909 on 10-21-0-2 device 3 [0x5d] NVIDIA H100 80GB HBM3
# Rank 12 Group 0 Pid 1762909 on 10-21-0-2 device 4 [0x9a] NVIDIA H100 80GB HBM3
# Rank 13 Group 0 Pid 1762909 on 10-21-0-2 device 5 [0xab] NVIDIA H100 80GB HBM3
# Rank 14 Group 0 Pid 1762909 on 10-21-0-2 device 6 [0xba] NVIDIA H100 80GB HBM3
# Rank 15 Group 0 Pid 1762909 on 10-21-0-2 device 7 [0xdb] NVIDIA H100 80GB HBM3
10-21-0-1:919753:919753 [0] NCCL INFO NCCL_SOCKET_IFNAME set to enp86s0f0
10-21-0-1:919753:919753 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
10-21-0-1:919753:919753 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
NCCL version 2.20.3+cuda12.3
10-21-0-2:1762909:1762909 [0] NCCL INFO NCCL_SOCKET_IFNAME set to enp86s0f0
10-21-0-2:1762909:1762909 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
10-21-0-2:1762909:1762909 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
10-21-0-1:919753:919768 [0] NCCL INFO NCCL_IB_HCA set to mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17
10-21-0-2:1762909:1762925 [1] NCCL INFO NCCL_IB_HCA set to mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17
10-21-0-1:919753:919768 [0] NCCL INFO NET/IB : Using [0]mlx5_10:1/IB [1]mlx5_11:1/IB [2]mlx5_12:1/IB [3]mlx5_13:1/IB [4]mlx5_14:1/IB [5]mlx5_15:1/IB [6]mlx5_16:1/IB [7]mlx5_17:1/IB [RO]; OOB enp86s0f0:10.21.0.1<0>
10-21-0-2:1762909:1762925 [1] NCCL INFO NET/IB : Using [0]mlx5_10:1/IB [1]mlx5_11:1/IB [2]mlx5_12:1/IB [3]mlx5_13:1/IB [4]mlx5_14:1/IB [5]mlx5_15:1/IB [6]mlx5_16:1/IB [7]mlx5_17:1/IB [RO]; OOB enp86s0f0:10.21.0.2<0>
It works, but I think this output shows that SHARP failed to initialize:
10-21-0-1:919753:919753 [0] NCCL INFO NCCL_SOCKET_IFNAME set to enp86s0f0
10-21-0-1:919753:919753 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
10-21-0-1:919753:919753 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
NCCL version 2.20.3+cuda12.3
10-21-0-2:1762909:1762909 [0] NCCL INFO NCCL_SOCKET_IFNAME set to enp86s0f0
10-21-0-2:1762909:1762909 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
10-21-0-2:1762909:1762909 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
To use SHARP you need a SHARP-enabled NCCL plugin such as nccl-rdma-sharp-plugin. Normally our users pick that up from the HPC-X packages installed inside the containers, but you can also download HPC-X directly.
Obviously, you also need a SHARP-capable InfiniBand network, and the fabric needs to have been configured to support SHARP as well.
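For reference, a minimal sketch of how the SHARP plugin is usually picked up from an HPC-X installation; the install prefix, the HPCX_NCCL_RDMA_SHARP_PLUGIN_DIR variable, and the library name below are assumptions about a typical HPC-X layout, not something taken from this issue:

# Sketch: make the HPC-X nccl-rdma-sharp plugin visible to NCCL (paths assumed).
source /opt/hpcx/hpcx-init.sh        # assumed install prefix
hpcx_load                            # exports the HPCX_* directory variables
export LD_LIBRARY_PATH=$HPCX_NCCL_RDMA_SHARP_PLUGIN_DIR/lib:$LD_LIBRARY_PATH
export NCCL_COLLNET_ENABLE=1         # ask NCCL to use CollNet (SHARP)
export NCCL_DEBUG=INFO               # NCCL then logs which plugin it loads
# Check that the plugin actually exports the CollNet entry points NCCL probes for:
nm -D $HPCX_NCCL_RDMA_SHARP_PLUGIN_DIR/lib/libnccl-net.so | grep -i collnet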
Yes, I have run all_reduce_perf with SHARP using NCCL 2.20.3 and HPC-X. But I ran into another problem: when I run reduce_scatter, it keeps printing cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0x7f9c8d830000, SYNC_MEMOPS) error: operation not supported, and the warning never stops. When I run all_reduce_perf it prints the same cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute SYNC_MEMOPS warning, but it stops after a few occurrences.
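If the repeated warning itself is the main annoyance, one hedged workaround is to raise the UCX log threshold so warning-level messages are suppressed; this only hides the message and does not change how those buffers are registered:

# Sketch: suppress UCX warning-level output for the run (workaround only).
export UCX_LOG_LEVEL=error
# When launching through mpirun, forward it to the remote ranks as well,
# e.g. add "-x UCX_LOG_LEVEL=error" next to the other -x options.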
The following is the output of sharp.arm.log:
[May 22 02:20:08 752213][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164047242882094 reservation_key: job_id_sharp: 211 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:08 752307][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164047242882094 reservation_key: job_id_sharp: 211, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:08 758415][JOBS ][336070][info ] - job_id_external: 1516164047242882094 reservation_key: job_id_sharp: 211 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:08 758750][JOBS ][336070][info ] - job_id_external: 1516164047242882094 reservation_key: job_id_sharp: 211 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c0300588a88 on tree_id: 0
[May 22 02:20:08 758789][JOBS ][336070][info ] - job_id_external: 1516164047242882094 reservation_key: job_id_sharp: 211 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c0300588a88 on tree_id: 512
[May 22 02:20:08 763825][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516164047242882094 reservation_key: job_id_sharp: 211
[May 22 02:20:09 695327][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164018287178952 reservation_key: job_id_sharp: 212 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:09 695400][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164018287178952 reservation_key: job_id_sharp: 212, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:09 701467][JOBS ][336070][info ] - job_id_external: 1516164018287178952 reservation_key: job_id_sharp: 212 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:09 701781][JOBS ][336070][info ] - job_id_external: 1516164018287178952 reservation_key: job_id_sharp: 212 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c03005887c8 on tree_id: 0
[May 22 02:20:09 701820][JOBS ][336070][info ] - job_id_external: 1516164018287178952 reservation_key: job_id_sharp: 212 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c03005887c8 on tree_id: 512
[May 22 02:20:09 706530][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516164018287178952 reservation_key: job_id_sharp: 212
[May 22 02:20:10 701280][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164018658590135 reservation_key: job_id_sharp: 213 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:10 701355][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164018658590135 reservation_key: job_id_sharp: 213, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:10 707978][JOBS ][336070][info ] - job_id_external: 1516164018658590135 reservation_key: job_id_sharp: 213 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:10 708316][JOBS ][336070][info ] - job_id_external: 1516164018658590135 reservation_key: job_id_sharp: 213 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c030058ac08 on tree_id: 0
[May 22 02:20:10 708356][JOBS ][336070][info ] - job_id_external: 1516164018658590135 reservation_key: job_id_sharp: 213 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c030058ac08 on tree_id: 512
[May 22 02:20:10 713169][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516164018658590135 reservation_key: job_id_sharp: 213
[May 22 02:20:11 545545][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164009027026226 reservation_key: job_id_sharp: 214 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:11 545621][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164009027026226 reservation_key: job_id_sharp: 214, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:11 551709][JOBS ][336070][info ] - job_id_external: 1516164009027026226 reservation_key: job_id_sharp: 214 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:11 552023][JOBS ][336070][info ] - job_id_external: 1516164009027026226 reservation_key: job_id_sharp: 214 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c030058a748 on tree_id: 0
[May 22 02:20:11 552061][JOBS ][336070][info ] - job_id_external: 1516164009027026226 reservation_key: job_id_sharp: 214 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c030058a748 on tree_id: 512
[May 22 02:20:11 556906][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516164009027026226 reservation_key: job_id_sharp: 214
[May 22 02:20:12 307114][GENERAL][336070][info ] - Begin Job: job_id_external: 1516163991110896438 reservation_key: job_id_sharp: 215 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:12 307187][GENERAL][336070][info ] - Begin Job: job_id_external: 1516163991110896438 reservation_key: job_id_sharp: 215, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:12 313277][JOBS ][336070][info ] - job_id_external: 1516163991110896438 reservation_key: job_id_sharp: 215 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:12 313584][JOBS ][336070][info ] - job_id_external: 1516163991110896438 reservation_key: job_id_sharp: 215 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c0300589348 on tree_id: 0
[May 22 02:20:12 313623][JOBS ][336070][info ] - job_id_external: 1516163991110896438 reservation_key: job_id_sharp: 215 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c0300589348 on tree_id: 512
[May 22 02:20:12 318430][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516163991110896438 reservation_key: job_id_sharp: 215
[May 22 02:20:13 064643][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164010349430168 reservation_key: job_id_sharp: 216 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:13 064722][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164010349430168 reservation_key: job_id_sharp: 216, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:13 070833][JOBS ][336070][info ] - job_id_external: 1516164010349430168 reservation_key: job_id_sharp: 216 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:13 071167][JOBS ][336070][info ] - job_id_external: 1516164010349430168 reservation_key: job_id_sharp: 216 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c03005892c8 on tree_id: 0
[May 22 02:20:13 071206][JOBS ][336070][info ] - job_id_external: 1516164010349430168 reservation_key: job_id_sharp: 216 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c03005892c8 on tree_id: 512
[May 22 02:20:13 075601][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516164010349430168 reservation_key: job_id_sharp: 216
[May 22 02:20:13 983396][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164009162948694 reservation_key: job_id_sharp: 217 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:13 983484][GENERAL][336070][info ] - Begin Job: job_id_external: 1516164009162948694 reservation_key: job_id_sharp: 217, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:13 990265][JOBS ][336070][info ] - job_id_external: 1516164009162948694 reservation_key: job_id_sharp: 217 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:13 990581][JOBS ][336070][info ] - job_id_external: 1516164009162948694 reservation_key: job_id_sharp: 217 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c03005886c8 on tree_id: 0
[May 22 02:20:13 990619][JOBS ][336070][info ] - job_id_external: 1516164009162948694 reservation_key: job_id_sharp: 217 - Set root AN: Mellanox Technologies Aggregation Node GUID:0xfc6a1c03005886c8 on tree_id: 512
[May 22 02:20:13 995078][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516164009162948694 reservation_key: job_id_sharp: 217
[May 22 02:20:14 800554][GENERAL][336070][info ] - Begin Job: job_id_external: 1516163989501960327 reservation_key: job_id_sharp: 218 absolute quota values requested. qps: 1 user_data_per_ost: 0 osts: 0 groups: 0 limited by max quota_percent: 100.00
[May 22 02:20:14 800634][GENERAL][336070][info ] - Begin Job: job_id_external: 1516163989501960327 reservation_key: job_id_sharp: 218, priority: 0 rails number: 1 requested trees: 1 ports number for all rails: 2 child_index_per_port: 1 quota_percent: 3.00 qps_percent: 0.00 is_mcast_enabled: 0 request_sat: 1 request_rmc: 0 req_feature_mask 0x0000000000000009
[May 22 02:20:14 806695][JOBS ][336070][info ] - job_id_external: 1516163989501960327 reservation_key: job_id_sharp: 218 - Allocated quota { q:2, ud:1024 o:23 g:23 } child_index_per_port: 1 tree_id: 0
[May 22 02:20:14 807002][JOBS ][336070][info ] - job_id_external: 1516163989501960327 reservation_key: job_id_sharp: 218 - Set root AN: Mellanox Technologies Aggregation Node GUID:0x9c059103007b9608 on tree_id: 0
[May 22 02:20:14 807041][JOBS ][336070][info ] - job_id_external: 1516163989501960327 reservation_key: job_id_sharp: 218 - Set root AN: Mellanox Technologies Aggregation Node GUID:0x9c059103007b9608 on tree_id: 512
[May 22 02:20:14 811523][GENERAL][336070][info ] - Send Begin Job reply (status: 0) for job_id_external: 1516163989501960327 reservation_key: job_id_sharp: 218
Sometimes I also hit an "unable to connect" error. The following is the corresponding output from sharp.arm.log and from the reduce_scatter run:
[May 22 02:09:51 587583][SR ][336074][info ] - Service `SHArP.AggregationManager' id 0x100002c900000002 is registered
[May 22 02:13:43 917991][FGRAPH ][336068][info ] - Start load /opt/ufm/files/log/opensm-smdb.dump file
[May 22 02:13:43 918154][MADS ][336068][info ] - -I- CsvFileStream opening file /opt/ufm/files/log/opensm-smdb.dump
[May 22 02:13:44 024345][FGRAPH ][336068][info ] - Switch port: PHRZ_A01_301-01-04 MQ9790-Spine05/P4 has become inactive
[May 22 02:13:44 024379][FGRAPH ][336068][info ] - Switch port: PHRZ_A01_401-01-06 MQ9790-Leaf01-3/P37 has become inactive
[May 22 02:13:44 029099][FGRAPH ][336068][info ] - Loading /opt/ufm/files/log/opensm-smdb.dump file ended successfully
[May 22 02:13:54 047166][FGRAPH ][336068][info ] - Start load /opt/ufm/files/log/opensm-smdb.dump file
[May 22 02:13:54 047332][MADS ][336068][info ] - -I- CsvFileStream opening file /opt/ufm/files/log/opensm-smdb.dump
[1716343707.920238] [10-21-0-1:1021882:2] cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0xef0000000, SYNC_MEMOPS) error: operation not supported
[1716343707.921613] [10-21-0-1:1021882:2] cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0xef0080000, SYNC_MEMOPS) error: operation not supported
[1716343707.921624] [10-21-0-1:1021882:2] cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0xef0080000, SYNC_MEMOPS) error: operation not supported
[1716343707.921637] [10-21-0-1:1021882:2] cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0xef0080000, SYNC_MEMOPS) error: operation not supported
[1716343707.922812] [10-21-0-1:1021882:2] cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0xef0530000, SYNC_MEMOPS) error: operation not supported
[1716343707.922821] [10-21-0-1:1021882:2] cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0xef0530000, SYNC_MEMOPS) error: operation not supported
[1716343707.922835] [10-21-0-1:1021882:2] cuda_copy_md.c:483 UCX WARN cuPointerSetAttribute(0xef0530000, SYNC_MEMOPS) error: operation not supported
10-21-0-1:1021882:1021931 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 18000 / HCA 0 (distance 3 <= 4), read 0
10-21-0-1:1021882:1021952 [0] NCCL INFO New proxy recv connection 90 from local rank 0, transport 3
10-21-0-1:1021882:1021931 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f9af4003740
10-21-0-1:1021882:1021952 [0] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 0 'mlx5_10
10-21-0-1:1021882:1021931 [0] NCCL INFO CollNet 00/0 : 0 [receive] via COLLNET/SHARP/0/GDRDMA
10-21-0-2:1875601:1875646 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 18000 / HCA 0 (distance 3 <= 4), read 0
10-21-0-2:1875601:1875667 [0] NCCL INFO New proxy recv connection 90 from local rank 0, transport 3
10-21-0-2:1875601:1875646 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f1e50003740
10-21-0-2:1875601:1875667 [0] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 0 'mlx5_10
10-21-0-2:1875601:1875646 [0] NCCL INFO CollNet 00/0 : 8 [receive] via COLLNET/SHARP/0/GDRDMA
10-21-0-1:1021882:1021932 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 2a000 / HCA 1 (distance 3 <= 4), read 0
10-21-0-1:1021882:1021964 [1] NCCL INFO New proxy recv connection 90 from local rank 1, transport 3
10-21-0-1:1021882:1021932 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f87b4003740
10-21-0-1:1021882:1021964 [1] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 1 'mlx5_11
10-21-0-2:1875601:1875647 [1] NCCL INFO GPU Direct RDMA Enabled for GPU 2a000 / HCA 1 (distance 3 <= 4), read 0
10-21-0-2:1875601:1875675 [1] NCCL INFO New proxy recv connection 90 from local rank 1, transport 3
10-21-0-2:1875601:1875647 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f1e28003740
10-21-0-2:1875601:1875675 [1] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 1 'mlx5_11
10-21-0-1:1021882:1021932 [1] NCCL INFO CollNet 01/0 : 1 [receive] via COLLNET/SHARP/1/GDRDMA
10-21-0-2:1875601:1875647 [1] NCCL INFO CollNet 01/0 : 9 [receive] via COLLNET/SHARP/1/GDRDMA
10-21-0-2:1875601:1875667 [0] NCCL INFO NET/UCX: Worker address length: 60
10-21-0-1:1021882:1021952 [0] NCCL INFO NET/UCX: Worker address length: 60
10-21-0-1:1021882:1021952 [0] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 0 'mlx5_10
10-21-0-2:1875601:1875667 [0] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 0 'mlx5_10
[10-21-0-1:0:1021882 - context.c:657][2024-05-22 10:08:28] INFO job (ID: 1516164527714313610) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[10-21-0-1][May 22 10:10:38 232525][SMX ][1021985][error] - unable to connect to 10.101.21.42%6126 . Error 110 (Connection timed out)
[10-21-0-1][May 22 10:12:49 304510][SMX ][1021985][error] - unable to connect to 10.101.21.42%6126 . Error 110 (Connection timed out)
[10-21-0-1][May 22 10:12:49 304835][GENERAL][1021952][error] - failed to connect to AM - error -1 received
[10-21-0-1][May 22 10:12:49 306191][GENERAL][1021952][error] - unable to connect to AM
[10-21-0-1:0:1021882 unique id 1516164527714313610][2024-05-22 10:12:49] ERROR Failed to connect to Aggregation Manager (sharp_am) in sharp_create_job.
[10-21-0-1:0:1021882 - context.c:680][2024-05-22 10:12:49] ERROR sharp_create_job failed: Failed to connect to Aggregation Manager (sharp_am)(-53)
10-21-0-1:1021882:1021952 [0] sharp_plugin.c:320 NCCL WARN NET/IB : SHARP coll init error: Cannot create SHARP job(-11)
10-21-0-1:1021882:1021931 [0] NCCL INFO transport.cc:327 -> 2
10-21-0-2:1875601:1875667 [0] sharp_plugin.c:320 NCCL WARN NET/IB : SHARP coll init error: Cannot create SHARP job(-11)
10-21-0-2:1875601:1875646 [0] NCCL INFO transport.cc:327 -> 2
10-21-0-1:1021882:1021933 [2] NCCL INFO GPU Direct RDMA Enabled for GPU 3a000 / HCA 2 (distance 3 <= 4), read 0
10-21-0-1:1021882:1021959 [2] NCCL INFO New proxy recv connection 90 from local rank 2, transport 3
10-21-0-1:1021882:1021933 [2] NCCL INFO Connected to proxy localRank 2 -> connection 0x7f891c003740
10-21-0-1:1021882:1021959 [2] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 2 'mlx5_12
10-21-0-2:1875601:1875648 [2] NCCL INFO GPU Direct RDMA Enabled for GPU 3a000 / HCA 2 (distance 3 <= 4), read 0
10-21-0-2:1875601:1875664 [2] NCCL INFO New proxy recv connection 90 from local rank 2, transport 3
10-21-0-1:1021882:1021933 [2] NCCL INFO CollNet 02/0 : 2 [receive] via COLLNET/SHARP/2/GDRDMA
10-21-0-2:1875601:1875648 [2] NCCL INFO Connected to proxy localRank 2 -> connection 0x7f1e84003740
10-21-0-2:1875601:1875664 [2] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 2 'mlx5_12
10-21-0-1:1021882:1021964 [1] NCCL INFO NET/UCX: Worker address length: 60
10-21-0-2:1875601:1875675 [1] NCCL INFO NET/UCX: Worker address length: 60
10-21-0-2:1875601:1875648 [2] NCCL INFO CollNet 02/0 : 10 [receive] via COLLNET/SHARP/2/GDRDMA
10-21-0-1:1021882:1021964 [1] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 1 'mlx5_11
10-21-0-2:1875601:1875675 [1] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 1 'mlx5_11
[10-21-0-1:0:1021882 - context.c:657][2024-05-22 10:12:49] INFO job (ID: 1516164488794000556) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
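The "unable to connect to 10.101.21.42%6126 ... Connection timed out" lines suggest the compute node cannot reach sharp_am on the UFM node. A few basic reachability checks, run from a compute node, are sketched below; the sharp_hello install path and options are assumptions about a standard SHARP client install:

# Sketch: check that sharp_am on the UFM node is reachable from the compute node.
ping -c 3 10.101.21.42               # basic IP reachability over the management network
nc -zv 10.101.21.42 6126             # TCP port reported in the error (firewall / routing check)
# If the SHARP client tools are installed, sharp_hello runs a minimal SHARP job
# against the aggregation manager (install prefix assumed):
/opt/mellanox/sharp/bin/sharp_hello -d mlx5_10:1 -v 2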
Supplementary explanation: NCCL 2.20.3 added support for ReduceScatter and AllGather using IB SHARP. 10.101.21.42 is the UFM node; the following is its gv.cfg:
#
# Copyright (c) 2013-2023 NVIDIA CORPORATION & AFFILIATES. ALL RIGHTS RESERVED.
#
# This software product is a proprietary product of Nvidia Corporation and its affiliates
# (the "Company") and all right, title, and interest in and to the software
# product, including all associated intellectual property rights, are and
# shall remain exclusively with the Company.
#
# This software product is governed by the End User License Agreement
# provided with the software product.
#
[Default]
logfile = ufm.log
[Mode]
operation_mode = infiniband
[Server]
# default | allow_other_sm | sm_only
management_mode = default
# SM Management interface
fabric_interface = ib0
# disabled (default) | enabled (configure opensm with multiple GUIDs) | ha_enabled (configure multiport SM with high availability).
multi_port_sm = disabled
# When enabling multi_port_sm, specify here the additional fabric interfaces for OpenSM conf.
# Example: ib1,ib2,ib5 (OpenSM will support the first 8 GUIDs where first GUID will
# be extracted the fabric_interface, and remaining GUIDs from additional_fabric_interfaces.
additional_fabric_interfaces =
# UFMA interfaces and Ethernet discovery interfaces.
ufma_interfaces = eth0
# Used for monitoring the management interface by UFM health.
mgmt_interface = eth0
# Whether UFM health monitors the management interface.
monitor_mgmt_interface = false
ifc_delay = 0
ifc_retries = 2
ifc_timeout = 2
start_delay = 2
# enable / disable Resource Manager.
enable_rm = false
enable_predefined_groups = true
# off | local | remote
fabric_collector_mode = off
monitoring_mode = no
monitor_for_testing = no
mon_mode_discovery_period = 60
check_interface_retry = 5
# The number of times to try if IB fabric interface is down. The duration of each retry is 1 second.
ibport_check_retries = 90
ws_address = UNDEFINED
ws_port = 8088
ws_protocol = https
# REST server will listen for requests on that port.
rest_interface = 127.0.0.1
rest_port = 8000
# Whether to authenticate web client by SSL client certificate or username/password.
client_cert_authentication = false
# UFM server receive fabric configuration from open SM plugin via that port number.
osm_plugin_port = 8081
# Port number that UFM server is listening for traps from open SM plugin.
osm_traps_listening_port = 8082
# Define the methodology UFM handles SMTraps. False for the old way(SOAP server is a thread the Model Main activate), True for the nwe way(SOAP server is a seperate subprocess communicate with ModelMain with multiprocessQueue).
use_new_traps_handler = true
# The SOAP server wakes every [osm_traps_debounce_interval] seconds and transfer the traps to the ModelMain
osm_traps_debounce_interval = 10
# The max amount of traps would be transfered from SOAP server to the UFM. if osm_traps_throttle_val is equal to 0 every time the SOAP server transfer all of the traps he got during the period.
osm_traps_throttle_val = 1000
# report_events that will determine which trap to send to ufm all/security/none.
report_events = all
# This parameter defines the polling frequency of default session in seconds, from 10 to 60
dashboard_interval = 30
# This parameter defines the polling frequency of default session in seconds, minimum default session interval is defined by 'minimum_sample_interval' parameters.
minimum_sample_interval = 10
# Minimal possible interval for monitoring session should be exposed to GUI (under site). GUI will not allow put value that lower than this.
# Server will check monitoring session and replace lower values by this.(in seconds, from 1 to 10)
minimal_collection_interval = 2
# Default interval for monitoring session should be exposed to GUI (under site). The value will appear in Monitoring session dialog.
# (in seconds, from 3 to 15).
default_collection_interval = 5
# If to issue events on internal ports.
events_on_internal_ports = no
# The events' level in case events are suppressed (the possible levels are disable_all_events, enable_critical_events, and enable_all_events).
# The entire feature can be turned off using the level "enable_all_events".
suppress_events_level = enable_critical_events
# The amount of time in seconds which events are suppressed.
suppress_events_timeout = 30
# The GUI has some polling APIs, in case of the number of these APIs failures exceeds the dobounce_counter; then the GUI will show an alert.
# in case the counter is 0; no alerts will be shown at all.
polling_errors_alerts_debounce_counter = 10
# Remove a device from the alerted group if the severity goes down to INFO.(default = yes).
auto_remove_from_alerted = yes
# Backup folder location.
backup_folder = /opt/ufm/backup
# Interval for checking SM location.
sm_check_interval = 600
# Interval for checking fabric non-optimal links.
non_opt_links_check_interval = 300
# Suppress the SM forceLinkSpeed option due to Ethernet Gateway(s) in the fabric (the default value is true).
suppress_force_link_speed_if_ipr = false
# python executable. DEFAULT = which python3.
python_exe = DEFAULT
# Whether to track core dumps or no.
track_core_dumps = no
core_dumps_directory = /tmp
# Inband firmware upgrade group size.
inband_fw_upgrade_group_size = 10
# How often will UFM health watchdog wake up and check UFM health module
ufm_health_watchdog_frequency_in_minutes = 5
# How long UFM health waiting for web response from UFM prior to restart
ufm_health_model_main_web_request_timeout = 120
# The maximal percentage of used disk space allowed to start UFM. If more than this is used, UFM
# won't start.
used_disk_percents_to_start = 96
# Timeout (in seconds) for every operation done by UFM Server to Mysql Deamon (store,update,delete).
mysql_timeout = 20
# If enable - HCAs will be grouped to node with common node description.
multinic_host_enabled = true
# If enable - HCA that has two ports and one of ports used as eth port will set number of ports to 1.
mixed_hca_mode = false
# Black list (comma separated) for expected multinic host node descriptions which
# should be avoided for multinic host creation (while host is starting up.
# For example: localhost,generic_name_1,generic_name_2)
exclude_multinic_desc = localhost
# Time interval that port did not get accounting info and should be reset.
default_hca_num_of_ports = 2
ibpm_counters_reset_interval = 300
# SMClient consumer_timeout - how many time to wait fore respond from Consumer.
sm_consumer_request_timeout = 600
# Set optional Site name to be shown in event sent to syslog.
site_name =
expose_site_name = false
run_debug_mode = no
# scp remote connect-timeout
scp_connect_timeout = 5
# Maximum number of ntework views that user can store.
max_user_views = 20
# Apache Default timeout is set to 300 seconds. For big file transfer need to change to 1000
http_proxypass_timeout = 300
# By disabling this flag the ports enable/disable will not be persistent.
# The persistent port enable/disable will be applied only for managed switches ports.
persistent_port_operation = true
# In case UFM is in restarting process, we want to make sure UFM main process is
# terminated before starting a new instance.
# In case model main is still alive we send sigkill watchdog_number_of_retries times,
# and wait for watchdog_interval_time seconds.
watchdog_number_of_retries = 5
watchdog_interval_time = 0.5
# possible values: True, False
xdr_enabled = False
# By enabling this flag the user will be able to work with forge IB anti-spoofing APIs. (Enabling tenants policy manager in SM)
tenants_policy_enabled = False
[FabricAnalysis]
# enable_on_startup - if enabled, running fabric analysis after initial delay.
enable_on_startup = true
# enable_periodically - if enabled, running fabric analysis periodically using interval or fixed time
enable_periodically = true
# initial_delay (in minutes) - the initial delay for running fabric analysis for the first time after UFM was started.
initial_delay = 5
# scheduling_mode possible values: fixed_time/interval.
scheduling_mode = interval
# unmanaged_switches_interval (in minutes) - time interval between 2 sequential runs of fabric analysis for unmanaged switches.
# If set to lower than zero feature will be disabled.
unmanaged_switches_interval = 180
# Configure ibdiagnet whether to enable warnings for high BER ports.
enable_ber_warnings = true
# ibdiagnet run at a fixed time (example: 23:17:35).
fixed_time = 23:30:00
# ibdiagnet periodic run interval for cable discovery - runs only if a links were added to the fabric(in minutes).
periodic_discovery_interval = 5
# Timeout for ibdiagnet run time (in seconds).
ibdiagnet_timeout = 300
# Discovered switch ip protocol to use: 4 for IPv4 and 6 for IPv6.
discovered_switch_ip_protocol = 4
[GarbageCollector]
# Enable garbage collector.
enable = true
# Interval in minutes for running garbage collector manually.
collect_interval = 15
[SubnetManager]
# Event plugin name(s).
event_plugin_name = osmufmpi
# Options string that would be passed to the plugin(s).
event_plugin_options = (null)
# This parameter defines if include gateway ports in partitions.conf file.
# Valid Values: auto_global (default) - include Gateway port, none - not include.
gateway_port_partitioning = none
# This parameter defines how configure MC groups in partitions.conf
# 0x0 - do not create IPoIB MC groups in advance
# 0x1(default) - create only IPv4 MC groups in advance
# 0x2 - create only IPv6 MC groups in advance
# 0x3 - create IPv4 and IPv6 MC groups in advance
default_network_type = 0x1
sm_config_file = /opt/ufm/conf/sm_definitions.ini
sm_guid_desc_mapping_file = /opt/ufm/conf/sm_guid_desc_mapping.cfg
super_switch_config_file_path = /opt/ufm/files/conf/super_switches_configuration.cfg
# If manual_qos is true, qos_conf won't be override by UFM.
manual_qos = false
# If to generate random sa_key every time opensm restarts.
randomize_sa_key = false
# Manage m_key per port.
m_key_per_port = false
global_m_key_seed = 0x0000000000000000
# Supported routing engine names.
supported_routing_engine_names = minhop,updn,dnup,file,ftree,pqft,lash,dor,torus-2QoS,sssp,dfsssp,chain,dfp,dfp2,ar_dor,ar_updn,ar_ftree,ar_torus,kdor-hc,kdor-ghc,auto
# Static SM LID, This field has 2 options:
# 1- Zero value (Default): Disable static SM lid functionality and allow the SM to run with any lid.
# Example: sm_lid = 0
# 2- Non-zero value: Enable static SM lid functionality so SM will use this lid upon UFM startup.
# Example: sm_lid = 100
sm_lid = 0
# This parameter defines if subnet merger is enabled for UFM fabric.
subnet_merger_enabled = false
[TrackConfig]
# Track config files changes.
track_config = true
# Possible options are (comma-separated) UFM, SM, SHARP, Telemetry. Or ALL for all the files.
track_conf_files = ALL
[Sharp]
# This parameter defines if sharp process will be running, or not.
# Default is false, no need to run sharp aggregation manager.
# sharp_enabled = false
sharp_enabled = true
# if set to true and sharp is enabled, sharp telemetry will be exposed in the secondary telemetry
sharp_telemetry = true
enable_sharp_allocation = false
# Parameter to set SHARP AM smx_sock_interface. If not defined - UFM fabric interface will be taken.
am_interface =
# Interval for checking if SHARP AM is responsive.
check_interval = 20
# Number of max retries for reporting the SHARP AM is not responsive.
max_retries = 3
# Optional sharp api logging levels: FATAL, ERROR, WARNING, INFO, DEBUG, TRACE.
sharp_api_log_verbosity = WARNING
# The timeout represents the duration, measured in seconds, within which a response is expected from the SHARP smx Python blocking APIs.
smx_response_timeout = 5
# The timeout represents the duration, measured in seconds, within which a response is expected from the SHARP smx Python to get the sharp_am status.
smx_sharp_status_timeout = 30
# This flag determines whether to utilize the UFM database for storing SHARP reservations or to rely on sharp_am.
# By default, it uses sharp_am for reservation storage.
use_sharp_storage = true
[Disabled]
nodes =
ports =
HA_nodes =
[dhcp]
dhcp_enable = no
# Use gPXE with connectX as default.
guid_prefix = 20:00:55:00:41:fe:80:00:00:00:00:00:00
domainname = Mellanox.Com
filename = pxelinux.0
local_dev = /dev/sda2
dns_servers = 1.1.1.1, 1.1.1.1
max_dhcp_networks = 63
[Notifications]
# Comma separated clients IPs.
snmp_listeners =
enable_snmpd = false
[MngNetwork]
# Possible values: 'full', 'limited', 'both'
default_membership = full
# Possible values: 2, 4. Unit in KB.
mtu_limit = 2
# Possible values: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.
service_level = 0
# Possible values: 2.5, 10, 30, 5, 20, 40, 60, 80, 120, 14, 56, 112, 168, 25, 100, 200, 300. Unit in Gbps.
rate_limit = 2.5
[Logging]
# Optional logging levels: CRITICAL, ERROR, WARNING, INFO, DEBUG.
level = WARNING
smclient_level = WARNING
event_log_level = INFO
rest_log_level = INFO
authentication_service_log_level = INFO
# Additional option to print content of packet received from ibpm (possible values: yes/no, default=no).
dump_ibpm_packet = no
#syslog configuration (syslog_addr)
# For working with local syslog, set value to: /dev/log
# For working with external machine, set value to: host:port
syslog_addr = /dev/log
# The configured log_dir must have read, write and execute permission for ufmapp user (ufmapp group).
log_dir = /opt/ufm/files/log
# Main ufm log.
syslog = false
ufm_syslog = false
smclient_syslog = false
event_syslog = false
rest_syslog = false
authentication_syslog = false
syslog_level = WARNING
max_history_lines = 100000
# This section enable setting up the log files rotate policy,
# By default logrotate is running once a day by cron scheduler.
[logrotate]
# max_files specifies the number of times to rotate a file before it is deleted ( this definition will be applied to
# SM and SHARP Aggregation Manager logs, running in the scope of UFM ).
# A count of 0 (zero) means no copies are retained. A count of 15 means fifteen copies are retained (default is 15)
max_files = 15
# With max_size, the log file is rotated when the specified size is reached ( this definition will be applied to
# SM and SHARP Aggregation Manager logs, running in the scope of UFM ). Size may be specified in bytes (default),
# Kilobytes (for example: 100k), or megabytes (for example: 10M). if not specified logs will be rotated once a day.
max_size = 100M
[CSV]
write_interval = 0
ext_ports_only = no
max_files = 5
[SMsnmp]
smip = 127.0.0.1
gcommunity = public
scommunity = private
polltout = 30
poll_check_mult = 20
[SSH]
port = 22
command_timeout = 20
dsa_key_file = ~/.ssh/id_dsa
###### Definition of all default Access Points at site level ######
# Default ssh access point for all servers(hosts).
[SSH_Server]
port = 22
# Default ssh access point for all switches.
[SSH_Switch]
port = 22
# Default ipmi access point for all devices
[IPMI]
port = 623
# Default snmp access point for all devices.
[SNMP]
port = 161
# snmp poll timeout.
timeout = 10
# Default telnet access point for all devices.
[TELNET]
port = 23
timeout = 10
# Default MLNX-OS access point for all Mellanox switches.
[MLNX_OS]
protocol = https
port = 443
timeout = 10
###### End of Access Points Definition ######
[SrvMgmt]
# All in seconds.
wait_to_ping = 150
wait_to_ping_xen = 40
wait_to_ping_mlnx = 500
wait_to_ping_fit = 560
fail_time = 40
wait_to_ssh = 40
systems_poll = 180
systems_poll_init_timeout = 5
# To avoid sysinfo dump overloading and multiple writing to host.
# Switches sysinfo will be dumped to disc in json format every set in this variable.
# Sysinfo request. If set to 0 - will not be dumped, if set to 1 - will be dumped each sysinfo.
sysinfo_dump_interval = 5
sysinfo_dump_file_path = /opt/ufm/files/log/sysinfo.dump
# In case we want to msgspec to json the data. which will be quicker
use_msgspec_json = false
# Delay between 2 sequential sysinfo calls of 2 systems.(in seconds)
systems_poll_delay = 0.15
# When set to true, system info polling is activated.
systems_poll_enabled = true
systems_poll_validation = true
events_time_interval = 30
[IBPM]
# Reset mode options: Reset_Every_Poll, Reset_On_Threshold, No_Reset.
conf_file = /opt/ufm/data/records.conf
slow_interval = 30
comp_ib_slow_interval = 30
collect_hosts = true
reset_mode = Reset_Every_Poll
counter_reset_threshold = 85
ibpm_max_polling_rate = 5
max_timeouts_num = 5
ibpm_timeout_ms = 50
# Max queue size for received ibpm packets
receive_queue_size = 300000
# By default, UFM is not setting the UDP buffer size, for large scale fabrics, it
# Is recommended to increase the buffer size to 4M.
set_udp_buffer = no
# UDP buffer size
udp_buffer_size = 4194304
# Printing frequency for ibpm error packets
packets_error_frequency = 1000
[UFMAgent]
max_stats_data_size = 16384
max_filename_size = 256
max_string_size = 512
timeout = 10
default_cap_port = 1235
default_ufma_port = 6306
max_message_size = 65535
ufm_key = 0x4e20
# In case ufmagent works in ipv6 please put this multicast address FF05:0:0:0:0:0:0:15F
mcast_addr = 224.0.23.172
mcast_resend_count = 1
poll_interval = 300
# If server doesn't get discovery packets from ufma - it doesn't reset the ip address of host (in case value is "yes").
# Value option: "yes" and "no" (default value is "no")."no" means update host upon arrival of each discovery message.
keep_discovered_ip = no
ccm_poll_interval = 30
use_old_agent = no
oa_default_mcast_port = 8000
oa_listening_port = 15800
oa_mcast_addr = 224.0.0.1
ufma_mcast_ttl = 64
enable_ufma = yes
###### Device Specific Configuration ######
[server]
config_timeout = 40
###### End of Device Specific Configuration ######
[Monitoring]
history_configured = false
history_enable = false
# History mode defines UFM permissions read/write MH data.
# Valid Values: RW - read-write mode, meaning UFM is writing and retrieving data from MH Engine.
# RO - read-only mode, meaning UFM is not writing any new data to MH Engine but can read data.
mode = RW
monitoring_engine_address = 127.0.0.1
monitoring_engine_port = 8089
history_db_location = local
save_default_session_interval = 60
history_report_timeout = 600
save_data_retries = 4
save_data_timeout = 5
[SystemMonitoring]
# This section responsible to manage collecting and exposing the System utilization metrics
# 1. MIN/MAX/AVG for CPU/RAM usage percentage
# 2. Rate per second for IO read/write operations for the UFM Model Main process
# 3. REST APIs stats, # of calls for each UFM REST API, and the response time average for each API
system_monitoring_metrics = True
# This interval is to control the collecting interval for the system utilization metrics (CPU/RAM) usage percentage
# By default, Each 30 seconds, the system will collect the current values of the system metrics
system_utilization_collector_interval = 30
# This interval is to control the collecting interval for the IO operations counters read/write in bytes/count for the UFM Model Main process
# By default, Each 300 seconds, the system will collect the current values of the IO operations counters to calculate the rate
io_operations_counters_collector_interval = 300
# The collectors will keep the collected metrics for the last 24 hours by default
max_history_time = 86400
# REST APIs histogram configurations
# for the below default configurations the histogram time buckets of the REST APIs
# step_size = max_response_time_for_rest_histogram/num_of_buckets_for_rest_histogram = 1
# time buckets will be = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
# Number of REST APIs response time histogram buckets
num_of_buckets_for_rest_histogram=10
# Max response time in seconds that should be represented in the REST APIs response time histogram
max_response_time_bucket_for_rest_histogram=10
# Events Gauge configurations
# This interval is to control the collecting interval for the event metrics
# By default, Each 60 seconds, the system will collect the current count of the events
events_operations_counters_collector_interval = 60
# The collectors will keep the collected metrics for the last 1 week by default
max_events_history_time = 604800
[FBCollectorManager]
# Optional logging levels: CRITICAL, ERROR, WARNING, INFO, DEBUG.
log_level = WARNING
ibdiagnet_timeout = 300.0
output_dir = /opt/ufm/tmp
keep_alive_timeout = 60.0
# A unique name identifying the site/fabric.
# Fabric collector manager will reject fabric collectors that report a site name different than the specified one.
site_name =
# Fabric collector manager will listen for requests on this port
fcm_web_server_port = 8083
[MultiTenant]
# User Type: Local, Remote.
default_user_type = Local
# The number of monitoring sessions permitted to be opened per tenant user.
max_monitoring_sessions = 4
# Number of tenant clients allowed concurrently connect to UFM server, maximum 20.
max_concurrent_clients = 20
[Client]
# Number of seconds server will wait to access from client before close client connection.
connection_timeout = 90
[Events]
# Time interval for keeping events (minimum 10 seconds, maximum 24 hours).
sending_interval = 5
# Optional units: minute, second,hour.
sending_interval_unit = minute
# If cyclic buffer is true, older events will be dropped,.
# Otherwise newer events will be dropped (if reaches max count).
cyclic_buffer = false
# Maximum number of events to be sent in one mail (buffer size).
# Range 1-1000.
max_events = 100
# Group events in mail by severity.
# Or order events by time-stamp.
group_by_severity = true
# Option to suppress link down events in case that switch is going down.
# Possible values: true/false (default = false).
suppress_link_down_events_upon_switch_down = false
action_execution_interval_unit = minute
action_execution_interval = 5
max_ufm_events = 100
max_restored_events = 50
events_persistency_enabled = false
[DailyReport]
# top_x specifies the number of results per each top x chart.
# Max number can be 20.(default is 10).
top_x = 10
# max_reports specifies the number of reports to save.
# A count of 0 (zero) means no copies are retained.(default and max is 365).
max_reports = 365
# Time interval in minutes after midnight.
# When passed mail will not be sent.
mail_send_interval = 60
log_level = INFO
daily_report_enabled = false
attach_fabric_health_report = true
fabric_health_report_timeout = 900
# Max attached file size in bytes, default is 2M (2097152 Bytes).
max_attached_file_size = 2097152
# The interval for report. end_hour is excluding.
start_hour = 0
end_hour = 24
syslog = false
# Valid arguments for send_email_method is TO/CC/BCC.
send_email_method = BCC
[UnhealthyPorts]
enable_ibdiagnet = true
log_level = INFO
syslog = false
# scheduling_mode possible values: fixed_time/interval.
# If fixed_time - ibdiagnet will run every 24 hours on the specified time - <fixed_time>.
# If interval - ibdiagnet will run first time after <start_delay> minutes from UFM startup and every <interval> hours (default scheduling mode).
scheduling_mode = interval
# First ibdiagnet start delay interval (minutes)
start_delay = 5
# ibdiagnet run interval (hours)
interval = 3
# ibdiagnet run at a fixed time (example: 23:17:35)
fixed_time = 23:00:00
# By enabling this flag all the discovered high ber ports will be marked as unhealthy automatically by UFM
high_ber_ports_auto_isolation = false
# Auto isolation mode - which type of ports should be isolated.
# Options: switch-switch,switch-host,all (default: switch-switch).
auto_isolation_mode = switch-switch
# Trigger Partial Switch ASIC Failure whenever number of unhealthy ports exceed the defined percent of the total number of the switch ports.
switch_asic_fault_threshold = 20
[Action]
max_queue_size = 1000
# Timeout in seconds for firmware rest action of unmanaged switches.
firmware_reset_timeout = 2
[Job]
# Job lifetime in hours.
job_lifetime = 720
job_timeout = 20
# Jobs garbage collector circle interval in hours.
garbage_collector_interval = 0.5
gc_squeeze_threshold = 0
gc_remove_threshold = 0
[PeriodicIbdiagnet]
# Directory location where outputs are written
periodic_ibdiagnet_dir_location = /opt/ufm/files/periodicIbdiagnet
# Minimum time between two tasks (in minutes).
minimum_task_interval = 60
# Maximum number of tasks running simultaneously.
max_optional_tasks = 5
# Maximum number of outputs to save per task (oldest gets deleted).
max_saved_outputs = 5
# Disk usage percentage threshold from which UFM will delete old tasks results.
disk_usage_threshold = 80
[Plugins]
events_forwarder_enabled = false
# Enabling this flag will show a button under the plugins management tab in UFM GUI settings to upload/pull new docker image.
upload_plugins_images_via_gui = false
# Supported plugin engines: docker, enroot
plugin_engine = docker
[Virtualization]
# By enabling this flag the UFM will discover all the virtual ports assigned for all hypervisors in the fabric.
enable = false
# Interval for checking whether any virtual ports have been changed in the fabric.
interval = 60
[Telemetry]
# Possible values:telemetry, ibpm.
telemetry_provider = telemetry
prometheus_port = 9001
receive_queue_size = 100000
history_enabled = True
# Optional values:true,false, affects sample_interval control.
manual_config = false
# Query for additional telemetry instances. format should be "http://<IP>:<port>/csv/<cset_name> http://<IP_2>:<port_2>/csv/<cset_name_2>".
additional_cset_urls =
# seconds to wait after sending SIGTERM and before using SIGKILL to the telemetry processes
telemetry_termination_timeout=5
# Parameters for secondary telemetry instance (enabled by default).
# For editing secondary telemetry instance configuration, please refer to /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
# For editing secondary telemetry instance counter set, please refer to /opt/ufm//files/conf/secondary_telemetry/prometheus_configs/cset/enterprise_low_freq.cset
secondary_telemetry = true
secondary_endpoint_port = 9002
# if set to true, secondary telemetry will expose data on disabled ports
secondary_disabled_ports = true
# The telemetry can be restarted only once every x minutes (if a topology change occurred).
clx_restart_max_rate = 5
# The local ip address to bind to. default set to 0.0.0.0 for IPv4 support.
# for IPv6 please set 0:0:0:0:0:0:0:0
secondary_ip_bind_addr = 0.0.0.0
[DynamicTelemetry]
# Maximum number of simultaneous running UFM Telemetries.
max_instances = 5
# Delay time between the start of two UFM Telemetry instances, in minutes.
new_instance_delay = 5
# The time to wait before updating the discovery file of each telemetry instance, in minutes.
update_discovery_delay = 10
# Telemetry endpoint timeout, in seconds.
endpoint_timeout = 5
# Telemetry bringup tool timeout, in seconds.
bringup_timeout = 60
# Initial port for the available range of ports (range(initial_exposed_port, initial_exposed_port + max_instances)).
initial_exposed_port = 9003
[PeriodicTopologyCompare]
# Interval in which periodic task of comparing topology to master will be
# performed. Units in days.
master_periodic_interval = 1
# Interval in which UFM will check if topology is stable and suggest to
# Set master for initial setting master and for ongoing topology changes.
# Units in hours.
master_stable_period = 8
replace_master_automatically = false
max_reports_saved = 8
ibdiagnet_running_threshold = 180
[Multisubnet]
#Possible values: true, false.
multisubnet_enabled = false
#Possible role values: provider, consumer.
multisubnet_role =
# OPTIONAL PROVIDER.
# Descriptive name of the IB fabric, managed by UFM, if not specified - a random name will be generated.
multisubnet_site_name =
# The port on the provider on which it is serving the topology data.
multisubnet_topology_provider_port = 7102
# OPTIONAL CONSUMER.
# IP addresses of providers delimited by space, e.g., "10.209.36.135 10.209.36.170".
multisubnet_provider_ips =
# Ports amount and order for all optional parameters should be compatible with the
# provider IP addresses amount and order configured for "multisubnet_provider_ips".
# Topology ports delimited by space, e.g., "7102 7103".
# Default value for all the providers is 7102.
multisubnet_topology_provider_ports =
# Telemetry http endpoint ports on providers to get telemetry data delimited by space, e.g., "9001 10001".
# Default value for all the providers is 9001.
multisubnet_telemetry_endpoint_ports =
# Providers ports for the proxy to request data delimited by space, e.g., "443 444".
# Default value for all the providers is 443.
multisubnet_proxy_provider_ports =
# The port on the consumer where the aggregated proxy is listening.
multisubnet_proxy_port = 8301
# Events update interval in seconds.
multisubnet_events_polling_interval = 10
# Jobs update interval in seconds.
multisubnet_jobs_polling_interval = 10
# Reports update interval in seconds.
multisubnet_reports_polling_interval = 5
# Interval to poll changes in systems data of the providers.
multisubnet_systems_polling_interval = 30
# Check if providers are up and available.
multisubnet_ips_validation = true
# Size of chunk used by proxy while downloading sysdumps from providers.
multisubnet_chunk_size = 524288
[TopologyLevels]
enable = false
levels = server,leaf,spine,core
[UsageStatistics]
enable = false
[CPUAffinity]
# True for activating the CPU affinity feature, false for deactivating this feature, and let for each process all the available CPUs.
is_cpu_affinity_enabled = false
# The following attributes get a list of CPUs that will be the affinity for these processes. the format should be a comma-separated list of CPUs for example 0,3,7-11.
# The ModelMain should have 4 cores but not more than 5 cores.
model_main_cpu_affinity = 1-4
# The SM should have as much cores as you can give it, and needs to isolate between the ModelMain cores and the SM cores.
sm_cpu_affinity = 5-19
# SHARP can be assigned the SM affinity.
sharp_cpu_affinity = 5-19
# The telemetry should be assigned with 3-4 CPUs.
telemetry_cpu_affinity = 22-23
[AuthProxy]
# Defaults to false, but set to true to enable remote proxy authentication.
auth_proxy_enabled = false
# HTTP Header name that will contain the username.
auth_proxy_header_name = X_WEBAUTH_USER
# Set to `true` to enable auto sign up of users who do not exist in UFM DB. Defaults to `true`.
auth_proxy_auto_sign_up = true
# HTTP Header name that will contain the user role (needed in case auto_sign_up is enabled).
auth_proxy_header_role = X_WEBAUTH_ROLE
# Limit the locations where auth proxy requests can come from using a list of well known IP addresses.
# In case this field is not configured, requests will be rejected.
# This can be used to prevent users spoofing the X_WEBAUTH_USER header.
# Example `whitelist = 192.168.1.1, 192.168.1.0/24, 2001::23, 2001::0/120`.
auth_proxy_whitelist =
[AuthService]
auth_service_enabled = false
auth_service_interface = 127.0.0.1
auth_service_port = 8087
basic_auth_enabled = true
session_auth_enabled = true
token_auth_enabled = true
[AzureAuth]
azure_auth_enabled = false
# the active session will be expired after X hours, default is 8 hours
azure_session_lifetime = 8
# TENANT ID of app registration
TENANT_ID =
# Application (client) ID of app registration
CLIENT_ID =
# Application's generated client secret
CLIENT_SECRET =
[NetworkFastRecovery]
is_fast_recovery_enabled = false
# This will be supported by the Network Fast Recovery.
network_fast_recovery_conditions = SWITCH_DECISION_CREDIT_WATCHDOG,SWITCH_DECISION_RAW_BER,SWITCH_DECISION_EFFECTIVE_BER,SWITCH_DECISION_SYMBOL_BER
[RolesAccessControl]
roles_access_control_enabled = true
NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
NCCL INFO NET/Plugin: Failed to find ncclNetPlugin symbol (>= v5). ncclNetPlugin symbols v4 and lower are not supported.
NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.3+cuda12.4
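To confirm which network plugin library NCCL actually loads and whether it exports the v8 entry points named in these messages, a quick check along these lines can help; the /nfs path is only a guess based on the LD_LIBRARY_PATH used in the mpirun command above:

# Sketch: locate the plugin library and list the symbols NCCL probes for.
find /nfs -name 'libnccl-net*.so*' 2>/dev/null
nm -D /nfs/libnccl-net.so | grep -E 'ncclNetPlugin|ncclCollNetPlugin'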