Mellanox / nccl-rdma-sharp-plugins

RDMA and SHARP plugins for nccl library
BSD 3-Clause "New" or "Revised" License

SHARP fails because sharp_coll_comm_init returns an error #151

Open shanleo1986 opened 3 months ago

shanleo1986 commented 3 months ago

Hi developers, I have built the SHARP environment, and the SHARP plugin loads successfully. However, when the function sharp_coll_comm_init runs, it returns an error, so NCCL ends up using the P2P NET path instead. Could you help me analyze this issue? Thank you!

The following is the error log:

[C25L18:0:24972 - context.c:702] INFO job (ID: 1201360188720575732) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:24972 - context.c:895] INFO sharp_job_id:12 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[C25L18:0:24972 - context.c:899] INFO sharp_job_id:12 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
C25L19:19373:19491 [3] NCCL INFO Sharp rank 1/2 initialized on mlx5_5:1
C25L18:24972:25066 [3] NCCL INFO Sharp rank 0/2 initialized on mlx5_5:1
[C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:19373 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)

C25L19:19373:19491 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
C25L18:24972:25066 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
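When a job spans many nodes, it can help to summarize which hosts hit the SAT tree lock failure. The following is a minimal Python sketch, not part of the plugin or of SHARP, that scans log text for the "Failed to lock SAT tree" lines shown above. The function name `find_sat_lock_failures` is hypothetical, and reading the bracketed prefix as host:rank:pid is an assumption based on the sample log.

```python
import re
from collections import defaultdict

# Matches SHARP library lines such as:
#   [C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
# Assumption: the bracketed prefix is host:rank:pid, as the sample log suggests.
LOCK_FAIL = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<rank>\d+):(?P<pid>\d+) - [^\]]+\] "
    r"ERROR Failed to lock SAT tree"
    r"\(ID:(?P<tree>0x[0-9a-fA-F]+) ret:(?P<ret>0x[0-9a-fA-F]+)\)"
)


def find_sat_lock_failures(log_text):
    """Return {tree_id: [(host, rank), ...]} for each SAT lock failure found."""
    failures = defaultdict(list)
    for match in LOCK_FAIL.finditer(log_text):
        failures[match.group("tree")].append(
            (match.group("host"), int(match.group("rank")))
        )
    return dict(failures)


# Sample lines copied from the log in this issue.
SAMPLE = """\
C25L18:24972:25066 [3] NCCL INFO Sharp rank 0/2 initialized on mlx5_5:1
[C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:19373 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
"""

if __name__ == "__main__":
    for tree, ranks in find_sat_lock_failures(SAMPLE).items():
        print(f"SAT tree {tree}: lock failed on {ranks}")
```

In this log both ranks fail on the same tree ID (0x40), which suggests the lock is refused for the whole group rather than for a single host.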

AddyLaddy commented 3 months ago

Do your IB network switches support SHARP? Have you enabled the SHARP feature in the UFM/OpenSM configuration?

shanleo1986 commented 3 months ago

Yes, my IB network switch supports SHARP, and the HCA is a CX7:

[root@C25L18 shanxs]# lspci | grep -i mell
03:00.0 Infiniband controller: Mellanox Technologies MT28861
23:00.0 Infiniband controller: Mellanox Technologies MT28861
44:00.0 Infiniband controller: Mellanox Technologies MT28861
64:00.0 Infiniband controller: Mellanox Technologies MT28861
[root@C25L18 shanxs]#

The sharp_am service is also already enabled on the OpenSM master:

[root@C27L1 ~]# ps aux |grep opensm
root 8373 0.0 0.0 217848 1076 pts/0 R+ 16:59 0:00 grep --color=auto opensm
root 18074 0.2 0.0 6932984 24728 ? Sl Mar17 2:33 /usr/sbin/opensm --daemon
[root@C27L1 ~]# service sharp_am status
Redirecting to /bin/systemctl status sharp_am.service
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.0.0
   Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/sharp_am.service.d
           └─Service.conf
   Active: active (running) since Sun 2024-03-17 22:07:23 UTC; 18h ago
 Main PID: 18222 (sharp_am)
    Tasks: 40 (limit: 26213)
   Memory: 31.7M
   CGroup: /system.slice/sharp_am.service
           └─18222 /opt/hpc/software/mpi/hpcx/v2.12.0/sharp/bin/sharp_am -O -/opt/hpc/software/mpi/hpcx/v2.12.0/sharp/conf/sharp_am.cfg

Mar 17 22:07:23 C27L1 sharp_am[18222]: Package: sharp-rc3
Mar 17 22:07:23 C27L1 sharp_am[18222]: Version: 3.0.0
Mar 17 22:07:23 C27L1 sharp_am[18222]: Build Date: Jul 20 2022
Mar 17 22:07:23 C27L1 sharp_am[18222]: Last commit: cf51a32
Mar 17 22:07:23 C27L1 sharp_am[18222]: IBIS last commit: 3c41903
Mar 17 22:07:23 C27L1 sharp_am[18222]: Log verbosity: 2
Mar 17 22:07:23 C27L1 sharp_am[18222]: Syslog verbosity: 1
Mar 17 22:07:23 C27L1 sharp_am[18222]: Command line: /opt/hpc/software/mpi/hpcx/v2.12.0/sharp/bin/sharp_am -O -/opt/hpc/software/mpi/hpcx/v2.12.0/sharp/conf/sharp_am.cfg
Mar 17 22:07:24 C27L1 sharp_am[18222]: There is not a single tree that spans over all leafs.
Mar 17 22:07:24 C27L1 sharp_am[18222]: Built 2 trees.
[root@C27L1 ~]#
[root@C27L1 ~]#

Lzhang-hub commented 1 month ago

@shanleo1986 I have the same issue. Have you solved it? Also, nccl-tests runs normally with SHARP for me, but I get this error when running the Megatron-LM GPT-3 example.

liuxingbo12138 commented 4 weeks ago

Me too. I use NGC to run Megatron-LM with SHARP, and it fails. Did you resolve it?

Lzhang-hub commented 2 weeks ago

@liuxingbo12138 Try adding use_sharp=True to the initialize_model_parallel call.
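
For readers unsure where that argument goes, here is a rough sketch of the suggestion above. It assumes a Megatron-LM version whose `megatron.core.parallel_state.initialize_model_parallel` accepts a `use_sharp` keyword, and that `torch.distributed` has already been initialized. The helper names `sharp_parallel_kwargs` and `init_model_parallel_with_sharp` are hypothetical and only illustrate the call; the parallel sizes are placeholders.

```python
def sharp_parallel_kwargs(tensor_parallel_size=1, pipeline_parallel_size=1):
    """Build keyword arguments for initialize_model_parallel with SHARP enabled."""
    return {
        "tensor_model_parallel_size": tensor_parallel_size,
        "pipeline_model_parallel_size": pipeline_parallel_size,
        "use_sharp": True,  # the flag suggested in this thread
    }


def init_model_parallel_with_sharp(**sizes):
    """Call Megatron-LM's initialize_model_parallel with use_sharp=True."""
    # Imported lazily so the kwargs helper works without Megatron-LM installed.
    from megatron.core import parallel_state

    # Assumption: torch.distributed.init_process_group() has already been called.
    parallel_state.initialize_model_parallel(**sharp_parallel_kwargs(**sizes))
```

If you launch through Megatron-LM's own training scripts rather than calling this function directly, the same flag would need to reach the existing `initialize_model_parallel` call in that code path.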