NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL_IB_HCA supports arrays, why doesn't NCCL_IB_GID_INDEX support arrays? #890

Open firemiles opened 1 year ago

firemiles commented 1 year ago

Hello,

I'm new to NCCL. I'm using NCCL in Kubernetes with the rdma-device-plugin.

I add multiple HCAs to the Pod, e.g. NCCL_IB_HCA=mlx5_1:1,mlx5_2:1, and we also want to use NCCL_IB_GID_INDEX=3 for mlx5_1:1 and NCCL_IB_GID_INDEX=6 for mlx5_2:1.

Any advice on how I can do this?
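As background, the reason this cannot be expressed today is that NCCL_IB_GID_INDEX is a single scalar environment variable, so one value applies to every HCA listed in NCCL_IB_HCA. A minimal, purely illustrative sketch of that limitation (not NCCL source; the helper name is hypothetical):

```cpp
// Illustrative sketch only (not NCCL source): NCCL_IB_GID_INDEX is read as one
// scalar, so the same GID index applies to every HCA listed in NCCL_IB_HCA.
#include <cstdlib>

// Hypothetical helper: the single GID index that would be used for all HCAs.
static int gidIndexForAllHcas() {
  const char* env = std::getenv("NCCL_IB_GID_INDEX");
  return env ? std::atoi(env) : 0;  // one global value, no per-HCA override
}
```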

sjeaugey commented 1 year ago

That's not something NCCL supports indeed. It's the first time we get such a request. Shouldn't be too hard to implement, but still requires some work.

firemiles commented 1 year ago

That's not something NCCL supports indeed. It's the first time we get such a request. Shouldn't be too hard to implement, but still requires some work.

Thanks for your reply. We are using macvlan to virtualize the RDMA network card in our container scenario. When a container is attached to multiple macvlans, we run into inconsistent NCCL IB GID indexes, which limits how our RDMA network cards can be used.

I don't know if there is a plan to support this approach.

sjeaugey commented 1 year ago

Just to make sure I understand the problem, in your container your NIC has two IP addresses (which map to the same physical NIC) and each interface has its own GID index. So inside the container you have both GID Index 3 and GID Index 6, which both work, but map to different VLANs, and you want to select GID 6 because you need to communicate with another GPU using that second VLAN.

Is that right?

firemiles commented 1 year ago

Just to make sure I understand the problem, in your container your NIC has two IP addresses (which map to the same physical NIC) and each interface has its own GID index. So inside the container you have both GID Index 3 and GID Index 6, which both work, but map to different VLANs, and you want to select GID 6 because you need to communicate with another GPU using that second VLAN.

Is that right?

Not quite.

A node has 4 RDMA network cards and 8 GPUs, and each RDMA network card allows the creation of several macvlans to solve the problem of insufficient RDMA. When Pod1 gets 4 GPUs, we try to add two RDMA macvlan interfaces, eth1 and eth2, to it. Their corresponding masters are mlx5_1 and mlx5_2, and their GID indexes are 3 and 6 respectively. I want Pod1 to use both eth1 and eth2 for GDR (GPUDirect RDMA) to increase bandwidth when running NCCL.

I don't know if this is the correct way to use it. Do you have any suggestions that could help?

sjeaugey commented 1 year ago

creation of several macvlans to solve the problem of insufficient RDMA.

What is the problem of "insufficient RDMA"?

If you have 8 GPUs and 4 NICs, I would assume we have 2 GPUs per NIC. So when you run on 4 GPUs, you should have 2 dedicated NICs.

Now, I'm not sure why the two NICs would be treated differently and have different GID Indexes.

firemiles commented 1 year ago

creation of several macvlans to solve the problem of insufficient RDMA.

What is the problem of "insufficient RDMA"?

If you have 8 GPUs and 4 NICs, I would assume we have 2 GPUs per NIC. So when you run on 4 GPUs, you should have 2 dedicated NICs.

Now, I'm not sure why the two NICs would be treated differently and have different GID Indexes.

We have fragmentation due to GPU scheduling. For example, some machines have one GPU left unused, which leads to 1 GPU + 1 RDMA NIC Pods, and the RDMA NICs are insufficient at that point. When two macvlans are created on one RDMA NIC, the GID indexes assigned to the two macvlans are different, so we cannot always guarantee that the two NICs in a Pod have the same GID index.

sjeaugey commented 1 year ago

Ah, I see. By "RDMA" you meant NIC (network card). So on each NIC you create 2 macvlans, hence 2 GID indexes, one for the first GPU and the other for the second GPU in case they're not in the same container.

But how would you set NCCL_IB_HCA then? Would you want to ask NCCL to use both GID Index 3 and 6, which means we'd need to support something like NCCL_IB_HCA=mlx5_0:1/3,mlx5_0:1/6,...?

firemiles commented 1 year ago

Ah, I see. By "RDMA" you meant NIC (network card). So on each NIC you create 2 macvlans, hence 2 GID indexes, one for the first GPU and the other for the second GPU in case they're not in the same container.

But how would you set NCCL_IB_HCA then? Would you want to ask NCCL to use both GID Index 3 and 6, which means we'd need to support something like NCCL_IB_HCA=mlx5_0:1/3,mlx5_0:1/6,...?

You are right. We would like support for something like NCCL_IB_HCA=mlx5_0:1/3,mlx5_1:1/6, or NCCL_IB_HCA=mlx5_0:1,mlx5_1:1 with NCCL_IB_GID_INDEX=3,6.
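To make the proposal concrete, here is a hedged sketch (not NCCL code) of how the suggested NCCL_IB_HCA=dev[:port[/gid]] list could be parsed; HcaSpec and parseHcaList are hypothetical names:

```cpp
// Hedged sketch, not NCCL code: one way the proposed "dev[:port[/gid]]" syntax
// could be parsed into per-HCA settings.
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

struct HcaSpec {
  std::string dev;   // e.g. "mlx5_0"
  int port = 1;      // default port
  int gidIndex = -1; // -1 = fall back to the global NCCL_IB_GID_INDEX
};

static std::vector<HcaSpec> parseHcaList(const std::string& list) {
  std::vector<HcaSpec> specs;
  std::stringstream ss(list);
  std::string item;
  while (std::getline(ss, item, ',')) {
    HcaSpec s;
    size_t slash = item.find('/');
    if (slash != std::string::npos) {              // "dev:port/gid"
      s.gidIndex = std::stoi(item.substr(slash + 1));
      item = item.substr(0, slash);
    }
    size_t colon = item.find(':');
    if (colon != std::string::npos) {              // "dev:port"
      s.port = std::stoi(item.substr(colon + 1));
      item = item.substr(0, colon);
    }
    s.dev = item;
    specs.push_back(s);
  }
  return specs;
}

int main() {
  for (const auto& s : parseHcaList("mlx5_0:1/3,mlx5_1:1/6"))
    std::printf("%s port %d gid %d\n", s.dev.c_str(), s.port, s.gidIndex);
  return 0;
}
```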

sjeaugey commented 1 year ago

Ok -- that's a bit more complicated. We'll need to make our GID selection code way more complex. We intended to do so at some point, but I can't give guarantees as to when that will land.

firemiles commented 1 year ago

Ok -- that's a bit more complicated. We'll need to make our GID selection code way more complex. We intended to do so at some point, but I can't give guarantees as to when that will land.

Thanks for your reply. Glad you are willing to accept this feature; I will continue to monitor the progress.

sjeaugey commented 8 months ago

We have just pushed a branch allowing for easier GID selection. Please check https://github.com/NVIDIA/nccl/commit/fba92421939a343cb39c6c485eb1044b0a691800.

Kyrie336 commented 7 months ago

We have just pushed a branch allowing for easier GID selection. Please check fba9242.

@sjeaugey Hi, does this commit add the ability to automatically select GIDs or the ability to select multiple GIDs, or both?

sjeaugey commented 7 months ago

The commit allows NCCL to automatically pick the GID index which is e.g. RoCEv2 + IPv4. That GID index can be different on each node.
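For readers following along, a hedged sketch of the general idea behind such automatic selection (not the code in the commit): scan the port's GID table through the standard Linux sysfs layout and pick the first entry that is RoCE v2 with an IPv4-mapped GID.

```cpp
// Hedged sketch, not the actual commit: pick the first GID index on a port
// whose type is "RoCE v2" and whose GID is IPv4-mapped, using the standard
// Linux sysfs layout for RDMA devices.
#include <fstream>
#include <string>

// Read one line from a sysfs file; returns "" on failure.
static std::string readSysfs(const std::string& path) {
  std::ifstream f(path);
  std::string line;
  std::getline(f, line);
  return line;
}

// Returns the first GID index on dev/port that is RoCE v2 + IPv4-mapped, or -1.
static int pickRoceV2Ipv4Gid(const std::string& dev, int port, int tableLen = 16) {
  const std::string base =
      "/sys/class/infiniband/" + dev + "/ports/" + std::to_string(port);
  for (int idx = 0; idx < tableLen; idx++) {
    std::string type = readSysfs(base + "/gid_attrs/types/" + std::to_string(idx));
    std::string gid  = readSysfs(base + "/gids/" + std::to_string(idx));
    // IPv4-mapped GIDs start with 0000:...:ffff: in the sysfs hex form
    // (see the table below for examples).
    bool ipv4 = gid.rfind("0000:0000:0000:0000:0000:ffff:", 0) == 0;
    if (type.find("RoCE v2") != std::string::npos && ipv4) return idx;
  }
  return -1;  // fall back to NCCL_IB_GID_INDEX or index 0
}
```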

Kyrie336 commented 7 months ago

The commit allows NCCL to automatically pick the GID index which is e.g. RoCEv2 + IPv4. That GID index can be different on each node.

Does that mean only one GID index can still be selected, but it is detected automatically? For example, if you have two IB NICs with different GID indexes, one is 3 and one is 6, the automatic detection result contains only one GID index for each. Do I understand that right?


```
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_0  1       3       0000:0000:0000:0000:0000:ffff:c0a8:c848 192.168.200.72  v2      ens2np0
mlx5_1  1       6       0000:0000:0000:0000:0000:ffff:c0a8:0014 192.168.0.20    v2      ens1np0
```

sjeaugey commented 7 months ago

Yes, however IIRC you can also select the subnet you want to use if you need further filtering of interfaces.
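As an illustration of what subnet filtering means here, the following hedged sketch (not NCCL code) keeps only addresses that fall inside a given CIDR range; inCidr is a hypothetical helper, and the environment variable exposing this in recent NCCL releases appears to be NCCL_IB_ADDR_RANGE (that name is an assumption, so check the NCCL documentation).

```cpp
// Hedged sketch of subnet-based filtering: keep only GIDs whose embedded IPv4
// address falls inside a given CIDR range. inCidr() is a hypothetical helper.
#include <arpa/inet.h>
#include <cstdint>
#include <cstdio>
#include <string>

// True if ipv4 (dotted quad) lies inside cidr like "192.168.0.0/24".
static bool inCidr(const std::string& ipv4, const std::string& cidr) {
  size_t slash = cidr.find('/');
  if (slash == std::string::npos) return false;
  struct in_addr ip{}, net{};
  if (inet_pton(AF_INET, ipv4.c_str(), &ip) != 1) return false;
  if (inet_pton(AF_INET, cidr.substr(0, slash).c_str(), &net) != 1) return false;
  int bits = std::stoi(cidr.substr(slash + 1));
  uint32_t mask = bits == 0 ? 0 : htonl(~uint32_t(0) << (32 - bits));
  return (ip.s_addr & mask) == (net.s_addr & mask);
}

int main() {
  // The two interfaces in the table above live in different subnets, so a
  // range of 192.168.0.0/24 would keep mlx5_1's GID and drop mlx5_0's.
  std::printf("%d\n", inCidr("192.168.0.20", "192.168.0.0/24"));   // 1
  std::printf("%d\n", inCidr("192.168.200.72", "192.168.0.0/24")); // 0
  return 0;
}
```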

Kyrie336 commented 7 months ago

Yes, however IIRC you can also select the subnet you want to use if you need further filtering of interfaces.

I see. Thanks for your reply.

limu713 commented 1 month ago

We have the same problem and solved it by modifying net_ib.cc.

https://github.com/NVIDIA/nccl/pull/1427/commits/dc445aa5dacf55e16db7bd82585f454ad4cb85c6
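For reference, the general shape of such a per-HCA override, hedged and not taken from the linked PR: read NCCL_IB_GID_INDEX as a comma-separated list and map it onto the selected HCAs in the order they were chosen; the function name below is hypothetical.

```cpp
// Hedged sketch, not the linked PR: map a comma-separated NCCL_IB_GID_INDEX
// (e.g. "3,6") onto HCAs by their selection order.
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Returns the GID index for the devIndex-th selected HCA, or fallback if unset.
static int gidIndexForDev(int devIndex, int fallback = 0) {
  const char* env = std::getenv("NCCL_IB_GID_INDEX");
  if (!env) return fallback;
  std::vector<int> indexes;
  std::stringstream ss(env);
  std::string tok;
  while (std::getline(ss, tok, ',')) indexes.push_back(std::atoi(tok.c_str()));
  if (indexes.empty()) return fallback;
  // If fewer values than devices were given, reuse the last one.
  return devIndex < (int)indexes.size() ? indexes[devIndex] : indexes.back();
}
```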