NVIDIA / nccl-tests


alltoall_perf: each rank is only sending to half of the other ranks #224

Closed · russilwvong closed this issue 1 month ago

russilwvong commented 1 month ago

We're seeing rather mysterious behavior (using nccl 2.18.3-1+cuda12.1). We have two servers with four GPUs each, each GPU with one NIC. When we run an all-to-all test across all eight GPUs, what we expect to see is that each GPU sends 1/8 of the job size to each of the other 7 GPUs.
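
For reference, a rough sketch of how we launch the test (hostnames and message sizes here are illustrative, not our exact command):

    # two servers, four GPUs each, one rank per GPU
    mpirun -np 8 -H server1:4,server2:4 \
        -x NCCL_DEBUG=INFO \
        ./build/alltoall_perf -b 128M -e 1G -f 2 -g 1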

What we see instead when we look at the packets on the wire is that each GPU is only talking to three of the other GPUs, one on the same server and two on the other server. (The pattern is 1/4/5/8 and 2/3/6/7.)

We've tried various environment variables; none of them appears to have any effect on this pattern of communication.

We're currently digging into the source code to try to figure out how nccl decides what NIC to use when transferring data. Any hints would be welcome.

sjeaugey commented 1 month ago

That would be heavily dependent on the PCI topology of your systems. I can't comment without a precise description or an NCCL topology dump (NCCL_TOPO_DUMP_FILE=system.xml).
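
For example, when launching through mpirun, something along these lines (path is illustrative):

    # have NCCL write out the topology it detects on each node
    mpirun -np 8 -H server1:4,server2:4 \
        -x NCCL_TOPO_DUMP_FILE=/tmp/system.xml \
        ./build/alltoall_perf -b 128M -e 1G -g 1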

russilwvong commented 1 month ago

Thanks, Sylvain. Interesting. I've attached an NCCL topology dump: system.xml.txt

sjeaugey commented 1 month ago

Thanks. It seems the GPUs and NICs are attached directly to the CPU, so the GPU-NIC association isn't really direct. Also, because there is no direct PCI connection between NICs and GPUs, PXN wouldn't be used.

So I would expect each GPU to pick one NIC (or maybe the two that are local) and send its data to all the others using that NIC. I don't see how the alltoall could complete otherwise.
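
A quick way to double-check the PCI affinity on each node is the interconnect matrix from nvidia-smi:

    # prints the GPU (and, on most systems, NIC) interconnect matrix and NUMA affinities
    nvidia-smi topo -m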

russilwvong commented 1 month ago

Hmm. Okay, so a GPU picks one of two local NICs when sending outgoing data. Can I ask how it determines what receiving NIC to send to, in order to reach a destination GPU?

sjeaugey commented 1 month ago

When a GPU picks a NIC to receive from, it will get the handle of that NIC and pass it to the other side which will connect to it.

Are you using RoCE? If so, how did you configure the IP addresses on the different interfaces? Did you use one IP subnet per NIC or did you put all of them in the same subnet?

russilwvong commented 1 month ago

Yes, we're using RoCEv2. Each NIC has its own subnet - 32.0.1.2/24, 32.0.2.2/24, and so on.
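
For completeness, this is roughly how we check the addressing and routes (interface names vary, so this is just a sketch):

    # one /24 per RoCE interface
    ip -br addr show | grep '32\.0\.'
    # per-subnet routes the kernel will use
    ip route | grep '32\.0\.'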

Honestly, this discussion has already been quite illuminating for me. I had been assuming that each GPU would always use the same NIC to send and receive packets, and it sounds like that's not the case at all. The NCCL library chooses which NIC to use to send to a particular destination (and it looks like with multiple channels it may use multiple NICs?), and similarly chooses which NIC to use when receiving from a particular source.

So we may see this pattern of traffic - on the wire, half the NICs are talking to each other, and the other half are also talking to each other - if the source and destination GPUs are always picking source and destination NICs from the same half.


sjeaugey commented 1 month ago

Each NIC has its own subnet - 32.0.1.2/24, 32.0.2.2/24, and so on.

Thanks for confirming.

I had been assuming that each GPU would always use the same NIC to send and receive packets, and it sounds like that's not the case at all.

As a general design, a GPU will use all NICs which are the most local in the topology, and round-robin on them based on various factors. If two GPUs share two NICs, then each GPU should start with a different NIC, then round-robin.
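
If you want to see exactly which NIC each channel ends up on, the debug output will show it, e.g. (pass these via mpirun -x in a multi-node run):

    # log per-channel transport/NIC selection
    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=INIT,GRAPH,NET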

russilwvong commented 1 month ago

If two GPUs share two NICs, then each GPU should start with a different NIC, then round-robin.

Hmm. Okay, say GPUs A1 and A2 are sharing two NICs, and GPUs B1 and B2 are sharing two NICs.

Then when A1 sends data to B1 and then to B2, B1 and B2 will use two different NICs to receive the data. At the same time, A1 will use one NIC to send to B1. I guess it must then round-robin to its other NIC to send to B2.

We may then end up with a pattern where rank pairs which are both odd-numbered or both even-numbered (like A1 and B1) always talk to each other using half the NICs, and rank pairs where one side is odd and the other is even (like A1 and B2) always use the other half of the NICs.

But I guess there may be other factors causing round-robin which would break up the pattern, or everyone with a similar setup would see this pattern all the time.


russilwvong commented 1 month ago

Hmm. Actually, there must be something I'm missing. Say that we have four NICs on one server, with IP addresses 32.0.1.2, 32.0.2.2, 32.0.3.2, and 32.0.4.2, and with four GPUs, two of them (A1 and A2) sharing 32.0.1.2 and 32.0.2.2, and two of them (A3 and A4) sharing 32.0.3.2 and 32.0.4.2.

Similarly we have a second server with four GPUs B1 to B4, and with four NICs with IP addresses 32.0.5.2 to 32.0.8.2.

If we run an all-to-all collective with all eight GPUs, we see that 32.0.1.2 is sending to 32.0.4.2, and 32.0.2.2 is sending to 32.0.3.2. So A1 can send to A3 and A4 (which could happen using either 32.0.1.2 or 32.0.2.2). Similarly A2 can send to A3 and A4 (using either 32.0.2.2 or 32.0.1.2).

But how does A1 send to A2? We don't see any packets going from 32.0.1.2 to 32.0.2.2, or from 32.0.2.2 to 32.0.1.2.

We've disabled NVLink, P2P, and shared memory. But maybe there's something I've missed.
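
For reference, the NCCL side of that looks roughly like this (our exact settings may differ slightly):

    # force all traffic through the network transport
    export NCCL_P2P_DISABLE=1   # no direct GPU-to-GPU (NVLink/PCIe) transport
    export NCCL_SHM_DISABLE=1   # no shared-memory transport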

russilwvong commented 1 month ago

Going through the trace logs, it looks like A1 is sending to A2, sometimes with A1 sending on NIC 0 (NET/IB/0) and A2 receiving on NIC 1 (NET/IB/1), and sometimes with A1 sending on NIC 1 and A2 receiving on NIC 0 - but no packets appear on the wire.

lambda-server-1:13354:13387 [1] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
lambda-server-1:13354:13387 [1] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA
lambda-server-1:13354:13387 [1] NCCL INFO Channel 02/0 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
lambda-server-1:13354:13387 [1] NCCL INFO Channel 03/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA
lambda-server-1:13353:13422 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13353:13422 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA/Shared

The last four lines appear odd: for channel 02/1, it looks like A1 is sending via NIC 0 and A2 is receiving on the same NIC, NIC 0 (!). Same for channel 03/1.

lambda-server-1:13353:13422 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13353:13422 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA/Shared

sjeaugey commented 1 month ago

it looks like A1 is sending to A2, sometimes with A1 sending on NIC 0 (NET/IB/0) and A2 receiving on NIC 1 (NET/IB/1), and sometimes with A1 sending on NIC 1 and A2 receiving on NIC 0 - but no packets appear on the wire.

I'm not expert enough to comment on that. RoCE relies on the Linux kernel's routing table and ARP to know how to reach a destination. There could be optimizations/bugs which would end up with this kind of behavior. I don't know how to debug that though.
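
If you want to dig into that, checking what the kernel would do with a given peer address might be a starting point (addresses taken from your earlier description):

    # which local interface/route the kernel picks for a peer NIC
    ip route get 32.0.2.2
    # neighbour (ARP) entries for the RoCE interfaces
    ip neigh show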

it looks like A1 is sending via NIC 0 and A2 is receiving on the same NIC, NIC 0

Why is that odd? Round-robin may end up with the same NIC for both, which will just go through NIC loopback and not even reach the wire.
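
If you want to cross-check on your side, the HCA port counters give an RDMA-level view you can compare against what the packet capture sees on the wire (device/port names are illustrative, and I'm not certain how loopback traffic is accounted in them):

    # RDMA data counters for one HCA port (counted in 4-byte units)
    cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
    cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data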

russilwvong commented 1 month ago

Why is that odd? Round-robin may end up with the same NIC for both, which will just go through NIC loopback and not even reach the wire.

Very interesting, I hadn't realized this earlier. Thanks for taking the time to respond, this has been very illuminating.

Can I ask, what's the difference between Channel 00/0 and Channel 00/1? Is Channel 00/1 used for doing the actual data transfer?

We ran some tests with different versions of NCCL (2.18.1, 2.19.1, 2.20.3, 2.21.5). 2.18 is the only one which exhibits this behavior. For all the other versions, each rank talks to all other ranks in an eight-rank all-to-all collective.
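
For anyone who wants to reproduce this, we switch versions by rebuilding nccl-tests against each NCCL build and pointing the loader at it, roughly (paths illustrative):

    # build against a specific NCCL and make sure it is loaded at run time
    make NCCL_HOME=/opt/nccl-2.21.5 CUDA_HOME=/usr/local/cuda
    export LD_LIBRARY_PATH=/opt/nccl-2.21.5/lib:$LD_LIBRARY_PATH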

Comparing the log files for 2.18.1 and 2.21.5, using a four-rank collective on a single server (to cut down the amount of data to look at), and focusing only on Channel XX/1 logs:

So my guess at this point is

I don't suppose there's an option to tell NCCL to always assign a specific NIC to a specific GPU when sending or receiving?

sjeaugey commented 1 month ago

what's the difference between Channel 00/0 and Channel 00/1?

The second number is the connection index. connIndex 1 uses shared buffers and is used for send/recv operations, while connIndex 0 uses dedicated buffers and is used for Rings and Trees.

We ran some tests with different versions of NCCL (2.18.1, 2.19.1, 2.20.3, 2.21.5). 2.18 is the only one which exhibits this behavior.

Indeed, at some point we changed the channel selection logic to use NICs in a more efficient manner and improve the round-robin. I can't recall exactly which version did that, but it could have been 2.19.

I don't suppose there's an option to tell NCCL to always assign a specific NIC to a specific GPU when sending or receiving?

Not really. Unless you want to cook up a topology file which declares that each GPU only has one local NIC. But that can make it harder to close the rings, so it may have adverse consequences.
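
If you do want to experiment with that, one way is to start from the dumped system.xml, prune the NIC entries by hand, and feed the edited file back in, e.g.:

    # point NCCL at a hand-edited topology file (path illustrative)
    export NCCL_TOPO_FILE=/path/to/edited_system.xml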

russilwvong commented 1 month ago

The second number is the connection index. connIndex 1 uses shared buffers and is used for send/recv operations, while connIndex 0 uses dedicated buffers and is used for Rings and Trees.

Great, thanks for confirming. And of course, thank you for all your work on the nccl library!