Open yanminjia opened 11 months ago
What NCCL version is this running on?
Also can you share the dump_graph.xml produced in both cases?
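For anyone else hitting this: a graph dump can be produced with the `NCCL_GRAPH_DUMP_FILE` environment variable. A minimal sketch (the output path, launcher, and nccl-tests binary are illustrative; adapt to your setup):

```shell
# Ask NCCL to write its computed search graph to a file.
export NCCL_GRAPH_DUMP_FILE=/tmp/dump_graph.xml
export NCCL_DEBUG=INFO   # optional, helps correlate the dump with the log

# Then run any NCCL workload, e.g. the nccl-tests ReduceScatter benchmark
# (binary path and process count are illustrative):
mpirun -np 16 ./build/reduce_scatter_perf -b 8 -e 1G -f 2 -g 1
```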
Many thanks for your prompt response. NCCL version: 2.19.3; graph dumps attached. Additionally, I suspect it could be caused by some kind of hardware issue. For example, if I reset the GPU driver, this problem might possibly go away. But I'm not sure how to check.
Are you sure the graph dumps are with and without PXN? They're identical (note that you may need to rename them to graph.txt for github to allow you to attach them).
The attached graph dump files were generated on the 2 servers (10.1.50.69 & 10.1.50.70) with PXN enabled. I will rename the graph files next time.
Oh, ok. Can you share the graph.xml with PXN disabled?
dump_graph_with_no_pxn_70.txt Here is the graph.xml with PXN disabled. Thanks.
Hi sjeaugey! I have one question: NCCL_PXN_DISABLE=1 means PXN is disabled, and NCCL_PXN_DISABLE=0 means PXN is enabled. Right? If I'm right, yanminjia seems to have these two mixed up.
Sorry for this confusion. I updated the description of this issue.
Ok, well the problem remains.
It looks like with PXN disabled, NCCL allows cross-NIC communication and can use all 16 interfaces (it was using only 15 before). Not sure why using 15 caused performance to be that low; it should just be a bit lower.
In any case, NCCL 2.19 using only 15/16 interfaces is a known issue on platforms which have 16x 200G ports (instead of say 8x 400G). This should hopefully improve on NCCL 2.20.
Instead of disabling PXN, setting NCCL_CROSS_NIC=1 may have the same effect while avoiding other side effects, e.g. for alltoall communication.
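To make the suggestion concrete, a sketch of running the benchmark with cross-NIC communication allowed and PXN left enabled (the launcher and binary path are illustrative):

```shell
# Allow rings/trees to enter and exit a node on different NICs,
# instead of disabling PXN outright.
export NCCL_CROSS_NIC=1
# NCCL_PXN_DISABLE is left at its default (0, i.e. PXN enabled).

# Re-run the same ReduceScatter benchmark to compare:
mpirun -np 16 ./build/reduce_scatter_perf -b 8 -e 1G -f 2 -g 1
```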
I would like to let you know that we have only identified this issue on these 2 servers so far. Therefore, I don't think it is caused by only 15 interfaces being used. As you said, it should be just a little bit lower.
It is good news that all 16 interfaces will be used on platforms with 16x 200G ports.
Anyway, I will try NCCL_CROSS_NIC=1 when the devices are available.
Many thanks, Sylvain. It does work by setting NCCL_CROSS_NIC=1.
Finally, I identified that a physical link connecting the two RNICs between these 2 servers was performing really poorly, as shown in the perftest output below. That should be the root cause of this problem. But I don't understand why NCCL_CROSS_NIC=1 could bypass this problematic link in this case. Thanks.
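For reference, a per-link bandwidth check of the kind mentioned above can be done with perftest's `ib_write_bw`; a sketch, where the device names and IPs are illustrative (adapt to your fabric):

```shell
# On one server (acting as the perftest server side), pick one RNIC:
ib_write_bw -d mlx5_0 --report_gbits

# On the peer server, target it over the corresponding link:
ib_write_bw -d mlx5_0 --report_gbits 10.1.50.69

# Repeat for each RNIC pair; any link reporting far below line rate
# (e.g. well under 200 Gb/s on a 200G port) is suspect.
```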
I am testing the performance of ReduceScatter with 2 servers, 8 GPUs per server. When PXN is enabled, the result is extremely bad, as the following graphs show.
But if PXN is disabled, it looks fine.
I'm not sure what goes wrong when PXN is enabled. Many thanks. @sjeaugey