Open yanminjia opened 11 months ago
What NCCL version is this running on?
Also can you share the dump_graph.xml produced in both cases?
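For anyone else hitting this: a graph dump can be produced with the `NCCL_GRAPH_DUMP_FILE` environment variable. A minimal sketch (the output path, launcher, and nccl-tests binary are illustrative; adapt to your setup):

```shell
# Ask NCCL to write its computed search graph to a file.
export NCCL_GRAPH_DUMP_FILE=/tmp/dump_graph.xml
export NCCL_DEBUG=INFO   # optional, helps correlate the dump with the log

# Then run any NCCL workload, e.g. the nccl-tests ReduceScatter benchmark
# (binary path and process count are illustrative):
mpirun -np 16 ./build/reduce_scatter_perf -b 8 -e 1G -f 2 -g 1
```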
Many thanks for your prompt response. NCCL version: 2.19.3; graph dumps attached. Additionally, I suspect it could be caused by some kind of hardware issue. For example, if I reset the GPU driver, this problem might possibly go away. But I'm not sure how to check.
Are you sure the graph dumps are with and without PXN? They're identical (note that you may need to rename them to graph.txt for github to allow you to attach them).
The attached graph dump files were generated on the 2 servers (10.1.50.69 & 10.1.50.70) with PXN enabled. I will rename the graph files next time.
Oh, ok. Can you share the graph.xml with PXN disabled?
dump_graph_with_no_pxn_70.txt Here is the graph.xml with PXN disabled. Thanks.
Hi sjeaugey! I have one question: NCCL_PXN_DISABLE=1 means PXN is disabled, and NCCL_PXN_DISABLE=0 means PXN is enabled. Right? If I'm right, yanminjia seems to have these two mixed up.
Sorry for this confusion. I updated the description of this issue.
Ok, well the problem remains.
It looks like with PXN disabled, NCCL allows cross-NIC communication and can use all 16 interfaces (it was using only 15 before). Not sure why using 15 caused performance to be that low; it should just be a bit lower.
In any case, NCCL 2.19 using only 15/16 interfaces is a known issue on platforms which have 16x 200G ports (instead of say 8x 400G). This should hopefully improve on NCCL 2.20.
Instead of disabling PXN, setting NCCL_CROSS_NIC=1 may have the same effect while avoiding other side effects, e.g. for alltoall communication.
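To make the suggestion concrete, a sketch of running the benchmark with cross-NIC communication allowed and PXN left enabled (the launcher and binary path are illustrative):

```shell
# Allow rings/trees to enter and exit a node on different NICs,
# instead of disabling PXN outright.
export NCCL_CROSS_NIC=1
# NCCL_PXN_DISABLE is left at its default (0, i.e. PXN enabled).

# Re-run the same ReduceScatter benchmark to compare:
mpirun -np 16 ./build/reduce_scatter_perf -b 8 -e 1G -f 2 -g 1
```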
I would like to let you know that we have only identified this issue on these 2 servers so far. Therefore, I don't think it is caused by only 15 interfaces being used. As you said, it should be just a little bit lower.
It is good news that all 16 interfaces will be used on platforms with 16x 200G ports.
Anyway, I will try NCCL_CROSS_NIC=1 when the devices are available.
Many thanks, Sylvain. It does work by setting NCCL_CROSS_NIC=1.
Finally, I identified that a physical link connecting the two RNICs between these 2 servers was performing really poorly, as shown in the perftest output below. That should be the root cause of this problem. But I don't understand why NCCL_CROSS_NIC=1 could bypass this problematic link in this case. Thanks.
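For reference, a per-link bandwidth check of the kind mentioned above can be done with perftest's `ib_write_bw`; a sketch, where the device names and IPs are illustrative (adapt to your fabric):

```shell
# On one server (acting as the perftest server side), pick one RNIC:
ib_write_bw -d mlx5_0 --report_gbits

# On the peer server, target it over the corresponding link:
ib_write_bw -d mlx5_0 --report_gbits 10.1.50.69

# Repeat for each RNIC pair; any link reporting far below line rate
# (e.g. well under 200 Gb/s on a 200G port) is suspect.
```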
I am testing the performance of ReduceScatter with 2 servers, 8 GPUs per server. When PXN is enabled, the result is extremely bad, as the following graphs show.
But if PXN is disabled, it looks fine.
I'm not sure what goes wrong when PXN is enabled. Many thanks. @sjeaugey