As noted in the code comment, this commit overrides the link_attribute so that we report the correct speed to NCCL. The override is a stopgap until the libfabric and EFA driver changes that dynamically query the speed of newer-generation devices are more widely deployed. On P5en, the reported speed is scaled up to 400Gbps for the aggregated NIC that we report to NCCL, so NCCL's topology discovery and graph generation work as expected.
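For reviewers who want the idea in isolation, here is a minimal sketch of the override logic, not the code in this commit. The names `speed_override`, `effective_device_speed`, and `aggregated_nic_speed` are hypothetical, and the per-device figure of 200Gbps and the two-rail grouping are assumptions chosen so that the aggregate comes out to the 400Gbps mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: a per-device speed override for one platform. */
struct speed_override {
    const char *platform;         /* instance type name */
    uint64_t    device_speed_bps; /* speed to use instead of the reported one */
};

/* Assumption for illustration: each P5en EFA device runs at 200 Gbps. */
static const struct speed_override overrides[] = {
    { "p5en.48xlarge", 200ULL * 1000 * 1000 * 1000 },
};

/* Return the override for a known platform, else the provider-reported speed. */
static uint64_t effective_device_speed(const char *platform, uint64_t reported_bps)
{
    for (size_t i = 0; i < sizeof(overrides) / sizeof(overrides[0]); i++) {
        if (strcmp(platform, overrides[i].platform) == 0)
            return overrides[i].device_speed_bps;
    }
    return reported_bps;
}

/* Scale the per-device speed by the number of rails grouped into one NIC. */
static uint64_t aggregated_nic_speed(const char *platform, uint64_t reported_bps,
                                     unsigned num_rails)
{
    assert(num_rails > 0);
    return effective_device_speed(platform, reported_bps) * num_rails;
}
```

Under these assumptions, a P5en NIC that aggregates two rails would be reported at 400Gbps regardless of what the provider returns, while platforms without a table entry keep the provider-reported value. Once the dynamic query is widely deployed, the table entry could simply be dropped.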
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.