aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

aws: Override libfabric link_attr for certain platforms #686

Closed rajachan closed 2 weeks ago

rajachan commented 2 weeks ago

As noted in the code comment, this commit overrides the libfabric link_attr on certain platforms so that we report the correct link speed to NCCL. The override is needed until the libfabric and EFA driver changes that dynamically query the speed for newer-generation devices are more widely deployed. For P5en, the speed is scaled up to 400 Gbps for the aggregated NIC that we report to NCCL, so NCCL's topology discovery and graph generation work as expected.
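
For readers who want a feel for the shape of the override, here is a minimal sketch and not the code in this PR. `struct link_attr_sketch` is a hypothetical stand-in for the `speed` field of libfabric's `struct fi_link_attr` (bits per second), and the table, function names, platform string, and the 200 Gbps per-device value are all illustrative assumptions, chosen so that two aggregated rails give the 400 Gbps figure mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the speed field of libfabric's
 * struct fi_link_attr. Speed is expressed in bits per second. */
struct link_attr_sketch {
    uint64_t speed;
};

/* Hypothetical per-platform override entry. */
struct platform_override {
    const char *platform;
    uint64_t speed_bps;
};

/* Assumption: 200 Gbps per device on P5en. The PR text only states
 * the 400 Gbps aggregated value, not the per-device figure. */
static const struct platform_override overrides[] = {
    { "p5en.48xlarge", UINT64_C(200000000000) },
};

/* Overwrite attr->speed if the platform has an override entry.
 * Returns 1 if an override was applied, 0 if attr was left untouched. */
static int override_link_speed(const char *platform,
                               struct link_attr_sketch *attr)
{
    assert(platform != NULL && attr != NULL);

    for (size_t i = 0; i < sizeof(overrides) / sizeof(overrides[0]); i++) {
        if (strcmp(platform, overrides[i].platform) == 0) {
            attr->speed = overrides[i].speed_bps;
            return 1;
        }
    }
    return 0;
}

/* Speed reported for an aggregated NIC made of num_rails devices,
 * assuming every rail has the same link speed. */
static uint64_t aggregated_speed(const struct link_attr_sketch *attr,
                                 size_t num_rails)
{
    assert(attr != NULL);
    return attr->speed * (uint64_t)num_rails;
}
```

With these assumptions, calling `override_link_speed("p5en.48xlarge", &attr)` followed by `aggregated_speed(&attr, 2)` yields 400000000000 bits per second, while a platform without a table entry keeps whatever speed the provider reported. Keeping the override in a lookup table would make it easy to delete entries once the dynamic query is widely deployed.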

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.