aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

tuner: prefer NVLSTREE on 16 nodes at 4GB #424

Closed by AmedeoSapio 1 month ago

AmedeoSapio commented 1 month ago

Our tests showed that the tuner currently makes the wrong decision at 4GB on 16 P5 nodes, which caused a regression. This is a temporary workaround that forces NVLSTREE for that case while we work to make the model more accurate.
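
For readers unfamiliar with this kind of workaround, here is a minimal sketch of the idea in C. It is not the code in this PR: `algo_t`, `model_choice`, `choose_algo`, and `FOUR_GIB` are hypothetical names, the cost model is a stub, and the sketch assumes "4GB" means exactly 4 GiB and that the override matches only that exact size. The real change may use a range or different units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical algorithm identifiers for illustration only;
 * these are not the NCCL enum values. */
typedef enum { ALGO_TREE, ALGO_RING, ALGO_NVLS, ALGO_NVLS_TREE } algo_t;

/* Assumption: "4GB" is interpreted as 4 GiB. */
#define FOUR_GIB (4ULL * 1024 * 1024 * 1024)

/* Stub standing in for the tuner's cost model. It always returns
 * ALGO_RING here so the effect of the override is easy to see. */
static algo_t model_choice(size_t nnodes, uint64_t nbytes)
{
    (void)nnodes;
    (void)nbytes;
    return ALGO_RING;
}

/* Check the hard-coded special case first, then fall back to the model. */
static algo_t choose_algo(size_t nnodes, uint64_t nbytes)
{
    /* Temporary override: force NVLSTREE for the known-bad case. */
    if (nnodes == 16 && nbytes == FOUR_GIB)
        return ALGO_NVLS_TREE;

    return model_choice(nnodes, nbytes);
}
```

With these definitions, `choose_algo(16, FOUR_GIB)` returns `ALGO_NVLS_TREE`, while any other node count or message size falls through to the model. Keeping the override as a separate early return makes it easy to delete once the model is fixed.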

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.