aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as the network provider while running NCCL applications.
Apache License 2.0

tuner: better ring rank/msize binning #422

Closed: aws-nslick closed this 6 days ago

aws-nslick commented 1 month ago

The switchpoint between nvlstree and ring is determined by the ratio of the message size to the number of ranks. Previously, we just returned INFINITY when crossing this boundary. This change tries to make the cost estimate around that boundary more accurate.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
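
For readers unfamiliar with the tuner, here is a rough sketch of the idea described above. It is not the plugin's code. The threshold value, the function names, and the choice of which side of the boundary received INFINITY are all assumptions made for illustration, and the graded cost is just one possible way to replace a hard cutoff.

```python
import math

# Hypothetical switchpoint in bytes per rank. The real value used by the
# plugin is not stated in this thread; 1 MiB is an illustration only.
SWITCHPOINT_BYTES_PER_RANK = 1 << 20


def ring_cost_hard_cutoff(msg_bytes: int, num_ranks: int) -> float:
    """Hard cutoff: ring is either free (0.0) or forbidden (INFINITY).

    Assumption: ring is the algorithm that gets INFINITY below the
    switchpoint; the thread does not say which side is penalised.
    """
    ratio = msg_bytes / num_ranks
    return 0.0 if ratio >= SWITCHPOINT_BYTES_PER_RANK else math.inf


def ring_cost_graded(msg_bytes: int, num_ranks: int) -> float:
    """Illustrative alternative: a finite cost that grows with the
    distance below the switchpoint, so the caller can still compare
    ring against other candidates instead of ruling it out."""
    ratio = msg_bytes / num_ranks
    if ratio >= SWITCHPOINT_BYTES_PER_RANK:
        return 0.0
    # Normalised shortfall in (0, 1]; 1.0 means "far below the switchpoint".
    return (SWITCHPOINT_BYTES_PER_RANK - ratio) / SWITCHPOINT_BYTES_PER_RANK


if __name__ == "__main__":
    # 1 MiB message across 8 ranks: 128 KiB per rank, below the threshold.
    print(ring_cost_hard_cutoff(1 << 20, 8))
    print(ring_cost_graded(1 << 20, 8))
```

With a hard cutoff, every configuration below the boundary looks equally bad. A graded cost keeps the ordering between configurations, which is the kind of accuracy improvement the description refers to, though the actual binning in this PR may differ.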

a-szegel commented 1 month ago

bot:aws:retest

aws-nslick commented 1 month ago

bot:aws:retest

aws-nslick commented 1 month ago

bot:aws:retest

a-szegel commented 1 month ago

Please merge master.

aws-nslick commented 2 weeks ago

All requested changes have been made as fixup commits, so the cost/decision changes can be reviewed commit by commit. Once the threads are resolved, I will squash the fixups back into the original commit and we can proceed.

aws-nslick commented 1 week ago

bot:aws:retest