aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0
129 stars, 51 forks

Introduce a memory registration cache for the net plugin #447

Open rajachan opened 2 weeks ago

rajachan commented 2 weeks ago

With the user buffer registration capability, when a network plugin reports support for regIsGlobal, NCCL maintains a cache of registration handles (originally registered with a loopback communicator). At send time, it still calls the network plugin's regMr hook for the actual communicator that will carry the data transfer, in case the plugin requires communicator-specific state for the registration. Given the regIsGlobal guarantee, NCCL could reuse the handle it has cached, but it does not do so today. This commit introduces an MR cache that is similar in design to NCCL's internal cache (a linear search for a registration that fully covers the buffer's list of pages) to avoid redundant and expensive registrations with the underlying device.
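For readers who have not seen this kind of cache, here is a minimal sketch of the idea: a linear search for a cached registration whose page range fully covers the requested buffer, with registration performed only on a miss. It is not the code in this PR. Every name in it (`mr_cache`, `mr_cache_lookup_or_register`, `mr_cache_release`, `fake_register`) is hypothetical, the 4 KiB page size and the fixed capacity are assumptions, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MR_CACHE_PAGE_SIZE   4096u  /* assumed page size for this sketch */
#define MR_CACHE_MAX_ENTRIES 64     /* assumed fixed capacity */

/* Hypothetical callback standing in for the expensive device registration. */
typedef void *(*mr_register_fn)(uintptr_t base, size_t pages);

typedef struct {
    uintptr_t base;   /* page-aligned start address */
    size_t    pages;  /* number of pages covered */
    void     *handle; /* opaque registration handle */
    int       refcnt; /* callers currently holding this handle */
} mr_cache_entry;

typedef struct {
    mr_cache_entry entries[MR_CACHE_MAX_ENTRIES];
    size_t used;
    size_t hits, misses;
} mr_cache;

/* Return a handle covering [addr, addr + size); register only on a miss. */
static void *mr_cache_lookup_or_register(mr_cache *c, const void *addr,
                                         size_t size, mr_register_fn reg)
{
    const uintptr_t mask = ~(uintptr_t)(MR_CACHE_PAGE_SIZE - 1);
    assert(size > 0);
    uintptr_t start = (uintptr_t)addr & mask;
    uintptr_t end = ((uintptr_t)addr + size + MR_CACHE_PAGE_SIZE - 1) & mask;
    size_t pages = (end - start) / MR_CACHE_PAGE_SIZE;

    /* Linear search for an entry that fully covers the page range. */
    for (size_t i = 0; i < c->used; i++) {
        mr_cache_entry *e = &c->entries[i];
        uintptr_t e_end = e->base + e->pages * MR_CACHE_PAGE_SIZE;
        if (e->base <= start && end <= e_end) {
            e->refcnt++;
            c->hits++;
            return e->handle;
        }
    }

    /* Miss: register with the device and remember the result. */
    if (c->used == MR_CACHE_MAX_ENTRIES)
        return NULL;
    void *h = reg(start, pages);
    if (h == NULL)
        return NULL;
    c->entries[c->used++] = (mr_cache_entry){ start, pages, h, 1 };
    c->misses++;
    return h;
}

/* Drop one reference. Returns 1 when the last reference is gone and the
 * caller should deregister the handle, 0 if it is still in use, and -1 if
 * the handle is not in the cache. */
static int mr_cache_release(mr_cache *c, void *handle)
{
    for (size_t i = 0; i < c->used; i++) {
        if (c->entries[i].handle == handle) {
            if (--c->entries[i].refcnt > 0)
                return 0;
            c->entries[i] = c->entries[--c->used]; /* swap-remove */
            return 1;
        }
    }
    return -1;
}

/* Stand-in registration used only to exercise the sketch. */
static int fake_reg_calls;
static char fake_tokens[MR_CACHE_MAX_ENTRIES];
static void *fake_register(uintptr_t base, size_t pages)
{
    (void)base;
    (void)pages;
    return &fake_tokens[fake_reg_calls++ % MR_CACHE_MAX_ENTRIES];
}
```

In this sketch, a second lookup for a sub-range of an already registered buffer returns the same handle without calling the registration callback again, which is the redundant work the cache is meant to avoid. Reference counting keeps the entry alive until every user has released it.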

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rajachan commented 2 weeks ago

After some comprehensive testing, we should revert d9c416f to go along with this.

bwbarrett commented 2 weeks ago

Moving this to draft until Raghu posts his update with the refactoring.

rauteric commented 3 days ago

Updated with latest. Unit tests are complete and passing. The interface is still WIP.