Open rajachan opened 2 weeks ago
After some comprehensive testing, we should revert d9c416f to go along with this.
Moving this to draft until Raghu posts his update with the refactoring.
Updated with latest. Unit tests are complete and passing. The interface is still WIP.
With the user buffer registration capability, when a network plugin reports support for regIsGlobal, NCCL maintains a cache of registration handles (originally registered with a loopback communicator). At send time, it still calls the regMr hook of the network plugin for the communicator that will actually carry the data transfer, in case the net plugin requires communicator-specific state for the registration. With the regIsGlobal guarantee, NCCL could reuse the handle it has cached, but it does not do that today. This commit introduces an MR cache that is similar in design to NCCL's internal cache: a linear search looks for an existing registration that fully covers the list of pages of the buffer in question, which avoids redundant (and expensive) registrations with the underlying device.
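For reviewers who want a concrete picture of the lookup, the following is a minimal sketch of a page-based cache with a linear search. It is not the code in this commit: the names (`mr_cache_t`, `mr_cache_lookup`, `mr_cache_insert`), the fixed capacity, the 4 KiB page size, and the reference counting are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MR_CACHE_CAPACITY 16    /* hypothetical fixed capacity */
#define MR_PAGE_SIZE      4096u /* assumed page size */

/* One cached registration, described by the pages it covers. */
typedef struct {
    uintptr_t first_page; /* index of the first registered page */
    size_t    num_pages;  /* number of contiguous registered pages */
    void     *handle;     /* opaque registration handle (hypothetical) */
    int       refcnt;     /* number of users sharing this handle */
} mr_cache_entry_t;

typedef struct {
    mr_cache_entry_t entries[MR_CACHE_CAPACITY];
    size_t used;
} mr_cache_t;

/* Convert a (buffer, length) pair into the page range it touches. */
static void mr_page_span(const void *buf, size_t len,
                         uintptr_t *first_page, size_t *num_pages)
{
    uintptr_t start = (uintptr_t)buf;
    uintptr_t last;

    assert(len > 0);
    *first_page = start / MR_PAGE_SIZE;
    last = (start + len - 1) / MR_PAGE_SIZE;
    *num_pages = (size_t)(last - *first_page) + 1;
}

/*
 * Linear search for an entry whose page range fully covers the buffer.
 * On a hit, bump the reference count and return the cached handle.
 * On a miss, return NULL so the caller can register with the device.
 */
static void *mr_cache_lookup(mr_cache_t *cache, const void *buf, size_t len)
{
    uintptr_t first;
    size_t n, i;

    mr_page_span(buf, len, &first, &n);
    for (i = 0; i < cache->used; i++) {
        mr_cache_entry_t *e = &cache->entries[i];
        if (e->first_page <= first &&
            first + n <= e->first_page + e->num_pages) {
            e->refcnt++;
            return e->handle;
        }
    }
    return NULL;
}

/* Record a fresh registration. Returns 0 on success, -1 if the cache is full. */
static int mr_cache_insert(mr_cache_t *cache, const void *buf, size_t len,
                           void *handle)
{
    mr_cache_entry_t *e;

    if (cache->used == MR_CACHE_CAPACITY)
        return -1;
    e = &cache->entries[cache->used++];
    mr_page_span(buf, len, &e->first_page, &e->num_pages);
    e->handle = handle;
    e->refcnt = 1;
    return 0;
}
```

In this sketch, a registration path would call `mr_cache_lookup` first and only fall back to the device registration followed by `mr_cache_insert` on a miss. A linear scan is a reasonable choice when the number of live registrations is small; eviction and deregistration once `refcnt` drops to zero are left out for brevity.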
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.