Closed rauteric closed 3 months ago
I find your commit message not sufficiently accurate and detailed (E.g. The regMr()
is called for each proxy-thread communication task initialization, not for every communication transfer with the task). I think the following is more accurate (Please change commit message and TODO comment in code accordingly):
When net-plugin returns regIsGlobal=1
to NCCL (As part of net-plugin getProperties()
API), it signals to NCCL that registered MRs are global, in the sense that they can be used by all communicators. In addition, it also signals to NCCL that the net-plugin have a fast MR cache such that calling regMr()
on same buffer (address and size), will quickly return a previously globally registered MR on same buffer.
When user registers a buffer with NCCL by using ncclCommRegister()
API, if net-plugin supports regIsGlobal=1
, NCCL will register the buffer globally once (On each net device) with regMr()
API. When the net proxy-thread starts to execute a communication task on a previously registered user buffer, it will call the net-plugin regMr()
to quickly fetch the previously globally registered MR from the plugin managed MR cache.
Even though aws-ofi-nccl registers MRs globally (As MRs registered on EFA are not specific to a communicator), aws-ofi-nccl doesn't have such a fast MR cache yet. Therefore, it should return regIsGlobal=0
for now.
I updated the commit and code comments with Liran's text.
Added two patches from Amedeo to this PR, and changed the PR title accordingly.
Added a fix from Amedeo for the last commit.
Rebased master to get functional tests for CI.
regIsGlobal = 0
always until we resolve perf (see comment in commit)By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.