aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0
140 stars 54 forks source link

Three patches in support of upcoming release #442

Closed rauteric closed 3 months ago

rauteric commented 3 months ago
  1. Set regIsGlobal = 0 always until we resolve perf (see comment in commit)
  2. Test recv size<send size (in functional test) only if regIsGlobal=1
  3. sendrecv: (fix) do not use cq_entry.len in send completions

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

liralon commented 3 months ago

I find your commit message not sufficiently accurate and detailed (E.g. The regMr() is called for each proxy-thread communication task initialization, not for every communication transfer with the task). I think the following is more accurate (Please change commit message and TODO comment in code accordingly):

When net-plugin returns regIsGlobal=1 to NCCL (As part of net-plugin getProperties() API), it signals to NCCL that registered MRs are global, in the sense that they can be used by all communicators. In addition, it also signals to NCCL that the net-plugin have a fast MR cache such that calling regMr() on same buffer (address and size), will quickly return a previously globally registered MR on same buffer.

When user registers a buffer with NCCL by using ncclCommRegister() API, if net-plugin supports regIsGlobal=1, NCCL will register the buffer globally once (On each net device) with regMr() API. When the net proxy-thread starts to execute a communication task on a previously registered user buffer, it will call the net-plugin regMr() to quickly fetch the previously globally registered MR from the plugin managed MR cache.

Even though aws-ofi-nccl registers MRs globally (As MRs registered on EFA are not specific to a communicator), aws-ofi-nccl doesn't have such a fast MR cache yet. Therefore, it should return regIsGlobal=0 for now.

rauteric commented 3 months ago

I updated the commit and code comments with Liran's text.

rauteric commented 3 months ago

Added two patches from Amedeo to this PR, and changed the PR title accordingly.

rauteric commented 3 months ago

Added a fix from Amedeo for the last commit.

rauteric commented 3 months ago

Rebased master to get functional tests for CI.