aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0

Use the zero-copy path in the EFA provider for the RDMA protocol #457

Closed AmedeoSapio closed 3 months ago

AmedeoSapio commented 3 months ago

To use the zero-copy path in the EFA provider for fi_send/fi_recv operations, we need:

  1. to have no ordering requirements (already satisfied)
  2. to not use tagged operations (already satisfied)
  3. to set the FI_OPT_MAX_MSG_SIZE endpoint option to a value smaller than the MTU. The largest messages we exchange with fi_send/fi_recv are eager messages, so they must be smaller than the MTU; the current default eager threshold of 8 KB satisfies this.

This commit sets the FI_OPT_MAX_MSG_SIZE option to the maximum size we use for fi_send/fi_recv operations. It does so only on AWS platforms, since other providers do not need it.
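As a rough illustration of the three conditions listed above, the check can be sketched in plain C. This is not the plugin's code: `struct ep_usage` and `zero_copy_eligible` are hypothetical names made up for this example, and the MTU is passed in as a parameter rather than queried from the provider. The comment at the end shows how condition 3 might be applied with libfabric's `fi_setopt`, assuming an already-opened endpoint `ep` and a libfabric version that defines `FI_OPT_MAX_MSG_SIZE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of how an endpoint is used; not a libfabric type. */
struct ep_usage {
    bool   needs_ordering;   /* condition 1: send/recv ordering is required */
    bool   uses_tagged_ops;  /* condition 2: tagged operations are in use */
    size_t max_msg_size;     /* condition 3: value given to FI_OPT_MAX_MSG_SIZE */
};

/* Returns true only when all three conditions hold for the given MTU. */
bool zero_copy_eligible(const struct ep_usage *u, size_t mtu)
{
    assert(u != NULL);
    return !u->needs_ordering
        && !u->uses_tagged_ops
        && u->max_msg_size < mtu;
}

/*
 * Assumed shape of the real call for condition 3 (requires libfabric
 * headers and an opened endpoint `ep`, so it is left as a comment):
 *
 *   size_t max_size = 8192;  // e.g. the eager threshold
 *   int rc = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MAX_MSG_SIZE,
 *                      &max_size, sizeof(max_size));
 */
```

With the default 8 KB eager threshold as `max_msg_size`, this check passes for any MTU larger than 8192 bytes, provided no ordering or tagged operations are requested.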

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

AmedeoSapio commented 3 months ago

bot:aws:retest
