aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

rdma: support recv size < send size #436

Closed AmedeoSapio closed 3 months ago

AmedeoSapio commented 3 months ago

This adds support for the case where the size of the recv() is smaller than the size of the send(). For non-eager sends, when the ctrl msg arrives with a remote length smaller than the size passed to the send call, we reduce the size of the send request, and the test function for the send returns the smaller size. This implies that we delay creating the schedule until we have the control message, instead of creating it when the send is called.
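
The non-eager flow described above can be sketched roughly as follows. This is only an illustration of the size clamping and the deferred schedule, not the plugin's actual code: `struct send_req`, `send_post`, `send_on_ctrl`, and `send_test_size` are hypothetical names, and the `scheduled` flag stands in for building the real schedule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical send request; field names are illustrative only. */
struct send_req {
    size_t size;       /* bytes to transmit; starts as the send() size */
    bool   have_ctrl;  /* has the control message arrived? */
    bool   scheduled;  /* has the schedule been created? */
};

/* Called when the application posts a send. No schedule is created yet,
 * because the final transfer size is not known until the ctrl msg arrives. */
static void send_post(struct send_req *req, size_t send_size)
{
    req->size = send_size;
    req->have_ctrl = false;
    req->scheduled = false;
}

/* Called when the control message arrives, carrying the remote (recv) length.
 * The request is shrunk if the receiver posted a smaller buffer, and only
 * then is the schedule created over the final size. */
static void send_on_ctrl(struct send_req *req, size_t remote_len)
{
    assert(!req->have_ctrl);
    if (remote_len < req->size)
        req->size = remote_len;
    req->have_ctrl = true;
    req->scheduled = true;  /* stand-in for building the schedule */
}

/* Size that the test function would report for this send. */
static size_t send_test_size(const struct send_req *req)
{
    return req->size;
}
```

With this shape, a 4096-byte send matched by a 1024-byte recv reports 1024 bytes from the test call, while a send that is already smaller than the recv buffer keeps its original size.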

In eager mode, the send transmits the full amount of data passed with the send, but the recv side truncates the data before starting the copy from the bounce buffer to the user buffer. The ctrl message notifies the sender of the reduced size, so the sender learns of the truncation when calling test().
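
The receive-side truncation in eager mode might look something like the sketch below. It assumes host memory and uses `memcpy` as a stand-in for whatever copy mechanism the plugin really uses; `eager_recv_copy` is a hypothetical helper, and treating its return value as the length reported back to the sender is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy eagerly received data from the bounce buffer into the user buffer,
 * truncating to the size of the posted recv. Returns the number of bytes
 * actually copied, which in this sketch is the reduced size the sender
 * would be told about. */
static size_t eager_recv_copy(void *user_buf, size_t recv_size,
                              const void *bounce_buf, size_t eager_len)
{
    assert(user_buf != NULL && bounce_buf != NULL);

    /* Truncate before the copy starts, never after. */
    size_t copy_len = eager_len < recv_size ? eager_len : recv_size;

    memcpy(user_buf, bounce_buf, copy_len);
    return copy_len;
}
```

Clamping before the copy keeps the user buffer from being overrun even though the sender has already transmitted the full payload into the bounce buffer.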

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.