aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as the network provider while running NCCL applications.
Apache License 2.0

Shrink control message to 32 bytes #437

Closed bwbarrett closed 1 week ago

bwbarrett commented 3 weeks ago

EFA's inline size is 32 bytes. While there's a bunch of work required in Libfabric before all 32 bytes of inline data can be used for the app buffer (i.e., the control message), this is what I think is the sanest way to get our header down to 32 bytes.
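To make the 32-byte target concrete, here is a minimal sketch of a header that fits exactly. It is not the layout from this PR: the struct name, the field names other than `buff_mr_key`, the field widths, and the rail count of 4 are all assumptions chosen so the arithmetic works out (2 + 2 + 4 + 8 + 4 × 4 = 32).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rail count: four 32-bit keys fill the remaining 16 bytes. */
#define SKETCH_MAX_RAILS 4

/* Illustrative layout only; every field name and width here is an assumption. */
struct ctrl_msg_sketch {
    uint16_t type_and_flags;                /* message type packed with other bits */
    uint16_t msg_seq_num;                   /* message sequence number */
    uint32_t buff_len;                      /* length of the advertised buffer */
    uint64_t buff_addr;                     /* address of the advertised buffer */
    uint32_t buff_mr_key[SKETCH_MAX_RAILS]; /* one truncated MR key per rail */
};

/* Fail the build if the header ever stops matching the 32-byte inline size. */
static_assert(sizeof(struct ctrl_msg_sketch) == 32,
              "control message must be exactly 32 bytes");
```

A compile-time check like this keeps a later field addition from silently pushing the message past the inline limit.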

This is still a draft. Ideally, to support other networks, the control message would end in a flexible array sized according to the MR key size reported by the provider. But I wanted to get general sign-off on the approach first.
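The flexible-array idea could look roughly like the sketch below. The struct, its fixed fields, and the helper `flex_ctrl_msg_size()` are hypothetical. The only libfabric detail assumed is that the provider reports a key size in bytes (the `mr_key_size` field of `struct fi_domain_attr`), which the caller would pass in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variable-length control message; fixed fields are assumptions. */
struct flex_ctrl_msg {
    uint16_t type_and_flags;
    uint16_t msg_seq_num;
    uint32_t buff_len;
    uint64_t buff_addr;
    uint8_t  mr_keys[];  /* num_rails keys, each mr_key_size bytes wide */
};

/* Bytes needed on the wire for a given provider key size and rail count. */
static size_t flex_ctrl_msg_size(size_t mr_key_size, size_t num_rails)
{
    return offsetof(struct flex_ctrl_msg, mr_keys) + mr_key_size * num_rails;
}
```

With 4-byte keys and 4 rails this comes to 32 bytes. A provider reporting 8-byte keys would need 48 bytes for the same rail count, so it would not fit in a 32-byte inline send.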

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

bwbarrett commented 3 weeks ago

I don't love the type field handling, but I don't have a more awesome idea. We can't burn 8 bits on the type and still hope to have everything fit. Ideas welcome.
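For readers weighing alternatives, one common way to spend fewer than 8 bits on a type is to pack it into the low bits of a wider field. The sketch below is not the handling used in this PR. The 4-bit width, the 12-bit companion `id` field, and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of one 16-bit word: 4 bits of type, 12 bits of id. */
#define SKETCH_TYPE_BITS 4u
#define SKETCH_TYPE_MASK ((1u << SKETCH_TYPE_BITS) - 1u)
#define SKETCH_ID_MAX    (0xFFFFu >> SKETCH_TYPE_BITS)

/* Pack type into the low bits and id into the high bits. */
static inline uint16_t sketch_pack(uint16_t type, uint16_t id)
{
    assert(type <= SKETCH_TYPE_MASK);
    assert(id <= SKETCH_ID_MAX);
    return (uint16_t)(((unsigned)id << SKETCH_TYPE_BITS) | type);
}

/* Recover the type from the low bits. */
static inline uint16_t sketch_type(uint16_t packed)
{
    return (uint16_t)(packed & SKETCH_TYPE_MASK);
}

/* Recover the id from the high bits. */
static inline uint16_t sketch_id(uint16_t packed)
{
    return (uint16_t)(packed >> SKETCH_TYPE_BITS);
}
```

The trade-off is that every reader and writer of the field must go through the accessors, which is presumably part of what makes this kind of handling unattractive.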

AmedeoSapio commented 3 weeks ago

bot:aws:retest

liralon commented 3 weeks ago

Actually, here is one more thing you should add: change the call site of fi_mr_key() (in insert_send_ctrl_req()) to assert that the returned mr_key is <= UINT32_MAX before truncating it into buff_mr_key[i].
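A minimal sketch of that check, assuming only that fi_mr_key() returns a uint64_t, as it does in libfabric. The helper name `mr_key_to_u32()` is hypothetical, and the real change would sit inline in insert_send_ctrl_req() rather than in a separate function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: narrow a 64-bit MR key to 32 bits, asserting first
 * that no significant bits would be lost by the truncation.
 */
static inline uint32_t mr_key_to_u32(uint64_t mr_key)
{
    assert(mr_key <= UINT32_MAX);
    return (uint32_t)mr_key;
}
```

At the call site this would read something like `buff_mr_key[i] = mr_key_to_u32(fi_mr_key(mr_handle));`, where `mr_handle` stands in for whatever registration handle the function already has. Because assert() compiles out under NDEBUG, a release build would still truncate silently unless the check is made unconditional.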