Open rauteric opened 2 months ago
The CI failure(s) are real:
scatter_perf: nccl_ofi_rdma.c:2100: alloc_bounce_req: Assertion `NCCL_OFI_IS_PTR_ALIGNED(&bounce_fl_item->bounce_msg, BOUNCE_BUFFER_ALIGNMENT)' failed.
Need to fix up some asserts.
Force-push to 3494556 fixes the CI failure by having alignment on both (eager) bounce buffers and ctrl recv buffers.
This requirement will be relaxed when we refactor the data structure in the related #600.
Rebased on master
bot:aws:retest
bot:aws:retest
Rebase on master
Rebased on master
Bounce buffers for eager messages and control messages are different things. Have different structures with different names so that there's no reason to figure out what's what.
They are different things, but there is a lot of code that is shared between them (e.g., all of the buffer pool management code, posting receives). We should rename these structures to something more generic (like receive buffers instead of bounce buffers [which I wanted to do in a followup patch]), but I think duplicating all the buffer management code is overkill.
I'm also not sure I understand the desire to move the freelists from the endpoint into the per-rail structures, given that most of the data is really operations that will be per-endpoint.
I'm confused... isn't having the freelists per-rail exactly what you were advocating in https://github.com/aws/aws-ofi-nccl/pull/614#discussion_r1767810883?
I think it makes sense to have the freelists per-rail because posted recv buffer counts are maintained per-rail.
I think the problem with your patch is that it's way too much cognitive load to figrue out what a bounce buffer is. And the rest is just too confusing to review. Fix that.
Separate out bounce buffer freelists into a smaller-sized freelist for control messages and a larger size for data (eager) messages
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.