SeyedMir opened this issue 3 weeks ago
FYI @streichler @eddy16112
It's not clear to me, based on the ActiveMessage class, whose responsibility it is to fragment a large payload. I've hit this issue with UCX in a bunch of different ways in the past and mostly just tried to make sure I didn't send large payloads, but that's not really a scalable solution.
I think this is a variation of #1229 (although the description here is a lot clearer than the one in that issue). Anyway, it has been a known issue for a long time, and @streichler suggested some potential directions in the other thread.
If we need to send a large message, what should we do? Manually split the message? I think we have other active messages that are not protected by recommended_max_payload.
Quoting from https://github.com/StanfordLegion/legion/issues/1229#issuecomment-1086280169:
... I need to either:
1. break microop requests into potentially multiple packets (gross), or
2. break a request like this into multiple microops (good for parallelism, but duplicates a bit of effort like reading the pointer fields)
One concern with approach (2) is that on some systems, the max medium message is a lot smaller than 64K (e.g. 4K) and that might be a LOT more microops.
From https://github.com/StanfordLegion/legion/issues/1229#issuecomment-1500721477:
... I'm pretty sure (2) is the right answer, and we can look at the performance with more-but-smaller microops to confirm it's not a big deal.
Seems like it addresses this use case?
If we need to send a large message, what should we do? Manually split the message? I think we have other active messages that are not protected by recommended_max_payload.
If you're asking the network module to provide a payload buffer for that large message, then yes, it needs to be split into smaller chunks. The UCX backend cannot provide arbitrarily large payload buffers.
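For what it's worth, a caller-side split would look roughly like the sketch below. It assumes the ActiveMessage constructor that takes a target node and a maximum payload size, plus the Network::recommended_max_payload query; LargeDataMessage and send_fragmented are made-up names for illustration, not existing Realm code, and the header paths are where I believe these declarations live.

```cpp
#include <algorithm>
#include <cstddef>
#include "realm/activemsg.h" // assumed header for Realm::ActiveMessage
#include "realm/network.h"   // assumed header for Realm::Network

// Hypothetical message header; a real sender would register its own type
// with Realm's active-message machinery.
struct LargeDataMessage {
  size_t offset;      // where this fragment lands at the receiver
  size_t total_bytes; // total transfer size, so the receiver knows when it is complete
};

// Sketch: never hand the network module more than recommended_max_payload
// bytes at once; send the data as a sequence of smaller active messages.
void send_fragmented(Realm::NodeID target, const char *data, size_t total_bytes)
{
  const size_t max_frag =
      Realm::Network::recommended_max_payload(target, /*with_congestion=*/true,
                                              sizeof(LargeDataMessage));
  size_t offset = 0;
  while(offset < total_bytes) {
    const size_t frag = std::min(max_frag, total_bytes - offset);
    Realm::ActiveMessage<LargeDataMessage> amsg(target, frag);
    amsg->offset = offset;
    amsg->total_bytes = total_bytes;
    amsg.add_payload(data + offset, frag);
    amsg.commit();
    offset += frag;
  }
}
```

The receiver-side handler that reassembles fragments using offset/total_bytes is omitted here; the point is only that each commit stays under the recommended limit.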
It probably depends on the use case, but I think we should use recommended_max_payload and try to fragment and pipeline things in Realm where we can, as for example in the context of dependent partitioning image/preimage operations. On the other hand, just thinking out loud: if we need to deliver a very large payload (one that exceeds the recommended max) to multiple peers, we could go down the path of collectives and rely on, say, UCC to do the right thing (for instance scatter + allgather).
Whenever Realm asks the network module to provide a payload buffer, the corresponding size must be lower than the threshold that the network module recommends (through the recommended_max_payload API). However, this requirement is ignored in https://github.com/StanfordLegion/legion/blob/c61071541218747e35767317f6f89b83f374f264/runtime/realm/transfer/ib_memory.cc#L678, which ultimately leads to an assertion failure in the UCX backend. Here is the backtrace:

Until frame 16, msglen is valid (8112). But in frame 15 it is increased to 8240, which is invalid (the maximum valid size is 8192).
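As a rough illustration of the kind of guard that would avoid this, the code that builds the outgoing message could cap the running payload size at whatever the network module recommends and spill the remainder into a follow-on message. The snippet below is a hedged sketch, not the actual ib_memory.cc code: Entry, begin_message, append_entry, and commit_message are hypothetical placeholders, and only Network::recommended_max_payload is meant to correspond to the real Realm API.

```cpp
#include <cstddef>
#include <vector>
#include "realm/network.h" // assumed header for Realm::Network::recommended_max_payload

// Hypothetical description of one item to be packed into the message.
struct Entry {
  const void *data;
  size_t bytes;
};

// Sketch: pack entries into messages whose payload never exceeds the
// recommended maximum. The three callables stand in for whatever the real
// code does to start, fill, and send each intermediate-buffer request.
template <typename BeginFn, typename AppendFn, typename CommitFn>
void pack_entries(Realm::NodeID target, const std::vector<Entry> &entries,
                  size_t header_bytes,
                  BeginFn begin_message, AppendFn append_entry, CommitFn commit_message)
{
  const size_t limit =
      Realm::Network::recommended_max_payload(target, /*with_congestion=*/true,
                                              header_bytes);
  size_t msglen = 0;
  begin_message();
  for(const Entry &e : entries) {
    // If adding this entry would push msglen past the recommended max,
    // send what we have and start a new message instead of overflowing.
    // (A single entry larger than the limit would itself need splitting,
    // which this sketch does not attempt.)
    if((msglen + e.bytes) > limit) {
      commit_message();
      begin_message();
      msglen = 0;
    }
    append_entry(e);
    msglen += e.bytes;
  }
  commit_message();
}
```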