Problems with ucx-conduit+PSHM in CI

bonachea commented 1 year ago

Background

This issue is forked from issue #6, where the GASNet-EX configure default of --enable-pshm was restored for most Realm build configurations in 7a073d32039cd0e6735a35f93f28ba47349f4350, thereby enabling GASNet's efficient shared-memory transport, which provides huge speedups for intranode comms when running multiple processes-per-node.

Unfortunately initial CI testing with ucx-conduit+PSHM in CI led to some new failures, and as a result PSHM support was quickly re-disabled for the ucx-conduit configuration in f9d1a06553f4f8a230a339c1e321a62825de3fd7. This issue exists to triage and hopefully solve the CI failures, so the PSHM enable can be restored in configs/config.ucx.release.

It's worth noting that ucx-conduit currently remains an "experimental" conduit (and likely to remain that way in the near-term), for reasons of both stability and performance. As of the current GASNet v2022.9.0 release there's very very few use cases where ucx-conduit might be preferable to either ibv-conduit (on InfiniBand systems) or ofi-conduit (on Slingshot-10 systems). Those production-quality conduits are currently both more robust and more performant than ucx-conduit in basically all our testing. So IMHO Legion users should never be using ucx-conduit in production, meaning this issue to polish Legion's use of ucx-conduit is probably low-priority.

Initial requests:

The provided pipeline log reveals it was built against (1.5 year old) GASNet-EX version 2021.3.0. This is despite recent commit 973d1a510267976f0c389ca348daeed21617e30e that sets this repo's Makefile default GASNET_VERSION to the current GASNet release, so I'm guessing this an accidental oversight in the CI scripting. There have been non-trivial improvements made to both ucx-conduit and PSHM internals since 2021.3.0, so can we please re-run against the current GASNet-EX 2022.9.0 release to avoid potentially wasting time triaging already-fixed defects?
Can we please try re-runs using GASNet's --enable-debug aka GASNET_DEBUG mode to enable assertions and envvar GASNET_BACKTRACE=1 to get backtraces? This might help us narrow down what's happening (e.g. if Realm happens to be breaking any checkable preconditions on GASNet calls).

ucx-conduit/terra failure mode

The ucx+PSHM failure point on the two terra tests looks like this:

WARNING: ucx-conduit is experimental and should not be used for
          performance measurements.
          Please see `ucx-conduit/README` for more details.
[0 - 7f4da2147bc0]    0.223741 {6}{realm}: network still not quiescent after 10 attempts
[1 - 7fa8a5eddbc0]    0.223742 {6}{realm}: network still not quiescent after 10 attempts
Signal Signal 6 received by node 0, process 6 received by node 1, process 100718 (thread 7fa8a5eddbc0100717 (thread 7f4da2147bc0) - obtaining backtrace

Based on the message I'm assuming something in the realm logic decided to "give up" on test program exit quiescence, presumably based on some heuristic (of which I have no knowledge). Could someone explain how that works? In particular, does it use real wallclock time (Does 0.223742 indicate it gave up after about a ~200 ms timeout?), or does it rely primarily on the latency/overheads of GASNet AM (which differs wildly between the UCX and shared-memory transports, meaning the heuristic might just need adjustment?).

Recommendation: Investigate the quiescence heuristic, and in particular the time basis for the abort condition

CC: @streichler @elliottslaughter @phhargrove

streichler commented 1 year ago

I'll need to dig into this a bit. We've been seeing some failures like this every once in a while, but haven't been able to reproduce them reliably. The Realm quiescence check is not time-based, but instead assumes that a gasnet collective is not measurably faster than the network latency of an AM request. If that's not true with pshm enabled, I may need to go back to rolling my own AM-based "collective" operation to flush things.

bonachea commented 1 year ago

@streichler wrote:

The Realm quiescence check is not time-based, but instead assumes that a gasnet collective is not measurably faster than the network latency of an AM request.

Not sure exactly what this means. On average and at large scale, gasnet collectives should usually have higher overall latency than an AM request. However there are no implicit ordering constraints between in-flight communication operations; in particular they are NOT guaranteed to complete in the order they were initiated (this applies to all non-blocking GASNet operations). What this means in practice is that in rare/unusual cases an AM request may be delayed and have a higher end-to-end latency than the overall latency of a concurrent collective. Also worth noting: I say "overall latency" because in some cases of load imbalance where a straggler process arrives "last" at a collective, it could see that collective complete almost immediately (without any wire delay), an interval which could be much shorter than any off-node AM latency.

Now if we're talking about small scale, collectives may have overall latencies comparable to an AM request, even on average. PSHM activates hierarchical collectives with very fast on-node coordination, so a mostly on-node collective could easily rival the performance of an off-node AM request or possibly even a single on-node AM. If this performance behavior breaks the quiescence algorithm, then I agree some adjustment may be needed.

If that's not true with pshm enabled, I may need to go back to rolling my own AM-based "collective" operation to flush things.

It is not possible to write anything with AM that is guaranteed to reliably "flush things". Concurrently in-flight AM's are unordered, full stop. Point-to-point AMs of similar size between the same source and target will often arrive in the order they were issued, but this is not in any way guaranteed.

StanfordLegion / gasnet