StanfordLegion / gasnet

Obsolete GASNet versions in repository #10

Closed by elliottslaughter 1 year ago

elliottslaughter commented 1 year ago

Per https://github.com/StanfordLegion/gasnet/pull/9#issuecomment-1385858418, we have some obsolete GASNet versions in this repository. As of this moment, the oldest versions are 1.32.0 and 2021.3.0.

Per https://github.com/StanfordLegion/gasnet/pull/9#issuecomment-1386002180 we may be able to replace GASNet 1.32.0 with a new version of EX (at least 2022.3.0), but this depends on the driver used on Quartz.

bonachea commented 1 year ago

Summarizing from the linked comment... we think the only remaining use case for the ancient/unsupported GASNet 1.32.0 dependency is OPA systems (which for several years were not supported by GASNet-EX). GASNet-EX restored support for ofi-conduit in v2022.3.0, so we believe that GASNet-EX 2022.3.0+ should be "at least as good as" that ancient G-1 version (and possibly much better). There is ongoing work to improve GASNet-EX's use of the new Cornelis stack (and their new OPX provider), but we believe the ancient GASNet-1 version is not any better off in this regard.

PHHargrove commented 1 year ago

While no release of GASNet (1 or EX) currently supports the "opx" libfabric provider written by Cornelis, I have no reason to doubt that the "psm2" provider is supported at least as well in 2022.9.2 as in any GASNet-1 release. In fact, I would not be surprised if support is better in the new release due to general improvements. All indications I have seen are that "psm2" and "opx" providers currently support the same h/w (and kernel driver) versions.
Fwiw, I do test GASNet-EX over psm2 provider weekly.

elliottslaughter commented 1 year ago

I am trying to contact users of LLNL Quartz to see what the current situation is over there.

So to summarize:

If we can migrate users away from GASNet 1 for this purpose, I will remove it from this repository, which should enable further modernizations.

bonachea commented 1 year ago

@elliottslaughter That's a correct summary, and sounds like the right plan forward.

bonachea commented 1 year ago

I just noticed there's a configs/config.psm.release file that enables psm-conduit (GASNet directly over Omni-Path PSM2).

This conduit was officially deprecated in GASNet 1.30.0 (7a86d93d5b) after Intel (who originally contributed the conduit implementation for their hardware) ceased development on PSM2 and this conduit, in favor of ofi-conduit. psm-conduit appeared as a deprecated conduit in the 1.32.0 release, but has not been maintained since mid-2017 and had several known performance/stability issues at that time. psm-conduit was never ported to the GASNet-EX branch.

@elliottslaughter Hopefully none of your 1.32.0 users are still using psm-conduit? If they are I'd strongly recommend their upgrading to ofi-conduit in GASNet-EX 2022.3.0+, for reasons of performance, stability and support. Note this presumably requires creating a new configs/config.ofi-omnipath.release file for ofi-conduit on Omni-Path, since that use case is currently absent from this repo.

bonachea commented 1 year ago

Turns out most of the configure argument modernization I want to apply can be done even with a 1.32.0 version floor. PR #12 does this.

PR #11 adds an ofi-omnipath config file as suggested in my previous comment (with modernized options, as per PR #12)

bonachea commented 1 year ago

With PRs #11 and #12 merged, I think the remaining open question is validation of Legion using GASNet-EX 2022.3+ on Quartz using the new ofi-omnipath config file, which would allow us to drop the psm config and ancient GASNet 1.32.0 from the repo.

PHHargrove commented 1 year ago

Fwiw, searching through old email I found mention of another OmniPath system at LLNL: https://hpc.llnl.gov/hardware/compute-platforms/ruby

elliottslaughter commented 1 year ago

I am in contact with some of the relevant users now and have asked them to test the new configuration.

bonachea commented 1 year ago

> I am in contact with some of the relevant users now and have asked them to test the new configuration.

FWIW the new ofi-omnipath config file I added does not encode any settings related to job spawning, because spawning details tend to be highly system specific and IMO don't belong in a config file that's intended to be generic across all Omni-Path clusters.

For better or worse the current psm config file includes --enable-pmi. So if they were previously using that config and wish to continue using PMI-based spawning on their Omni-Path system, they may need to set GASNET_EXTRA_CONFIGURE_ARGS="--enable-pmi --with-ofi-spawner=pmi" when running make. Otherwise the configure-established default will be ssh-based spawning.

The configure-established default can alternatively be overridden at runtime (assuming the PMI library was detected) by setting envvar GASNET_OFI_SPAWNER=pmi
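
As a rough sketch of the two overrides described above (not taken from the repository's documentation): the `CONDUIT=ofi-omnipath` make variable is an assumption based on the new config file's name, and the build command is only printed here so the snippet is safe to paste outside a checkout.

```shell
# Build time: ask configure to require PMI and make it the default
# spawner for ofi-conduit (arguments quoted from the comment above).
EXTRA_ARGS="--enable-pmi --with-ofi-spawner=pmi"

# Assumption: the conduit is selected with CONDUIT=ofi-omnipath.
# Printed rather than executed; run the printed command from the repo root.
echo make CONDUIT=ofi-omnipath GASNET_EXTRA_CONFIGURE_ARGS="\"$EXTRA_ARGS\""

# Run time alternative: keep the configure-time default (ssh) in the build,
# but select PMI for this job. Only works if PMI was detected at configure time.
export GASNET_OFI_SPAWNER=pmi
```

The build-time route fails at configure on systems without a PMI library, while the run-time route leaves the build portable and moves the choice into the job script.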

bonachea commented 1 year ago

PR #13 contains my proposed resolution for this issue, assuming the relevant users confirm that ofi-conduit is working for them.

elliottslaughter commented 1 year ago

I seem to recall the reason for --enable-pmi (as opposed to our usual --enable-mpi-compat) was because of a fundamental limitation in the underlying PSM2. It was unable to create 2 endpoints in the same process, or something along those lines. So PMI was our workaround to avoid the conflict with MPI in the same process.

I realize in a theoretically pure world, the spawner is independent from the conduit, but in our practical usage they seem to be highly correlated (few users ask us to change these defaults), and with PSM2 specifically, I think it can be argued that PMI is the only sensible option, particularly because I'm not actually aware of any Omni-Path machines besides Quartz and Ruby.

Am I missing anything?

bonachea commented 1 year ago

> fundamental limitation in the underlying PSM2. It was unable to create 2 endpoints in the same process, or something along those lines.

@elliottslaughter you are correct that PSM2 has a one-endpoint-per-process restriction that makes it harder to use MPI spawner (but not impossible, as many MPIs provide envvar overrides that force MPI to use TCP/IP instead of PSM2, avoiding the problem). In any case, due to this restriction I'd never recommend mpi-spawner or --enable-mpi-compat as the first choice on an Omni-Path network (only as a last resort).

> Am I missing anything?

What you are missing is the "third option" which is ssh-based spawning (the default spawner for psm/ofi/ibv/ucx-conduits when MPI spawning is disabled). ssh spawner uses fork and TCP sockets for spawning so it (often) does not consume precious resources on the high-speed network, and perhaps more importantly it does not require library dependencies on other software such as a working PMI library+daemon or working MPI library. The main downsides of ssh-spawner are (1) it requires passwordless (e.g. host or agent authenticated) ssh connections to the compute nodes (which can sometimes run afoul of system firewall rules), and (2) it generally requires setting at least one envvar (GASNET_SSH_NODEFILE or GASNET_SSH_SERVERS) and in rarer cases others to get a working spawn.
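
A minimal sketch of downside (2), assuming `GASNET_SSH_SERVERS` takes a whitespace-separated host list and `GASNET_SSH_NODEFILE` names a file with one host per line (check the ssh-spawner README for the exact syntax). The host names and the node file path are placeholders, and `GASNET_OFI_SPAWNER=ssh` is assumed by analogy with the `pmi` value mentioned earlier in this thread.

```shell
# Option 1: name the compute nodes directly.
# node01/node02 are placeholder host names; the separator is an assumption.
export GASNET_SSH_SERVERS="node01 node02"

# Option 2 (alternative to option 1): point at a file listing the hosts.
# The path is a placeholder.
# export GASNET_SSH_NODEFILE="$HOME/nodes.txt"

# Select the ssh spawner for this job, in case the build's default differs.
# The value "ssh" is assumed to parallel the "pmi" value.
export GASNET_OFI_SPAWNER=ssh
```

Passwordless ssh to each listed host still has to work before the job launcher can use these settings.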

> I'm not actually aware of any Omni-Path machines besides Quartz and Ruby

It's true that big Omni-Path installations are rare in DOE, but they remain more common in Europe and elsewhere. 35 of the Nov 2022 top500 are using Omni-Path, including a German system that is currently number 29, and TACC's Stampede2 which is number 51.

> in a theoretically pure world, the spawner is independent from the conduit, but in our practical usage they seem to be highly correlated

The statement above has definitely been true for conduits like aries-conduit, gemini-conduit, pami-conduit, etc., which target a proprietary network that is exclusively sold as part of a vendor-integrated system with a software environment that includes a uniform job spawner. However, it's far less true for conduits like ibv-conduit, psm-conduit, ucx-conduit, ofi-conduit (non-Cray providers) that target commodity or near-commodity NICs that are often sold separately from the system, where the admin is left to design their own software environment with job spawner. These conduits operate in the "wild west" of job spawning, and we routinely see each of ibv-conduit's three spawning options turn out to be the best choice for a given system.

Anecdotally Omni-Path is also somewhat popular in smaller departmental-style clusters. Such clusters are less likely to be professionally managed, making job spawning flexibility more important.

elliottslaughter commented 1 year ago

Thanks, @bonachea, for the context. I appreciate it.

My gut feeling right now is that we need to serve the users we have, before we try to serve (hypothetical) users who might run into issues later on. As you point out, spawners are a larger issue than just the OmniPath network, and it seems to me that this is a larger refactoring that needs to be done in this repository. One option would be to add a SPAWNER variable that parallels CONDUIT. But regardless of the specifics, it seems like a larger change, and one that remains abstract until we get users who need it (and are willing to test drive it for us).

For now I plan to add --enable-pmi --with-ofi-spawner=pmi to ofi-omnipath to get it up to par with what we used to do in psm2. When we get users who need something else, we can revisit this.

bonachea commented 1 year ago

> it seems to me that this is a larger refactoring that needs to be done in this repository. One option would be to add a SPAWNER variable that parallels CONDUIT.

To be clear, I was not intending to argue for the refactoring you describe. The Makefile already provides the GASNET_EXTRA_CONFIGURE_ARGS variable as a "manual override" that can be used to tweak site-specific settings like spawner. That seems sufficient for now?

> I plan to add --enable-pmi --with-ofi-spawner=pmi to ofi-omnipath to get it up to par with what we used to do in psm2

Note that the second argument would be a change in behavior relative to the psm config, which passed the first argument to force detection of PMI, but did not pass the second argument to actually set it as the default spawner, meaning the default spawner was still ssh-spawner (a position in the config space which doesn't make much sense to me). The change you propose would both force-require PMI and set it to the default spawner, which makes more sense for sites with a working PMI; however, that will result in a configure error on sites lacking PMI.

Note the current behavior of ofi-omnipath with neither configure option is to auto-detect PMI, link it in only if it appears usable, and leave the default spawner as SSH (but spawner selection can be changed at runtime).

elliottslaughter commented 1 year ago

Ok, thanks. Given this situation I'll leave it as is for now, and be in contact with the main users of OmniPath to see if this works for them.

Is there a way to set a preference order on the default spawner? Usually we'll probably want something like:

Having SSH as the top default doesn't make a lot of sense for most Legion users unless nothing else was successfully configured.

PHHargrove commented 1 year ago

@elliottslaughter As Dan sort of mentions earlier in this issue, we only have the means to set a default at configure time and an env var to make the final selection at run time. The only "smarts" is the configure-time "is MPI available?" selection between MPI or SSH as the value used in the absence of a configure option.

I want to note that the list you suggest is one I'd not favor as a default, because I've seen enough systems where the PMI libraries exist but are useless for job launch. So, the best option might be to provide the means to express the desired precedence, rather than make that decision in advance.

Prior to this discussion, I'd not given any thought to a prioritized list. However, use of a colon-delimited list (as for $PATH) is conceivable, for both the configure option and env var. If this idea interests you, please consider entering an enhancement request at https://gasnet-bugs.lbl.gov/bugzilla
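
To make the colon-delimited idea concrete, here is a user-side sketch of the selection logic. This is not an existing GASNet feature: `pick_spawner` is a hypothetical helper, and the list of usable spawners is something the caller would have to determine for their own site.

```shell
# Hypothetical helper (not part of GASNet): print the first entry of a
# colon-delimited preference list that also appears in a space-delimited
# list of spawners known to work on this system.
pick_spawner() {
    prefs=$1      # e.g. "pmi:mpi:ssh", highest priority first
    usable=$2     # e.g. "mpi ssh"
    old_ifs=$IFS
    IFS=:         # split the preference list on colons, as for $PATH
    for candidate in $prefs; do
        case " $usable " in
            *" $candidate "*)
                IFS=$old_ifs
                echo "$candidate"
                return 0
                ;;
        esac
    done
    IFS=$old_ifs
    return 1      # nothing in the preference list is usable
}

# Example: PMI is preferred but not usable here, so MPI is chosen.
GASNET_OFI_SPAWNER=$(pick_spawner "pmi:mpi:ssh" "mpi ssh")
export GASNET_OFI_SPAWNER
```

Because the env var currently accepts a single value, a wrapper like this would have to run in the job script before launch. It also leaves the precedence to the site, which matches the concern that PMI libraries can exist without being usable for job launch.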

elliottslaughter commented 1 year ago

#14 was the last known user of 1.32.0. I will now remove that version from the repository.

elliottslaughter commented 1 year ago

I merged #13. If there are any other modernizations you'd like to do, let me know.

bonachea commented 1 year ago

@elliottslaughter I think all our main configure modernization concerns are now resolved.

Thanks!