Closed elliottslaughter closed 1 year ago
It matches what I was thinking of, but now that I see it, I wonder if SYSTEM is too generic a name? make will initialize variables from the environment, so if anybody has SYSTEM=... in their environment, they won't get what they want here.
I'm happy to bikeshed as desired. Some possible options:
SUBCONFIG
SUBCONDUIT
NETWORK
NETWORK_SYSTEM
CONDUIT_SYSTEM
GASNET_SYSTEM
How about GASNET_CONDUIT and GASNET_SYSTEM, with CONDUIT being a legacy alias for GASNET_CONDUIT?
@streichler Ok, how does it look now?
Re: naming choice of $GASNET_SYSTEM:
I don't love the subtle implication that this setting is somehow an entity defined by GASNet (unlike $GASNET_CONDUIT, which DOES correspond to a GASNet concept).
The values of $GASNET_SYSTEM represent canned configurations that are entirely a fabrication of this StanfordLegion/gasnet repository and don't exist in implementation or documentation anywhere else. I understand this name is motivated by LEGION's use of GASNet, but that subtle distinction is likely to be lost on users. As such I'd prefer a name that didn't include GASNET_ to reflect that. From @elliottslaughter's suggestions, any of SUBCONFIG, NETWORK_SYSTEM or CONDUIT_SYSTEM seem like better choices.
Just my 2c...
I think it's also worth mentioning here that the actual distinction being made is GASNet's configure --with-ofi-provider setting, which (when omitted) DOES include automatic defaulting logic that works reasonably well on many systems (configure will use fi_info to query for available providers). So it might be worth providing a config flavor that omits that argument and uses auto-detection, thus only requiring this override for rare systems where the detection doesn't work (e.g. where the build node lacks the high-speed network hardware).
FWIW we also provide a configure --with-ofi-provider=generic setting that builds the conduit in a portable mode for ANY supported provider, at some additional overhead cost in adaptation to the provider at runtime (instead of statically for a specific provider). So that's also available as an "always works" option, although sub-optimal for any particular system.
CC: @PHHargrove
Thanks, @bonachea. About the auto-configure option: are there any plausible scenarios where configuration might succeed but then fail to find the correct OFI provider? I.e., can we rely on this logic to either: (a) do the "right" thing (for some definition of "right"), or (b) fail outright? Or are there any scenarios where there is a genuine question of what the "right" answer is (e.g., networks on which there are legitimately two different providers that users might want to use, depending on the circumstances)?
are there any plausible scenarios where configuration might succeed but then fail to find the correct OFI provider?
Yes. Some examples include:
fi_info at configure time only reports generic TCP-based providers instead of the high-performance PSM2 provider that should be used on the compute nodes.
I.e., can we rely on this logic to either: (a) do the "right" thing (for some definition of "right"), or (b) fail outright? Or are there any scenarios where there is a genuine question of what the "right" answer is (e.g., networks on which there are legitimately two different providers that users might want to use, depending on the circumstances)?
As I mentioned, there's a "generic" provider setting that does all the provider adaptation at runtime based on the "best" provider it finds at job startup (where "best" is currently defined by this ordered list: "cxi psm2 gni verbs;ofi_rxm efa sockets udp;ofi_rxd tcp;ofi_rxm"). This adds some runtime overhead in the steady state, but is the "safest" option in cases where we really cannot or prefer not to make a decision at configure time.
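To make the "best provider" rule concrete, here is a minimal shell sketch of first-match selection over the ordered list quoted above. It is not GASNet's implementation: the function name pick_provider is hypothetical, and it assumes the available provider names have already been collected (for example from fi_info) and are passed as arguments.

```shell
# Priority order quoted in the comment above (highest priority first).
PRIORITY="cxi psm2 gni verbs;ofi_rxm efa sockets udp;ofi_rxd tcp;ofi_rxm"

# pick_provider AVAILABLE...  (hypothetical helper, for illustration only)
# Prints the first entry of PRIORITY that also appears among the arguments.
# Returns 1 if none of the arguments is in the priority list.
pick_provider() {
  # $PRIORITY is deliberately unquoted so it splits on whitespace;
  # the semicolons inside entries are plain characters after expansion.
  for candidate in $PRIORITY; do
    for avail in "$@"; do
      if [ "$candidate" = "$avail" ]; then
        printf '%s\n' "$candidate"
        return 0
      fi
    done
  done
  return 1
}
```

For example, `pick_provider "tcp;ofi_rxm" psm2` prints psm2, while `pick_provider "tcp;ofi_rxm" sockets` prints sockets. The second call mirrors the failure mode described earlier: if only generic providers are visible where the selection runs, a generic provider is what gets chosen.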
Assuming you are NOT using that --with-ofi-provider=generic provider, the remaining options are:
Auto-detection (omitting the --with-ofi-provider option), which queries fi_info and applies the priority list above to statically compile for the best provider it finds in the configure environment.
--with-ofi-provider=X, which demands particular provider X at configure time.
Both select a provider at configure time and apply static optimizations to the conduit code based on that choice. There are at least two ways this might get the "wrong answer" in corner cases:
Even outside of ofi-conduit, our startup checks in NDEBUG mode include looking for particular hardware (e.g. Mellanox InfiniBand HCAs) and warning if the conduit/provider choice looks sub-optimal. However these checks are not fool-proof, and they might be ignored or silenced by the end-user.
Thanks. My gut feeling based on my current understanding of the tradeoffs is to continue down the current path and just fine-tune the variable names to accurately reflect what we're trying to accomplish.
This approach allows for GASNET_SYSTEM (discussion on naming to follow below) to be blank, which would mean that a system-agnostic config/config.ofi.release file would probably need that auto-detection goodness to work at all?
For the naming, I understand the concern, but the variables are specific to the GASNet build we're doing, so having GASNET in the name seems reasonable. If we wanted to be explicit that it was relevant (only) to Legion's build of GASNet, then that would suggest LEGION_GASNET_CONDUIT and LEGION_GASNET_SYSTEM. The other way out of this is to tell make to ignore SYSTEM if it came from the environment (i.e. it'd have to be on the command line to have an effect), but that'd probably surprise people too.
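For reference, the "ignore SYSTEM if it came from the environment" alternative could be sketched with GNU Make's origin function, which reports where a variable's value came from. The snippet below is only an illustration under assumptions: demo_origin and the throwaway makefile are hypothetical, not part of this repository, and it requires GNU Make on the PATH.

```shell
# demo_origin [MAKE_ARGS...]  (hypothetical helper, for illustration only)
# Runs GNU Make on a throwaway makefile that clears SYSTEM when the value
# was inherited from the environment, then prints the resulting value.
demo_origin() {
  mf=$(mktemp) || return 1
  cat > "$mf" <<'EOF'
# $(origin SYSTEM) is "environment" only for values inherited from the
# environment; command-line values report "command line" instead.
ifeq ($(origin SYSTEM),environment)
SYSTEM :=
endif
$(info SYSTEM=[$(SYSTEM)])
all: ;
EOF
  make -s -f "$mf" "$@"
  status=$?
  rm -f "$mf"
  return $status
}
```

Running `(SYSTEM=from_env; export SYSTEM; demo_origin)` prints `SYSTEM=[]`, while `demo_origin SYSTEM=from_cmdline` prints `SYSTEM=[from_cmdline]`. As noted above, silently dropping an environment value may surprise users, which is one argument for a more specific variable name instead.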
This approach allows for GASNET_SYSTEM to be blank, which would mean that a system-agnostic config/config.ofi.release file would probably need that auto-detection goodness to work at all?
As currently written I don't think the code literally allows for GASNET_CONDUIT=ofi GASNET_SYSTEM="" input.
However I agree with Sean's suggestion that offering more generic options could be valuable. Generalizing, you could even provide both config/config.ofi-auto.release and config/config.ofi-generic.release, where GASNET_SYSTEM=auto omits --with-ofi-provider to activate configure-time "auto-detection goodness" and GASNET_SYSTEM=generic passes --with-ofi-provider=generic to activate the fully general (but most expensive) runtime adaptation.
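As a rough sketch of that mapping (not code from this repository), the two proposed values could translate into configure arguments as follows. The helper name ofi_provider_flag is hypothetical, and system-specific values are deliberately left unhandled here.

```shell
# ofi_provider_flag SYSTEM  (hypothetical helper, for illustration only)
# Prints the configure argument implied by the proposed system value.
ofi_provider_flag() {
  case "$1" in
    auto)
      # Print nothing: omitting --with-ofi-provider lets configure
      # auto-detect a provider via fi_info.
      ;;
    generic)
      # Fully general runtime adaptation, at some steady-state cost.
      printf '%s\n' "--with-ofi-provider=generic"
      ;;
    *)
      # Other (system-specific) values are outside this sketch.
      return 1
      ;;
  esac
}
```

With this, `ofi_provider_flag generic` prints `--with-ofi-provider=generic` and `ofi_provider_flag auto` prints nothing, so the output can be appended to a configure command line in either case.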
To add my $0.02 USD regarding naming:
If, as I gather from @elliottslaughter's initial comment, the name fragment SYSTEM is meant to allow the user to convey either of the ideas "use this network" or "use settings appropriate for Frontier", then I offer PLATFORM and TARGET as possible synonyms.
Please don't use GASNET_ as the prefix. Doing so risks conflict with things in GASNet itself, especially since GNU Make will export all make variable settings to the environment by default. For instance, GASNET_PLATFORM is a shell variable used in our configure.
I have updated the PR to use the names LEGION_GASNET_CONDUIT and LEGION_GASNET_SYSTEM. The old spelling CONDUIT is still supported for backwards compatibility with existing users.
Let me know if you have any further concerns.
This allows you to run e.g.:
Instead of:
(The CONDUIT spelling is still supported for backwards compatibility.)
The intention is that LEGION_GASNET_SYSTEM could be a specific system (e.g., Frontier) or a class of systems (e.g., slingshot11 for all systems that use the Slingshot 11 network).
This is intended to help us address https://github.com/StanfordLegion/legion/issues/1468.