Open elliottslaughter opened 2 days ago
Just FYI, we've seen this in the wild. @syamajala installed cuNumeric with GASNet from the Conda packages, but the GASNet wrapper was not included, and instead of a hard error we got only a warning, which we initially missed in our testing. We spent some time chasing down an OOM condition that turned out to be caused by running the job in duplicate. That was ultimately a waste of time, since the memory usage would have been fine if we'd known the network had failed to initialize.
CC @manopapad
It is possible that the application is built with multiple network modules, so even if UCX fails, we may fall back to trying other networks.
Could we do something like keep two counters:
int num_networks_attempted = 0;
int num_networks_initialized = 0;
And if num_networks_attempted > 0 && num_networks_initialized == 0, then throw an error?
Basically if we attempt any networks, at least one should succeed.
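To make the two-counter idea concrete, here is a minimal sketch. It is not Realm's actual initialization code: the NetworkAttempt struct and the check_network_init function are hypothetical names invented for illustration, and only the two counters and the error condition come from the comment above.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the outcome of one network module's bootstrap.
// This is NOT Realm's real module interface.
struct NetworkAttempt {
  std::string name;   // illustrative label, e.g. "ucx"
  bool initialized;   // true if this module's bootstrap succeeded
};

// Counts attempts and successes, and throws if at least one network was
// attempted but none initialized. Returns the number of successes otherwise.
int check_network_init(const std::vector<NetworkAttempt>& attempts) {
  int num_networks_attempted = 0;
  int num_networks_initialized = 0;
  for (const NetworkAttempt& attempt : attempts) {
    ++num_networks_attempted;
    if (attempt.initialized) ++num_networks_initialized;
  }
  if (num_networks_attempted > 0 && num_networks_initialized == 0) {
    throw std::runtime_error("all attempted networks failed to initialize");
  }
  return num_networks_initialized;
}
```

With no attempts at all (a build without networks) the check passes, so single-node builds are unaffected by this rule.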
Yeah, that is doable. @muraj @apryakhin Let me know what you think.
The main thing we'll lose if we do this is omnibus builds, where you build every possible option at build time and then enable only what is available at runtime.
If that's a goal, perhaps we could add a flag like:
-ll:networks <N>
And then the condition becomes num_networks_initialized >= N. That is, the flag essentially says "Please make sure we successfully initialize at least this many networks."
Depending on what we expect the most common use case to be, we can pick the default to be either:
- 0: i.e., never fail by default even if we built with a network
- 1 if we built with at least one network, otherwise 0: so that we catch failures where the user expected to have a network but it didn't work

I'd personally expect omnibus builds to not be the most common deployment option for Legion, but I could be convinced otherwise if other people have opinions.
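The threshold variant, with the second default (1 if any network was built, otherwise 0), could be sketched roughly as follows. The function names and the std::optional parameter are assumptions made for illustration, not existing Realm code, and the sketch deliberately leaves open how the value N would reach the runtime.

```cpp
#include <optional>

// Hypothetical default: require one working network whenever the build
// includes at least one network module, and none otherwise.
int default_min_networks(int num_networks_built) {
  return num_networks_built > 0 ? 1 : 0;
}

// Returns true if enough networks initialized. requested_min models a
// user-supplied N; when it is absent, the default above applies.
bool enough_networks(int num_networks_initialized,
                     int num_networks_built,
                     std::optional<int> requested_min) {
  const int required =
      requested_min.value_or(default_min_networks(num_networks_built));
  return num_networks_initialized >= required;
}
```

Under this sketch an omnibus build would still be possible by explicitly passing a minimum of 0, at the cost of an extra argument for that use case.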
There is an option -ll:networks already, that expects a list of networks to try in order, or the special "none".
IMHO if Realm was built with any network, then at runtime it should try all its built-in networks until it finds one that works. If no network works, then it should fail w/o falling back to non-networked execution (i.e. the "none" network option should not be considered by default, unless explicitly passed by the user in -ll:networks).
FWIW in Legate we're currently doing separate builds for UCX and GASNetEx, but we'd like to be doing omnibus builds. We are prepared to pass -ll:networks none explicitly when doing single-node runs (and can leave -ll:networks undefined in multi-node runs).
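The selection policy proposed in this comment can be sketched as below. This is only an illustration of the rule "try built-in networks in order, and treat none as opt-in": select_network, its parameters, and the try_init callback are hypothetical, and the network name strings are placeholders rather than the values Realm actually accepts.

```cpp
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// built_in:  networks compiled into the binary, in the order to try them.
// requested: the user's explicit list, if one was given; may contain "none".
// try_init:  hypothetical callback, returns true if the named network
//            bootstraps successfully.
// Returns the name of the selected network, or "none" for non-networked runs.
std::string select_network(
    const std::vector<std::string>& built_in,
    const std::optional<std::vector<std::string>>& requested,
    const std::function<bool(const std::string&)>& try_init) {
  const std::vector<std::string>& candidates =
      requested ? *requested : built_in;
  for (const std::string& name : candidates) {
    if (name == "none") return "none";  // only reachable if the user asked
    if (try_init(name)) return name;
  }
  // Nothing to try (e.g. a build without any network): run non-networked.
  if (candidates.empty()) return "none";
  // Networks were tried and all failed: fail instead of silently continuing.
  throw std::runtime_error("no network could be initialized");
}
```

In this sketch a single-node run of an omnibus build would pass an explicit list containing only none, while a multi-node run that leaves the list unset would get a hard failure if every built-in network fails.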
I would be fine with @manopapad's proposal.
The question is whether we should fall back to the none network option. Currently, we always fall back without raising a failure.
So I complained about this to @streichler and @SeyedMir a while ago. The reason that was given to me before was that Realm wanted to try and make progress and run even if modules like UCX or GASNet didn't load correctly, that way a binary could be portable to multiple machines. I'm not sure if that is a property that we still want to maintain or not.
I built Legion with UCX. For some reason the network bootstrap is failing, but my job still runs:
I'm not sure this is the behavior we want. If the user requested networking and it fails to load (for any reason), we should fail hard and fast, and not continue to run anyway.