StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

When network bootstrap fails, job runs in duplicate #1767

Open elliottslaughter opened 2 days ago

elliottslaughter commented 2 days ago

I built Legion with UCX. For some reason the network bootstrap is failing, but my job still runs:

$ srun ./build/bin/circuit 
/scratch/eslaught/legion-test-ucx/runtime/realm/ucx/bootstrap/bootstrap_loader.cc:60: NULL value Bootstrap unable to load 'realm_ucp_bootstrap_mpi.so'
        realm_ucp_bootstrap_mpi.so: cannot open shared object file: No such file or directory
/scratch/eslaught/legion-test-ucx/runtime/realm/ucx/bootstrap/bootstrap_loader.cc:60: NULL value Bootstrap unable to load 'realm_ucp_bootstrap_mpi.so'
        realm_ucp_bootstrap_mpi.so: cannot open shared object file: No such file or directory
[0 - 7fc5a97da3c0]    0.000000 {5}{ucp}: bootstrap_loader_init failed                                                                       
[0 - 7fc5a97da3c0]    0.000000 {5}{ucp}: failed to bootstrap ucp
[0 - 7fc5a97da3c0]    0.000000 {5}{ucp}: failed to create UCP network module
[0 - 7fc5820a6c80]    0.234238 {3}{circuit}: circuit settings: loops=2 pieces=4 nodes/piece=1024 wires/piece=1024 pct_in_piece=95 seed=12345
[0 - 7fc5820a6c80]    0.249111 {3}{circuit}: Initializing circuit simulation...
[0 - 7fc5820a6c80]    0.415783 {3}{circuit}: Finished initializing simulation...     
Starting main simulation loop                                         
SUCCESS!                                                              
ELAPSED TIME =   1.994 s                                                                                                                    
GFLOPS =   3.944 GFLOPS                                                                                                                     
[0 - 7fc5820a6c80]    2.410444 {3}{circuit}: simulation complete - destroying regions
[0 - 7fee552c43c0]    0.000000 {5}{ucp}: bootstrap_loader_init failed                                                                       
[0 - 7fee552c43c0]    0.000000 {5}{ucp}: failed to bootstrap ucp
[0 - 7fee552c43c0]    0.000000 {5}{ucp}: failed to create UCP network module
[0 - 7fee2db90c80]    0.238475 {3}{circuit}: circuit settings: loops=2 pieces=4 nodes/piece=1024 wires/piece=1024 pct_in_piece=95 seed=12345
[0 - 7fee2db90c80]    0.253128 {3}{circuit}: Initializing circuit simulation...
[0 - 7fee2db90c80]    0.424531 {3}{circuit}: Finished initializing simulation...     
Starting main simulation loop                                         
SUCCESS!
ELAPSED TIME =   2.556 s
GFLOPS =   3.077 GFLOPS
[0 - 7fee2db90c80]    2.981113 {3}{circuit}: simulation complete - destroying regions

I'm not sure this is the behavior we want. If the user requested networking and it fails to load (for any reason), we should fail hard and fast, and not continue to run anyway.

elliottslaughter commented 2 days ago

Just FYI, we've seen this in the wild. @syamajala installed cuNumeric with GASNet from the Conda packages, but the GASNet wrapper was not included, and instead of getting a hard error it was just a warning, which we initially missed in our testing. We spent some time chasing down an OOM condition that turned out to be because we were running the job in duplicate, which was ultimately a waste of time since the memory usage would have been fine if we'd known the network had failed to initialize.

CC @manopapad

eddy16112 commented 2 days ago

It is possible that the application is built with multiple network modules, so even if UCX fails, we may fall back to trying other networks.

elliottslaughter commented 2 days ago

Could we do something like keep two counters:

int num_networks_attempted = 0;
int num_networks_initialized = 0;

And if num_networks_attempted > 0 && num_networks_initialized == 0 then throw an error?

Basically if we attempt any networks, at least one should succeed.
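A minimal sketch of the two-counter check described above. `NetworkCandidate` and `init_networks_or_fail` are hypothetical names made up for illustration; they are not Realm's actual types or functions, and the real module-loading code will look different.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a network module; not Realm's actual type.
struct NetworkCandidate {
  std::string name;
  std::function<bool()> try_init;  // returns true if the module initialized
};

// Attempts every candidate and returns how many initialized.
// Throws if at least one network was attempted and none succeeded.
int init_networks_or_fail(const std::vector<NetworkCandidate>& candidates) {
  int num_networks_attempted = 0;
  int num_networks_initialized = 0;
  for (const auto& candidate : candidates) {
    ++num_networks_attempted;
    if (candidate.try_init()) ++num_networks_initialized;
  }
  if (num_networks_attempted > 0 && num_networks_initialized == 0) {
    throw std::runtime_error(
        "all attempted network modules failed to initialize");
  }
  return num_networks_initialized;
}
```

With no candidates at all (a build without networking), nothing is attempted and the function returns 0 without an error, so single-process builds are unaffected.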

eddy16112 commented 2 days ago

Yeah, that is doable. @muraj @apryakhin Let me know what you think.

elliottslaughter commented 2 days ago

The main thing we'll lose if we do this is omnibus builds, where you build every possible option at build time and then enable only what is available at runtime.

If that's a goal, perhaps we could add a flag like:

-ll:networks <N>

And then the condition becomes num_networks_initialized >= N. That is, the flag essentially says "Please make sure we successfully initialize at least this many networks."

Depending on what we expect the most common use case to be, we can pick the default to be either:

  1. 0: i.e., never fail by default even if we built with a network
  2. 1 if we built with at least one network, otherwise 0: so that we catch failures where the user expected to have a network but it didn't work

I'd personally expect omnibus builds to not be the most common deployment option for Legion, but I could be convinced otherwise if other people have opinions.
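As a rough sketch, the threshold condition and the second default listed above could look like this. The function names and the `min_networks` parameter are hypothetical and only illustrate the idea; this is not existing Realm code.

```cpp
#include <stdexcept>
#include <string>

// Default option 2: require one network whenever any network was built in,
// otherwise require none. (Option 1 would simply return 0.)
int default_min_networks(int num_networks_built) {
  return num_networks_built > 0 ? 1 : 0;
}

// Throws unless num_networks_initialized >= min_networks.
void check_min_networks(int num_networks_initialized, int min_networks) {
  if (num_networks_initialized < min_networks) {
    throw std::runtime_error(
        "initialized " + std::to_string(num_networks_initialized) +
        " network(s), but at least " + std::to_string(min_networks) +
        " required");
  }
}
```

An omnibus build would then pass a threshold of 0 explicitly for runs where no network is expected to be available.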

manopapad commented 2 days ago

There is an option -ll:networks already, that expects a list of networks to try in order, or the special "none".

IMHO if Realm was built with any network, then at runtime it should try all its built-in networks until it finds one that works. If no network works, then it should fail w/o falling back to non-networked execution (i.e. the "none" network option should not be considered by default, unless explicitly passed by the user in -ll:networks).
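The selection policy proposed above could be sketched roughly as follows. `select_network` and its parameters are hypothetical, not Realm's API. The sketch also assumes that an explicit "none" entry is honored at its position in the list, which the proposal does not spell out.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the name of the first network that initializes.
// `requested` models the value of -ll:networks; empty means the flag was
// not passed, in which case all built-in networks are tried in order.
std::string select_network(
    const std::vector<std::string>& built_in,
    const std::vector<std::string>& requested,
    const std::function<bool(const std::string&)>& try_init) {
  const std::vector<std::string>& order =
      requested.empty() ? built_in : requested;
  for (const auto& name : order) {
    // "none" is only considered when the user listed it explicitly.
    if (name == "none") return "none";
    if (try_init(name)) return name;
  }
  // A build with no networks and no explicit request runs non-networked.
  if (built_in.empty() && requested.empty()) return "none";
  // Otherwise fail instead of silently falling back.
  throw std::runtime_error("no network module could be initialized");
}
```

Under this policy a multi-node run leaves the flag unset and fails loudly if nothing loads, while a single-node run opts out by listing "none".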

FWIW in Legate we're currently doing separate builds for UCX and GASNetEx, but we'd like to be doing omnibus builds. We are prepared to pass -ll:networks none explicitly when doing single-node runs (and can leave -ll:networks undefined in multi-node runs).

elliottslaughter commented 2 days ago

I would be fine with @manopapad's proposal.

eddy16112 commented 2 days ago

The question is whether we should fall back to the none network option. Currently, we always fall back without throwing a failure.

lightsighter commented 2 days ago

So I complained about this to @streichler and @SeyedMir a while ago. The reason that was given to me before was that Realm wanted to try and make progress and run even if modules like UCX or GASNet didn't load correctly, that way a binary could be portable to multiple machines. I'm not sure if that is a property that we still want to maintain or not.