flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Add --resilient or similar option to `flux mini alloc` and `flux mini batch` #4582

Open grondo opened 2 years ago

grondo commented 2 years ago

In order to fully solve #4417 we need to add an easy way to configure various options for the job and child instance which allow full resiliency to be realized. This option could be flux mini batch --resilient or a different option (not sure if we'd want to allow levels of resiliency). We also need to collect the list of options and tunings that this would enable, e.g.

ofaaland commented 1 year ago

Perhaps another useful feature would be making exclusion of the router nodes (if any) and the rank 0 node easier when launching jobs within the allocation. For example, suppose one has a buggy Lustre (🙄) - the nodes where I/O is performed will be more likely to crash than nodes that do not do I/O. So running user tasks on non-critical nodes would improve resilience.

Would an option like --requires=-critical for flux mini run / flux mini submit work?

Also,

set the broker attribute tbon.fanout=0

may not be practical for very large allocations (say, a thousand nodes), and even for a smaller allocation the rank 0 node is critical. So perhaps --resilient, when applied to flux mini alloc or flux mini batch, could also calculate the number of critical nodes required given the user request and the fanout, and allocate that many extra nodes if they are available.

In case I'm unclear, my cheesy script to facilitate this right now looks like this:

[faaland1@fluke108 branch:main flux-usage] $./flux-alloc-resilient calc_size --app 64 --fanout 16
request 70 nodes by providing flux mini arguments
    -N70 --broker-opts=-Stbon.topo=kary:16
[faaland1@fluke108 ~] $flux mini alloc -N23 --broker-opts=-Stbon.topo=kary:4
flux-job: ƒkwfnGGNvSB started
[faaland1@fluke6 branch:main flux-usage] $flux resource list
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free batch          23       92        0 fluke[6,8,10,12-13,16,19-21,25-29,32-38,41-42]
 allocated                 0        0        0 
      down                 0        0        0
[faaland1@fluke6 branch:main flux-usage] $./flux-alloc-resilient exclude_list
run application on non-critical nodes by providing flux mini argument
    --requires=-host:fluke[6,8,10,12-13,16]
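For anyone following along, here is a rough Python sketch of the arithmetic behind the session above. It is not the flux-alloc-resilient script, which is not shown in this thread. It assumes that kary:K numbers broker ranks in level order (rank r is the parent of ranks K*r+1 through K*r+K) and that ranks map onto hosts in the order printed by flux resource list. The function names are hypothetical. Under those assumptions the critical nodes are rank 0 plus every rank that has children.

```python
from typing import List


def critical_rank_count(total_nodes: int, fanout: int) -> int:
    """Count ranks that are rank 0 or have at least one child.

    Assumption: level-order k-ary layout, so the N-1 non-root ranks are
    handed out to parents 0, 1, 2, ... in groups of `fanout`.
    """
    if total_nodes < 1 or fanout < 1:
        raise ValueError("total_nodes and fanout must be >= 1")
    parents = -(-(total_nodes - 1) // fanout)  # ceil((N - 1) / fanout)
    return max(1, parents)  # rank 0 is critical even with no children


def min_total_nodes(app_nodes: int, fanout: int) -> int:
    """Smallest allocation that leaves `app_nodes` non-critical nodes."""
    if app_nodes < 1 or fanout < 2:
        raise ValueError("need app_nodes >= 1 and fanout >= 2")
    total = app_nodes + 1  # at least one extra node for rank 0
    while total - critical_rank_count(total, fanout) < app_nodes:
        total += 1
    return total


def critical_hosts(hosts: List[str], fanout: int) -> List[str]:
    """Hosts to exclude, assuming hosts[i] runs broker rank i."""
    return hosts[: critical_rank_count(len(hosts), fanout)]


if __name__ == "__main__":
    # Host numbers copied from the NODELIST column in the session above.
    nums = [6, 8, 10, 12, 13, 16, 19, 20, 21, 25, 26, 27,
            28, 29, 32, 33, 34, 35, 36, 37, 38, 41, 42]
    hosts = [f"fluke{n}" for n in nums]
    print(min_total_nodes(17, 4))
    # A comma-separated host list is used here instead of the
    # compressed fluke[...] form; that choice is an assumption.
    print("--requires=-host:" + ",".join(critical_hosts(hosts, 4)))
```

With 23 nodes and a fanout of 4 this gives 6 critical ranks, which lines up with the six hosts in the exclude list above. For 64 application nodes and a fanout of 16 the sketch yields 69, not the 70 the script reports, so the script evidently uses a slightly different rule.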