Open grondo opened 2 years ago
Perhaps another useful feature would be making exclusion of the router nodes (if any) and the rank 0 node easier when launching jobs within the allocation. For example, suppose one has a buggy Lustre (🙄) - the nodes where I/O is performed will be more likely to crash than nodes that do not do I/O. So running user tasks on non-critical nodes would improve resilience.
Would an option like --requires=-critical for flux mini run / flux mini submit work?
Also, setting the broker attribute tbon.fanout=0 may not be practical for very large allocations (say, a thousand nodes), and even for a smaller allocation the rank 0 node is critical. So perhaps --resilient, when applied to flux mini alloc or flux mini batch, could also calculate the number of critical nodes required given the user request and the fanout, and allocate that many extra nodes if they are available.
In case I'm unclear, my cheesy script to facilitate this right now looks like this:
[faaland1@fluke108 branch:main flux-usage] $./flux-alloc-resilient calc_size --app 64 --fanout 16
request 70 nodes by providing flux mini arguments
-N70 --broker-opts=-Stbon.topo=kary:16
[faaland1@fluke108 ~] $flux mini alloc -N23 --broker-opts=-Stbon.topo=kary:4
flux-job: ƒkwfnGGNvSB started
[faaland1@fluke6 branch:main flux-usage] $flux resource list
STATE PROPERTIES NNODES NCORES NGPUS NODELIST
free batch 23 92 0 fluke[6,8,10,12-13,16,19-21,25-29,32-38,41-42]
allocated 0 0 0
down 0 0 0
[faaland1@fluke6 branch:main flux-usage] $./flux-alloc-resilient exclude_list
run application on non-critical nodes by providing flux mini argument
--requires=-host:fluke[6,8,10,12-13,16]
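For anyone who wants to play with the arithmetic, here is a rough Python sketch of the calculation. It is not the flux-alloc-resilient script, and its numbers may differ from that script's output. It assumes that in a kary:k topology the children of rank i are ranks k*i+1 through k*i+k, and that broker ranks follow the order of the node list; neither assumption is confirmed in this thread. The function names are made up for illustration.

```python
def critical_ranks(size: int, k: int) -> list[int]:
    """Ranks that have at least one child in a k-ary tree of `size` brokers.

    ASSUMPTION: rank i's children are k*i + 1 .. k*i + k.
    Rank 0 is always treated as critical, even in a one-node instance.
    """
    if size < 1 or k < 2:
        raise ValueError("need size >= 1 and fanout k >= 2")
    # Rank i is a router if its first child, k*i + 1, is a valid rank.
    ranks = [i for i in range(size) if k * i + 1 < size]
    return ranks or [0]


def nodes_needed(app_nodes: int, k: int) -> int:
    """Smallest instance size that leaves `app_nodes` non-critical nodes."""
    if app_nodes < 1:
        raise ValueError("need at least one application node")
    size = app_nodes + 1  # at minimum, rank 0 is extra
    while size - len(critical_ranks(size, k)) < app_nodes:
        size += 1
    return size


def exclude_hosts(hosts: list[str], k: int) -> list[str]:
    """Hostnames of the critical ranks.

    ASSUMPTION: hosts[i] is the node running broker rank i.
    """
    return [hosts[r] for r in critical_ranks(len(hosts), k)]
```

Under these assumptions, a 23-node instance with fanout 4 has six critical ranks (0 through 5), which is consistent with the six hosts in the exclude list above. Compressing the returned hostnames into hostlist form for --requires=-host:... is left out.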
In order to fully solve #4417 we need to add an easy way to configure various options for the job and child instance which allow full resiliency to be realized. This option could be flux mini batch --resilient or a different option (not sure if we'd want to allow levels of resiliency). We also need to collect the list of options and tunings that this would enable, e.g. tbon.fanout=0 and -o exit-timeout=none.