flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

Short-term work in support of UQ workload with flux-sched #120

Closed dongahn closed 7 years ago

dongahn commented 8 years ago

This is just to summarize my take on a set of investigation points that can position flux-sched to cope better with uncertainty-quantification (UQ) workloads, an emerging workload type that is becoming increasingly important for the DOE labs. These came from looking more closely into MiniUQPipeline (MiniUQP), talking to a UQPipeline (UQP) developer again, and doing some literature search on the production solutions that aim to support this type of workload.

MiniUQP is a mock-up program that Luke Johnston (Tammy Dahlgren’s summer intern) put together in order to express to Flux, using much simpler software, the scheduling behavior of UQP as well as its resource-management and scheduler needs. The state-of-the-art production solutions all center on the notion of a job array, which is supported by many production schedulers (e.g., PBS, PBS Pro, LSF, and SLURM).

My high-level summary is three-fold:

The following details each of them.

dongahn commented 8 years ago

Technical Challenges

High job throughput

We are all already familiar with this challenge. UQ users often want to run massive numbers of short-running jobs, which can put the batch scheduler on the critical path to overall job throughput, choke it with resource thrashing, and even cause it to stop working due to resource exhaustion. Perhaps more importantly, this can also degrade global scheduling performance, since each small job must bubble up to the centralized batch system to be tracked.

A classic example arises when a UQ user submits a request to UQP with a large ensemble of not-so-demanding runs (e.g., a 1-D simulation). A user can easily request 20,000 sampling points, each requiring only a 5-second simulation.
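To make the scale concrete, a back-of-the-envelope sketch of that ensemble. The per-job scheduler overhead here is a hypothetical figure chosen purely for illustration, not a measurement of any scheduler:

```python
# Back-of-the-envelope arithmetic for the ensemble above: 20,000 runs of
# 5 seconds each.
n_jobs = 20_000
run_secs = 5            # per-run simulation time
sched_overhead = 2      # hypothetical per-job scheduler cost, in seconds

useful = n_jobs * run_secs          # CPU-seconds of actual simulation
overhead = n_jobs * sched_overhead  # time spent submitting/tracking jobs

print(f"useful work: {useful} s")
print(f"overhead:    {overhead} s ({overhead / (useful + overhead):.0%} of total)")
```

Even a small fixed per-job cost, multiplied by tens of thousands of jobs, becomes a substantial fraction of the total time.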

Because of some of the technical issues described above, UQP currently uses or is investigating several ad hoc mechanisms: e.g., the code_concurrency parameter to throttle the number of concurrent runs, an environment variable to adjust the wake-up time of UQP's scheduler loop to reduce thrashing, and cram to bundle up the runs so as to avoid resource exhaustion.

MiniUQP can be configured to demonstrate these issues, and Flux scheduling can help solve this problem more elegantly.
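The throttling idea behind a parameter like code_concurrency can be sketched in a few lines. The names below are illustrative, not UQP's actual code; a bounded worker pool plays the role of the throttle:

```python
from concurrent.futures import ThreadPoolExecutor

CODE_CONCURRENCY = 4  # cap on simultaneous runs, like UQP's code_concurrency

def run_sample(index):
    """Stand-in for launching one ensemble member."""
    return index * index  # pretend this is the simulation result

# The executor's worker cap acts as the throttle: no more than
# CODE_CONCURRENCY runs are in flight at once, regardless of ensemble size.
with ThreadPoolExecutor(max_workers=CODE_CONCURRENCY) as pool:
    results = list(pool.map(run_sample, range(20)))

print(results[:5])  # → [0, 1, 4, 9, 16]
```

A scheduler-level solution would make this cap unnecessary in the client, since the scheduler itself would absorb and pace the full backlog of runs.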

Affinity control

The most commonly used mode of UQP is so-called batch-interactive. In this mode, a user submits the UQP manager itself into a batch system. Once the requested allocation is granted, the UQP manager runs the ensemble of jobs, each with a separate launch invocation (e.g., srun) -- let's call each invocation a job step, borrowing commonly used terminology.

In MiniUQP, each step appears to rely on the default affinity policy (e.g., auto-affinity in SLURM), and this can often lead to poor performance. In UQP, they create and manage sub-slots within an allocation and confine each step to some set of these custom slots.

There is much room for improvement here: UQP should not have to manage this complexity on its own, and should instead simply use a scheduler-provided policy that governs the binding of many steps within the allocation.

In fact, the MiniUQP documentation notes that managing this complexity at the UQP level is not good practice and that once Flux provides good primitives, they can latch onto them.

MiniUQP can be configured to demonstrate some of these issues, and Flux scheduling can provide an elegant solution.
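The sub-slot bookkeeping described above can be sketched as follows. This is an illustrative toy, not UQP's actual code; the allocation size, slot size, and round-robin placement are all assumptions made for the example:

```python
# Carve an allocation of CPUs into fixed-size sub-slots and pin each
# job step to one of them -- the kind of logic UQP manages by hand today.
ALLOC_CPUS = list(range(16))   # CPU ids in the allocation
CPUS_PER_STEP = 4

# Partition the allocation into disjoint sub-slots.
sub_slots = [ALLOC_CPUS[i:i + CPUS_PER_STEP]
             for i in range(0, len(ALLOC_CPUS), CPUS_PER_STEP)]

def slot_for_step(step_id):
    """Round-robin steps over the sub-slots; a scheduler-provided
    binding policy would replace hand-rolled logic like this."""
    return sub_slots[step_id % len(sub_slots)]

print(slot_for_step(0))  # → [0, 1, 2, 3]
print(slot_for_step(5))  # → [4, 5, 6, 7]
```

The point is not the ten lines themselves but that every UQP-like tool must otherwise reinvent them, along with the corner cases (uneven allocations, failed steps, NUMA layout) they gloss over.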

File system

UQP can create a large number of files. Consider an ensemble of N runs. UQP creates 2N directories and between 6 and roughly a dozen files per run -- the exact file count per run differs depending on which UQP operation mode is used.

These are by and large small files, and this, in combination with the K files that the application itself produces per run, can place a high load on the underlying file system. This can be particularly problematic on Lustre, where a metadata storm significantly affects global system performance.
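The file-count arithmetic above, applied to the 20,000-run ensemble from the earlier example. The per-run application file count K is a hypothetical value for illustration:

```python
# For an ensemble of N runs, UQP creates 2N directories and 6 to ~12
# files per run, on top of the K files the application writes per run.
N = 20_000            # ensemble size from the earlier example
K = 3                 # hypothetical per-run application file count

dirs = 2 * N
uqp_files_low, uqp_files_high = 6 * N, 12 * N
app_files = K * N

print(f"directories:        {dirs}")
print(f"UQP files:          {uqp_files_low} .. {uqp_files_high}")
print(f"application files:  {app_files}")
```

Even at the low end, that is on the order of a couple hundred thousand metadata operations for a single ensemble.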

While the resource manager and scheduler cannot help reduce the number of files produced by the application, they can help reduce the number produced by UQP.

MiniUQP can be configured to demonstrate this challenge, and a deeper integration between UQP and Flux can alleviate it.

dongahn commented 8 years ago

State of the Art: Job Array

The de facto support for UQ-type workloads, implemented in many production schedulers (e.g., PBS, PBS Pro, LSF, and SLURM), is the job array.

Essentially, a user describes many similar jobs with a single script using a job-array syntax and submits it to the scheduler. The scheduler then invokes many run instances, passing each a slightly different parameter set and environment.

The core parameter of a job array is its index value. It is somewhat analogous to an MPI rank: the scheduler passes each invocation its rank within the submitted job array. Using this runtime index, the job script can differentiate what each job must do (e.g., change to the working directory where job-specific input data resides).

All of the schedulers that support job arrays provide ways to describe a non-unit stride in the index space, and some advanced ones also provide ways to throttle the number of concurrently running jobs.
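In SLURM, for example, `sbatch --array=0-99:2%10` submits array elements 0, 2, ..., 98 (a stride of 2) with at most 10 running at once, and each element receives its index via the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch of how a script uses that index; the default of 0 is only so the sketch runs outside a batch allocation:

```python
import os

# Differentiate per-job work using the runtime array index.
idx = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))

# A typical use: select the input directory for this array element.
workdir = f"run_{idx:05d}"
print(f"array index {idx} -> working directory {workdir}")
```

Everything else about the array -- its size, stride, and throttle -- is fixed at submission time, which is exactly the rigidity the next paragraph criticizes.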

However, the job-array approach falls short of addressing the technical challenges presented by UQ workloads. The main issue is that a job array is too static, and thus too inflexible, to handle the dynamism today's UQ loads require. For example, a UQ workload may need to add more sampling points (and thus more jobs) as some runs complete. UQP also needs to re-run jobs that fail, which cannot easily be expressed with a job array. And the list goes on.

dongahn commented 8 years ago

Short-term Investigation

There is a set of short-term investigation steps within flux-sched that would position us better to deal with the UQ challenges. Most of them need to be done anyway as part of our next steps (e.g., performance analysis and tuning for flux-sched). The effort should yield quantifiable benefits over other approaches and should streamline other longer-term efforts.

Short-term study plan:

Streamlining longer-term investigation:

dongahn commented 7 years ago

Most of the above has been investigated as part of our fully hierarchical scheduling research.