flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

Short-term work in support of UQ workload with flux-sched #120

Closed dongahn closed 7 years ago

dongahn commented 8 years ago

This is just to summarize my take on a set of investigation points that can position flux-sched to cope better with uncertainty-quantification (UQ) workloads, an emerging workload type that is becoming increasingly important for the DOE labs. These came from looking more closely into MiniUQPipeline (MiniUQP), talking to a UQPipeline (UQP) developer again, and doing some literature search on the production solutions that aim to support this type of workload.

MiniUQP is a mock-up program that Luke Johnston (Tammy Dahlgren’s summer intern) put together in order to express to Flux, using much simpler software, the scheduling behavior of UQP as well as its resource-management and scheduler needs. The state-of-the-art production solutions all center on the notion of a job array, which is supported by many production schedulers (e.g., PBS, PBS Pro, LSF, and SLURM).

My high-level summary is three-fold:

The following details each of them.

dongahn commented 8 years ago

Technical Challenges

High job throughput

We are all already familiar with this challenge. UQ users often want to run massive numbers of short-running jobs, which can put the batch scheduler on the critical path to overall job throughput, choke it with resource thrashing, and even cause it to stop working due to resource exhaustion. Perhaps more importantly, this can also degrade global scheduling performance, since each small job must bubble up to the centralized batch system to be tracked.

A classic example arises when a UQ user submits a request to UQP with a large ensemble of not-so-demanding runs (e.g., a 1-D simulation). A user can easily request 20,000 sampling points, each requiring only a 5-second simulation.
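To make the scale concrete, a back-of-the-envelope sketch of that ensemble. The per-job scheduler overhead here is a hypothetical figure chosen purely for illustration, not a measurement of any scheduler:

```python
# Back-of-the-envelope arithmetic for the ensemble above: 20,000 runs of
# 5 seconds each.
n_jobs = 20_000
run_secs = 5            # per-run simulation time
sched_overhead = 2      # hypothetical per-job scheduler cost, in seconds

useful = n_jobs * run_secs          # CPU-seconds of actual simulation
overhead = n_jobs * sched_overhead  # time spent submitting/tracking jobs

print(f"useful work: {useful} s")
print(f"overhead:    {overhead} s ({overhead / (useful + overhead):.0%} of total)")
```

Even a small fixed per-job cost, multiplied by tens of thousands of jobs, becomes a substantial fraction of the total time.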

Because of some of the technical issues described above, UQP currently uses or is investigating several ad hoc mechanisms: e.g., the code_concurrency parameter to throttle the number of concurrent runs, an environment variable to adjust the wake-up time of UQP's scheduler loop to reduce thrashing, and cram to bundle up the runs so as to avoid resource exhaustion.

MiniUQP can be configured to demonstrate these issues, and Flux scheduling can help solve this problem more elegantly.
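The throttling idea behind a parameter like code_concurrency can be sketched in a few lines. The names below are illustrative, not UQP's actual code; a bounded worker pool plays the role of the throttle:

```python
from concurrent.futures import ThreadPoolExecutor

CODE_CONCURRENCY = 4  # cap on simultaneous runs, like UQP's code_concurrency

def run_sample(index):
    """Stand-in for launching one ensemble member."""
    return index * index  # pretend this is the simulation result

# The executor's worker cap acts as the throttle: no more than
# CODE_CONCURRENCY runs are in flight at once, regardless of ensemble size.
with ThreadPoolExecutor(max_workers=CODE_CONCURRENCY) as pool:
    results = list(pool.map(run_sample, range(20)))

print(results[:5])  # → [0, 1, 4, 9, 16]
```

A scheduler-level solution would make this cap unnecessary in the client, since the scheduler itself would absorb and pace the full backlog of runs.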

Affinity control

The most commonly used mode of UQP is so-called batch-interactive. In this mode, a user submits the UQP manager itself into a batch system. Once the requested allocation is granted, the UQP manager runs the ensemble of jobs, each with a separate launch invocation (e.g., srun) -- let's call each invocation a job step, borrowing commonly used terminology.

In MiniUQP, each step appears to rely on the default affinity policy (e.g., auto-affinity in SLURM), and this can often lead to poor performance. In UQP, they create and manage sub-slots within an allocation and confine each step to some set of these custom slots.

There is much room for improvement here: UQP should not have to manage this complexity on its own, and should instead simply use a scheduler-provided policy that governs the binding of many steps within the allocation.

In fact, the MiniUQP documentation notes that managing this complexity at the UQP level is not good practice and that once Flux provides good primitives, they can latch onto them.

MiniUQP can be configured to demonstrate some of these issues, and Flux scheduling can provide an elegant solution.
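The sub-slot bookkeeping described above can be sketched as follows. This is an illustrative toy, not UQP's actual code; the allocation size, slot size, and round-robin placement are all assumptions made for the example:

```python
# Carve an allocation of CPUs into fixed-size sub-slots and pin each
# job step to one of them -- the kind of logic UQP manages by hand today.
ALLOC_CPUS = list(range(16))   # CPU ids in the allocation
CPUS_PER_STEP = 4

# Partition the allocation into disjoint sub-slots.
sub_slots = [ALLOC_CPUS[i:i + CPUS_PER_STEP]
             for i in range(0, len(ALLOC_CPUS), CPUS_PER_STEP)]

def slot_for_step(step_id):
    """Round-robin steps over the sub-slots; a scheduler-provided
    binding policy would replace hand-rolled logic like this."""
    return sub_slots[step_id % len(sub_slots)]

print(slot_for_step(0))  # → [0, 1, 2, 3]
print(slot_for_step(5))  # → [4, 5, 6, 7]
```

The point is not the ten lines themselves but that every UQP-like tool must otherwise reinvent them, along with the corner cases (uneven allocations, failed steps, NUMA layout) they gloss over.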

File system

UQP can create a large number of files. Consider an ensemble of N runs. UQP creates 2N directories and between 6 and roughly a dozen files per run -- the exact file count per run differs depending on which UQP operation mode is used.

These are by and large small files, and this, in combination with the K files that the application itself produces per run, can place a high load on the underlying file system. This can be particularly problematic on Lustre, where a metadata storm significantly affects global system performance.
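The file-count arithmetic above, applied to the 20,000-run ensemble from the earlier example. The per-run application file count K is a hypothetical value for illustration:

```python
# For an ensemble of N runs, UQP creates 2N directories and 6 to ~12
# files per run, on top of the K files the application writes per run.
N = 20_000            # ensemble size from the earlier example
K = 3                 # hypothetical per-run application file count

dirs = 2 * N
uqp_files_low, uqp_files_high = 6 * N, 12 * N
app_files = K * N

print(f"directories:        {dirs}")
print(f"UQP files:          {uqp_files_low} .. {uqp_files_high}")
print(f"application files:  {app_files}")
```

Even at the low end, that is on the order of a couple hundred thousand metadata operations for a single ensemble.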

While the resource manager and scheduler cannot help reduce the number of files produced by the application, they can help reduce the number produced by UQP.

MiniUQP can be configured to demonstrate this challenge, and a deeper integration between UQP and Flux can alleviate it.

dongahn commented 8 years ago

State of the Art: Job Array

The de facto support for UQ-type workloads, implemented in many production schedulers (e.g., PBS, PBS Pro, LSF, and SLURM), is the job array.

Essentially, a user describes many similar jobs with a single script using a job-array syntax and submits it to the scheduler. The scheduler then invokes many run instances, passing each a slightly different parameter set and environment.

The core parameter of a job array is its index value. It is somewhat analogous to an MPI rank: the scheduler passes each invocation its rank within the submitted job array. Using this runtime index, the job script can differentiate what each job must do (e.g., change to the working directory where job-specific input data resides).

All of the schedulers that support job arrays provide ways to describe a non-unit stride in the index space, and some advanced ones also provide ways to throttle the number of concurrently running jobs.
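In SLURM, for example, `sbatch --array=0-99:2%10` submits array elements 0, 2, ..., 98 (a stride of 2) with at most 10 running at once, and each element receives its index via the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch of how a script uses that index; the default of 0 is only so the sketch runs outside a batch allocation:

```python
import os

# Differentiate per-job work using the runtime array index.
idx = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))

# A typical use: select the input directory for this array element.
workdir = f"run_{idx:05d}"
print(f"array index {idx} -> working directory {workdir}")
```

Everything else about the array -- its size, stride, and throttle -- is fixed at submission time, which is exactly the rigidity the next paragraph criticizes.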

However, the job-array approach falls short of addressing the technical challenges presented by UQ workloads. The main issue is that a job array is too static, and thus too inflexible, to handle the dynamism today's UQ loads require. For example, a UQ workload may need to add more sampling points (and thus more jobs) as some runs complete. UQP also needs to re-run jobs that fail, which cannot easily be expressed with a job array. And the list goes on.

dongahn commented 8 years ago

Short-term Investigation

There is a set of short-term investigation steps within flux-sched that would position us better to deal with the UQ challenges. Most of them need to be done anyway as part of our next steps (e.g., performance analysis and tuning for flux-sched). The effort should yield quantifiable benefits over other approaches and should streamline other longer-term efforts.

Short-term study plan:

Streamlining longer-term investigation:

dongahn commented 7 years ago

Most of the above has been investigated as part of our fully hierarchical scheduling research.