flux-framework / distribution

smallest serviceable slurm substitute #6

Open garlick opened 8 years ago

garlick commented 8 years ago

Smallest Serviceable Slurm Substitute

What follows are the requirements to replace the SLURM version currently in use at LC, not a wish list for the perfect batch system. The requirements are listed as bullet items with minimal text describing each item. This assumes an understanding of SLURM and its features; for further details, consult the SLURM man pages. References to SLURM commands are listed where appropriate. New features in SLURM versions beyond v2.3.3 are not listed.

  1. Task launch
    • Specify number of tasks
    • Specify resources (at least nodes and cores)
      • Number of resources (e.g., 4 nodes)
      • Including ranges (e.g., 4-8 nodes)
      • Named resources (e.g., cluster, node[4-8], core[0-3])
      • Memory size
      • Generic resources
      • Features
    • Task distribution:
      • Cyclic
      • Block
      • Plane
      • Custom (based on a configuration file)
    • Task to resource mapping
      • Number of tasks per node (or core)
      • Number of cores per task
    • Hardware threading (desired? allowed? disabled?)
    • Task containment - confine tasks to allocated resources: sockets, cores, memory
    • Wall clock limit
    • Task prolog and epilog options
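
As a concrete point of reference, the task-launch requirements above map onto an srun invocation along these lines. This is a minimal sketch using standard SLURM options; the node/task counts and application name are hypothetical:

```sh
# 16 tasks across 4 nodes, 4 tasks per node, 2 cores per task, block
# distribution (cyclic, plane, and arbitrary are also supported),
# 2048 MB per core, and a one-hour wall clock limit.
srun --nodes=4 --ntasks=16 --ntasks-per-node=4 --cpus-per-task=2 \
     --distribution=block --mem-per-cpu=2048 --time=01:00:00 \
     ./my_app
```
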
  2. Resource management
    • Resources managed: Clusters, nodes, sockets, cores, threads, memory, GPUs, burst buffers, file systems, licenses, etc.
    • Add and remove resources from management
    • Report and change status of resources: up, down, draining, allocated, idle
    • Resource pools (aka partitions, queues)
    • Resource weights (govern selection priority)
    • Resource sharing allowed (if so, to what degree?)
    • Network topology
      • Contiguous resources
      • Switch topology
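
For reference, the status-change requirement above corresponds to SLURM's scontrol interface; a minimal sketch, with a hypothetical node name and reason string:

```sh
# Drain a node (let running work finish, accept no new work), then
# return it to service.
scontrol update NodeName=node42 State=DRAIN Reason="bad DIMM"
scontrol update NodeName=node42 State=RESUME
```
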
  3. Resource status (sinfo)
    • Summary of nodes and states (idle, allocated, down, draining)
    • Summary for each node partition
    • Rich reports of specific resources
      • By node (scontrol show node)
      • By partition (scontrol show partition)
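
These reports correspond to the following SLURM commands; the node and partition names are hypothetical:

```sh
sinfo                           # one-line summary per partition/state
sinfo --summarize               # aggregate node counts per partition
scontrol show node node42       # rich report for a specific node
scontrol show partition pbatch  # rich report for a specific partition
```
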
  4. Job Specification (see the sample batch script after item 5)
    • Job category
      • Batch script (sbatch)
      • Interactive (salloc)
      • Includes xterm request (mxterm / sxterm)
      • Single job step as job (srun)
    • User / group
    • Bank account
    • Workload characterization key
    • Min/max run times
    • Priority (includes nice factor if any)
    • QoS
    • Queue
    • Resource requirements
      • Min/Max node counts
      • Features, tags, processor architecture, processor speed
      • (Minimum or specific) memory per (socket or node)
      • (Minimum or specific) (sockets or cores) per node
      • Tasks per node (or core)
      • Cores per task
      • Shared or exclusive
      • Preferred network topology / node contiguity
      • Licenses
      • File systems
      • Installed packages and libraries
    • Allocated resources
      • By count (e.g., number of nodes and cores)
      • By name (e.g., node names, CPUs, GPUs, etc.)
      • Node on which batch script is running
    • State (includes reason for not running)
    • Dependency (other job(s) starting/completing/exit code)
    • Reservation
    • Prolog and Epilog
    • Re-queue request
      • If preempted
      • If resource fails
    • Terminate (or continue) on resource failure
    • Times
      • Submit time
      • Start-after time
      • Estimated start time
      • Actual start time
      • Run time limit
      • Actual run time
      • Terminate time
    • Exit Status (includes if signaled and by which signal)
    • Job run info
      • Job name
      • Command
      • Working directory
      • Standard In / Out / Error
      • Batch script
  5. Job Submission
    • Option to intercept submit request and alter, override, or insert policy-related options
    • Job submission fails at submit time (as opposed to run time) when invalid options are specified
    • (Pound) directive support in batch script (e.g., #SBATCH -N) as optional means to convey job specifications
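
Taken together, items 4 and 5 amount to supporting batch scripts of roughly the following shape. This is a sketch using standard sbatch directives; the account name, dependency job id, and application are hypothetical:

```sh
#!/bin/sh
#SBATCH --job-name=sample
#SBATCH --account=physics           # bank account
#SBATCH --nodes=2                   # resource requirements
#SBATCH --time=00:30:00             # run time limit
#SBATCH --dependency=afterok:1234   # start after job 1234 succeeds
#SBATCH --mail-type=END,FAIL        # email on state transitions (item 13)
#SBATCH --requeue                   # re-queue if preempted or a node fails
srun ./my_app
```
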

  6. Job status
    • One-line job summary (squeue)
      • Queued as well as running jobs
      • Includes jobs of other users
    • Verbose job record report (scontrol show job)
    • Job step reports
    • Includes record of associated batch script
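
The corresponding SLURM commands, with a hypothetical user name and job id:

```sh
squeue                  # one-line summaries: queued and running, all users
squeue -u alice         # restrict the report to one user's jobs
scontrol show job 1234  # verbose record for a single job
```
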
  7. Job control
    • Job removal and signaling (scancel)
    • Job signal prior to termination (per specified grace time)
    • Job modification (scontrol update job)
    • Job hold/release
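
A sketch of the equivalent SLURM controls (the job id is hypothetical):

```sh
scontrol hold 1234                            # hold a pending job
scontrol release 1234                         # release it
scontrol update JobId=1234 TimeLimit=02:00:00 # modify a job parameter
scancel --signal=USR1 1234                    # signal without removing the job
scancel 1234                                  # remove (cancel) the job
```
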
  8. Job prioritization factors
    • Fair share
    • Job size (favoring large or small)
    • Queued time (FIFO)
    • QoS contribution
    • Queue contribution
    • User nicing
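
SLURM's multifactor priority plugin exposes these factors per job through sprio; for example:

```sh
sprio -l   # per-job priority contributions: age (queued time), fair share,
           # job size, QoS, partition, and nice value
sprio -w   # the configured weight assigned to each factor
```
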
  9. Scheduling (starting with a prioritized queue)
    • Matches job’s requests with available resources
    • Supports multiple rules for resource selection:
      • Best fit
      • First fit
      • Balanced workload
    • Job submission requires a bank account and user permission to use that account
    • Honors time and resource size limits imposed by
      • Queue
      • QoS
      • User/Bank
    • Imposes limits on
      • Number of jobs that can be queued at any given time
      • Number of jobs that can be running at any given time
    • Accommodates sharing requests and allowed sharing levels
    • Waits a specified time to accommodate a node topology request
    • Backfill option
      • Conservative backfill: no higher-priority job is delayed
      • EASY backfill: just the top-priority job cannot be delayed
    • Provides estimated start times
    • Considers jobs for multiple queues
    • Supports job dependencies from other clusters
    • Provides job preemption based on QoS or queue; the preemption action can be
      • Suspension
      • Checkpoint
      • Terminate and Re-queue
      • Terminate
    • Supports job growth and shrinkage
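
For comparison, SLURM selects among several of these behaviors in slurm.conf; a hypothetical excerpt enabling backfill and core-level (shared-node) allocation:

```sh
# slurm.conf excerpt (hypothetical configuration)
SchedulerType=sched/backfill         # start lower-priority jobs early when
                                     # they do not delay higher-priority jobs
SelectType=select/cons_res           # consumable resources: allocate cores
SelectTypeParameters=CR_Core_Memory  # and memory, permitting node sharing
```
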
  10. Quality of Service
    • Affects job priority
    • Allows exemptions from time and size limits
    • Can impose an associated set of time and size limits
    • Can amplify or dampen the usage charges
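
In SLURM, a QoS with these properties might be defined as follows; the name and limit values are hypothetical:

```sh
# Raise priority, impose a wall clock limit, and double the usage charge.
sacctmgr add qos expedite Priority=1000 MaxWall=04:00:00 UsageFactor=2.0
```
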
  11. Bank Accounts
    • Fundamental to a user's permission to submit jobs
    • Reflects the sponsors’ claim to the cluster’s resources (i.e., the shares in fair share)
    • Can impose an associated set of time and size limits
  12. Reservations
    • Resources can be reserved in advance (e.g., for dedicated application times, DATs)
    • Permitted jobs can run within those reservations
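
The SLURM equivalent; the reservation name, start time, user, and node count are hypothetical:

```sh
# Reserve 16 nodes in advance for a DAT.
scontrol create reservation ReservationName=dat1 \
    StartTime=2016-07-01T08:00:00 Duration=12:00:00 Users=alice NodeCnt=16
# Run a permitted job within the reservation.
sbatch --reservation=dat1 job.sh
```
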
  13. Email user at job state transitions
    • Begin
    • End
    • Fail
    • Re-queue
    • All
  14. Resource accounting
    • Resource utilization (sreport)
    • Times reported for specified time periods under the following categories:
      • Allocated
      • Idle
      • Reserved
      • System maintenance
      • Unplanned down time
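
sreport produces this breakdown for a prescribed period; the dates here are illustrative:

```sh
sreport cluster utilization start=2016-06-01 end=2016-06-30
```
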
  15. Job accounting
    • Individual job records (sacct)
      • Job and job step records for a prescribed time period
      • Includes most of the job parameters listed in Job Specification above
    • Composite job reports (sreport)
      • Aggregate job reports based on user, account, and workload characterization key
      • Over a prescribed time period
      • Includes listing of top users and top accounts
      • Includes reports by job size
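
The corresponding SLURM commands, with a hypothetical user name and illustrative dates:

```sh
# Individual job and job-step records for a prescribed time period.
sacct --starttime=2016-06-01 --endtime=2016-06-30 --user=alice
# Composite report: top users by usage over the same period.
sreport user topusage start=2016-06-01 end=2016-06-30
```
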
  16. Security
    • Jobs can only be run by submitting user
    • Job output can only be seen by submitting user
    • System parameters can only be changed by authorized roles (see next item)
  17. Administration
    • Role-based system administration and overrides
      • User can monitor and alter some of their own job parameters
      • Operator can alter other users’ job parameters
      • Coordinator can populate bank account memberships and limits
      • Administrator can do all above and alter resource definitions
  18. User/bank management (sacctmgr)
    • Cluster/partition/user/bank granularity
    • Implicit permission to use bank
    • Limits imposed at each level of the hierarchy
    • Limits include:
      • Max number of jobs running at any time in bank
      • Max number of nodes for any jobs running in bank
      • Max number of CPUs for any jobs running in bank
      • Max number of pending plus running jobs at any time in bank
      • Max wall clock time each job in bank can run
      • Max (CPU*minutes) each job in bank can run
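
A sketch of the sacctmgr workflow this implies, with hypothetical cluster, bank, and user names:

```sh
# Create a bank on a cluster and add a user to it (implicit permission).
sacctmgr add account physics Cluster=cab Description="physics bank" \
    Organization=lc
sacctmgr add user alice Account=physics
# Impose limits at the bank level of the hierarchy.
sacctmgr modify account where name=physics set MaxJobs=50 MaxSubmitJobs=200 \
    MaxNodes=128 MaxWall=24:00:00 MaxCPUMins=1000000
```
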
  19. System
    • Save state and recover on restart
      • Resources
      • Jobs
      • Usage statistics
      • System can be restarted without losing queued jobs or killing running jobs
    • Reliability
      • High-availability backup to take over when the primary dies or hangs
      • Resilient: able to adapt to failing or failed resources
      • 24x7 operation
      • System updates possible on a live system without losing queued or running jobs
    • Robust
      • Atomic changes
      • System can never get in a corrupt or inconsistent state
      • Complete recovery after crashes
    • Performance
      • Response to user commands in under one to two seconds
      • Scheduling loops complete in under one minute
    • Scalability
      • Thousands of jobs
      • Thousands of resources
      • Thousands of users
    • Visibility
      • Pertinent info is logged
      • System diagnostics facilitate a quick discovery of what went wrong
    • Configuration
      • System configuration read from file or database
      • System configuration parameters can be changed live
  20. API
    • Library to retrieve remaining time (libyogrt)
    • Interface to lorenz
  21. Environment Variables
    • Support for user-defined environment variables used to input job specifications (e.g., SBATCH_ACCOUNT)
    • System inserts variables into the execution environment to be used by user's script or application (e.g., SLURM_JOB_ID)
    • Option to convey some or all of the user's environment variables to the runtime execution environment
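
For reference, all three behaviors exist in SLURM today; the bank name is hypothetical:

```sh
# Input: a user-defined variable consumed by sbatch at submit time.
export SBATCH_ACCOUNT=physics
sbatch job.sh
# Output: variables the system inserts into the execution environment.
echo "job $SLURM_JOB_ID is running on $SLURM_JOB_NODELIST"
# Propagation: convey some or all of the submit environment to the job.
sbatch --export=ALL job.sh
```
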
lipari commented 8 years ago

In an effort to help our users transition to using LSF on our CORAL systems, I have created a translation guide that compares the options for submitting jobs to batch schedulers that LC currently supports or has supported in the past. While this is not directly relevant to Flux development, it should serve as a good reference as we work to build out Flux functionality to replace SLURM.

garlick commented 8 years ago

Cool, nice work @lipari.

dongahn commented 8 years ago

Great! bsub doesn't have a way to specify the number of nodes? Do you want to include the options for burst buffer requests, since users may want to use burst buffers for their checkpoint and restart needs? E.g., sbatch now has --bb. What will be the corresponding LSF option(s)?

lipari commented 8 years ago

The doc is somewhat a work in progress from the LSF side. I forwarded a copy to our LSF contacts and asked them to help add their expertise to making the LSF content more accurate and current. So, specifying BBs and GPUs in LSF will be forthcoming.

As far as specifying nodes goes, no, bsub does not have a direct analog to requesting nodes. LSF has a default slot definition of a core, and specifying tasks gets you that many cores, regardless of which nodes are allocated to the job. There is a way to alter the default slot definition, but I held off adding too much complexity to the table, to keep the minutiae from clouding the message.

lipari commented 7 years ago

A fresh look at these requirements was added as https://github.com/flux-framework/distribution/issues/18.