flux-framework / flux-core

core services for the Flux resource management framework

use cases for partition + qos or queues #4306

Open · ryanday36 opened this issue 2 years ago

ryanday36 commented 2 years ago

Here are the basic things that we do with the combination of partitions and qos in Slurm, and just queues in LSF:

  1. set default limits on all user jobs that are trying to access a given set of nodes. E.g. all users with a current bank have access to a ‘pbatch’ partition + ‘normal’ qos (Slurm) or just a ‘pbatch’ queue (LSF), which is a set of nodes with certain limits on max job size, time limit, etc., as well as a baseline priority factor. They will also have access to a ‘pdebug’ partition (Slurm) or queue (LSF), which is a different set of nodes with different limits. (A rough Slurm sketch of items 1-5 appears after this list.)
  2. allow some users to ignore those limits. E.g. users / jobs may be given access to an ‘exempt’ qos (Slurm) which overrides the pbatch partition limits or an ‘exempt’ queue (LSF) which starts jobs on the same set of nodes as pbatch, but doesn’t have limits on job size, time limit, etc.
  3. give users a big priority boost. E.g. users / jobs may be given access to an ‘expedite’ qos (Slurm) or ‘expedite’ queue (LSF) which does the same thing as ‘exempt’, but also has a higher baseline priority factor than the ‘normal’ qos (Slurm) or ‘pbatch’ queue (LSF).
  4. restrict access to a given set of nodes. E.g. we often define a ‘pall’ partition (Slurm) or queue (LSF) that has all of the nodes and only give specific users access to it during a DAT.
  5. allow some user jobs to be pre-empted. E.g. expired banks may still be given access to the ‘pbatch’ queue + a ‘standby’ qos (Slurm) or just the ‘standby’ queue (LSF). Jobs in this queue / qos are not subject to the same limits as the ‘normal/exempt/expedite’ qos or queues, but they have a lower baseline priority factor and they can be pre-empted (cancelled) by jobs that are submitted to the other queues.
  6. (stretch goal, not something we currently do at LLNL) maintain a dynamic pool of nodes for interactive use. E.g. this is something that LANL is doing for debug/interactive jobs and is looking at using for CI jobs, but we haven’t implemented it here at LLNL. It actually uses a dynamic reservation in Slurm rather than queues or qos, but I could see it being implemented with queues. The idea is that there are a small number of nodes in a reservation that can only be used by short, small jobs. When a job starts on that reservation, idle nodes get added to the reservation (up to some maximum size) so that there are effectively always some idle nodes available for interactive / debug use. (See the reservation sketch after the list.)
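
To make items 1-5 concrete, here is roughly how the Slurm side is wired up today. This is not a proposal for Flux syntax; the node ranges, limits, priority values, and flag choices below are made up for illustration, and exact parameter names may differ across Slurm versions, so treat it as pseudocode for the policy rather than a verified config.

```
# slurm.conf (illustrative): partitions define the node sets and default limits
# pbatch: general batch pool with default size/time limits (item 1)
PartitionName=pbatch Nodes=node[0001-1000] Default=YES MaxTime=24:00:00 MaxNodes=256 State=UP
# pdebug: separate node set with tighter limits (item 1)
PartitionName=pdebug Nodes=node[1001-1016] MaxTime=01:00:00 MaxNodes=8 State=UP
# pall: all of the nodes, restricted to a specific group, e.g. during a DAT (item 4)
PartitionName=pall Nodes=node[0001-1016] AllowGroups=dat_users MaxTime=UNLIMITED State=UP

# qos-based preemption so that standby jobs can be cancelled by other work (item 5)
PreemptType=preempt/qos
PreemptMode=CANCEL
```

```
# qos setup, run once by an administrator (priorities are made-up values)
sacctmgr add qos normal
sacctmgr modify qos normal set Priority=10
# exempt: same baseline priority, but allowed to exceed the partition limits (item 2)
sacctmgr add qos exempt
sacctmgr modify qos exempt set Priority=10 Flags=PartitionTimeLimit,PartitionMaxNodes
# expedite: exempt from limits plus a much higher baseline priority (item 3)
sacctmgr add qos expedite
sacctmgr modify qos expedite set Priority=1000 Flags=PartitionTimeLimit,PartitionMaxNodes
# standby: lowest priority and preemptable by the other qos (item 5)
sacctmgr add qos standby
sacctmgr modify qos standby set Priority=0
sacctmgr modify qos normal set Preempt=standby
sacctmgr modify qos exempt set Preempt=standby
sacctmgr modify qos expedite set Preempt=standby
```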

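For item 6, which uses a dynamic reservation rather than queues or qos, I could see it mapping onto a standing reservation with the REPLACE flag, which backfills allocated nodes with idle ones so some idle capacity stays available (I'm not certain that's exactly the mechanism LANL uses). A hedged sketch with made-up names and sizes; the short/small job limits would still come from whatever partition or qos is paired with it:

```
# illustrative only: a small standing reservation for interactive/debug work
scontrol create reservation ReservationName=interactive \
    StartTime=now Duration=UNLIMITED NodeCnt=16 \
    Users=alice,bob Flags=REPLACE
```
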
Generally, users or user+bank combos are given access to specific queues / qos by administrators and can then submit their jobs directly to those queues / with that qos. Administrators can also change the queue / qos of a specific job even if the user doesn’t otherwise have access to that queue / qos.
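
On the Slurm side, both of those administrator actions are just association and job updates; something like the following, where the user, bank, job id, and qos names are placeholders:

```
# grant a user (under a particular bank) access to an additional qos
sacctmgr modify user where name=alice account=bankA set QOS+=expedite

# retarget a job the user has already submitted, even if they could not
# have requested that qos or partition themselves
scontrol update JobId=123456 QOS=expedite
scontrol update JobId=123456 Partition=pall
```

The Flux mechanism would obviously differ; the point is just that both the grant and the per-job override are administrator actions.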

dongahn commented 2 years ago

This is outstanding. I'm summarizing this in my multi-level queue scheduler architecture as we speak. Thanks @ryanday36!