SteVwonder commented 4 years ago

Related to #286, #287, #529, and #637.

As it stands now with Slurm and other schedulers, you can set limits on things like walltime, job size, simultaneous numbers of submitted/running jobs (controlled independently), simultaneous number of nodes used at the granularity of a user, a partition/queue, or a QoS (i.e., job(s) with a particular label applied).

We need to decide A) which of these limits we want to support B) where we want to enforce these limits.

637 lays out our current plans for implementing multiple partitions/queues and implicit in that is controlling the number of nodes assigned to each partition/queue.

I'm not sure if we've documented it anywhere, but we can pretty easily handle the walltime and job size limits at the job-ingest validator. Note: if we enable modifying jobspec post-submission, we will need to re-validate. If anyone disagrees, we should open a separate issue to discuss there.

At the end of #529, it sounds like we want to avoid QoS if possible and instead leverage overlapping queues/partitions. I believe the idea there is to have many queues, each with their own limits, and then when you want to emulate the expedite QoS, you would move the job from the batch queue to the expedite queue (if we want to go that route, we should open a ticket to track it since that will require editing jobspec or a new interface to qmanager).

So the gaps that I see are:

Restricting the maximum number of simultaneously running jobs for a user
- Also technically for a queue too, but would you ever want to restrict the number of running jobs in a queue?
- Should this user restriction apply across the cluster or be per queue? At LC, we look at the number of jobs a user is running in debug (and ignore batch).
Restricting the maximum number of simultaneously submitted jobs for a user
- Also technically for a queue too
- Same as above: should this user restriction apply across the cluster or be per queue? If it's across the cluster, then job-manager could handle this (it already tracks how many jobs a user has pending).
Restricting the maximum number of simultaneously used nodes within a queue for a user
- Again used at LC to restrict the number of nodes a user is using in debug. Not sure about other use-cases.

All of those gaps look to be for controlling the behavior of a "bad actor" (could be trying to exploit debug queue for production work or could be unaware that they are submitting to the wrong queue, too many jobs, etc.) So ultimately, I think this is a relatively low priority item to get implemented during our rollout, but I wanted to bring this up so we at least have it in mind as we design other policy limiting functionality.

dongahn commented 4 years ago

@SteVwonder: I believe this should be a topic we need to discuss at a coffee hour. Maybe we should do this at 2PM today.

SteVwonder commented 4 years ago

Summarizing the coffee time discussion relevant to this issue:

Restricting the number of jobs/nodes per user (either per system or per queue) AND still doing reservations would require pushing the logic down into the resource module and would be expensive in terms of scheduling cost
As a cheaper first cut, we want to start with limiting the number of simultaneously running jobs at the qmanager level. The qmanager will stop considering a user's jobs in the queue once that limit is reached (i.e., their other jobs won't get reservations made until they fall back below the limit).

In terms of how to handle static, per-queue limits, that is a separate issue (https://github.com/flux-framework/flux-sched/issues/642).

dongahn commented 4 years ago

I'd like to summarize our design space for adding limit support in a table.

Folks, please help grow/refine the table. As we discussed, there appear to be two different categories of limits. So I used it as the first dimension in our taxonomy and classify them into static vs. dynamic first.

Then, for each category, there seems to be two major kinds. So I added multi-queue aware vs. multi-queue-agnostic into the second column.

For each kind, we have actual limits. I only capture two static limits (Max job size and walltime) and two dynamic limits (Max running jobs per use and Max aggregate resources per user). If there are other limits we should consider, please grow the table.

I also added a mechanism column to capture how each limit should be implemented based on the discussions so far. And I also added Day 1 column to prioritize which needs to be or can be done for our next milestone.

Finally, I added the proposed semantics when a job hit the corresponding limit.

Limit Category	Multi-Queue	Limit	Mechanism	Handling	Day 1
Static	Multi-Queue Agnostic	?	?	?	?
		?	?	?	?
	Multi-Queue Aware	Max job size	qmanager assisted job-ingest plugin	reject job	:heavy_check_mark:
		Max Walltime	qmanager assisted job-ingest plugin	reject job	:heavy_check_mark:
Dynamic	Multi-Queue Agnostic	Max running jobs per user	fairshare (?)	none (?)	?
		Make aggregate resources per user	fairshare (?)	none (?)	?
	Multi-Queue Aware	Max running jobs per user	qmanager	skip job	:heavy_check_mark:
		Max aggregate resources per user	qmanager	skip job	:heavy_check_mark:

Edit 1: Use "skip" for the handling semantics of dynamic limit.

dongahn commented 4 years ago

@cmoussa1 or @SteVwonder: how does SLURM handle a job that exceeds the dynamic limits? If it rejects the job, then we can model after it and just don't have to worry about that weird interplay between the queuing policy vs. limits.

But I added reject (or skip) for now.

Since we are talking about a "limit", it may just make sense to reject the job. Just like OS nproc limit will reject a process from a user once the user exceeds that limit, we may as well reject the job for simplicity and clarity in our limit handling semantics. Just a thought.

SteVwonder commented 4 years ago

Thanks Dong for the table, I think that helps summarize the situation.

Do we want to support muli-queue aware max aggregate resources per user on day 1? After our recent coffee time discussion, I thought that would require too much complexity and runtime cost for us to be comfortable handling that day 1. IIUC, this would require the qmanager to traverse the entire resource section of jobspec to calculate the aggregate resource requirements or for us to push the logic down into the resource module via "virtual user resources".

how does SLURM handle a job that exceeds the dynamic limits

From the Slurm docs: MaxJobs= The total number of jobs able to run at any given time for the given association. If this limit is reached new jobs will be queued but only allowed to run after previous jobs complete from the association. For the dynamic limits, that seems to me like the right behavior from a user perspective; if I'm only allowed to run 10 jobs at a time, that shouldn't restrict me from submitting and queuing up 20 and then letting them run 10 at a time.

Personally, I'm view the dynamic limits as more of a "throttling" or a "soft limit" since the system state can change to allow the jobs to run successfully in the future, and then static ones are a "hard limit" (i.e., this job is never going to run with this particular combination of user, queue, and resources). That may not be the right way to think about though; just sharing my mental model.

dongahn commented 4 years ago

Personally, I'm view the dynamic limits as more of a "throttling" or a "soft limit" since the system state can change to allow the jobs to run successfully in the future, and then static ones are a "hard limit" (i.e., this job is never going to run with this particular combination of user, queue, and resources). That may not be the right way to think about though; just sharing my mental model.

OK. Thanks. Good to know.

I think what makes sense is to grow the table above with a clean taxonomy consisting of limit classes, limits and handling semantics and design/implement our solutions accordingly with "consistent behavior". Even from our discussions, there has been lots of confusion :-) Then we should be able to describe our system in terms of classes of limits instead of each individual limit. The latter is pretty ad hoc.

At the end of the day, we may need an RFC. Seems the first set of handling semantics appears from our discussion is:

A job will be immediately rejected when it exceeds a static limit
A job will remain in a queue when the user exceeds a dynamic limit, but it will have no impact to the scheduling policies whatsoever. (IOW, the scheduler will behave as if it doesn't exist in the queue.)

SteVwonder commented 4 years ago

@dongahn: per your question on the coffee call about does Slurm make reservations for jobs that have exceeded a dynamic limit, the answer (AFAICT from digging through the source code), is no, they do not.

In their backfill plugin, they check if any dynamic limits are exceeded here and here, and it's not for another couple of hundred lines later that they attempt any backfilling starting here.

So if we "skip" jobs once a user exceeds their dynamic limit, I believe our behavior will be in line with Slurm's.

dongahn commented 4 years ago

Do we want to support muli-queue aware max aggregate resources per user on day 1? After our recent coffee time discussion, I thought that would require too much complexity and runtime cost for us to be comfortable handling that day 1. IIUC, this would require the qmanager to traverse the entire resource section of jobspec to calculate the aggregate resource requirements or for us to push the logic down into the resource module via "virtual user resources".

I don't know how to do this yet but I thought it would be good to target it day 1 if possible.

A path: resource can index allocation per user and keep some summary information of Rs there. As part of a new successful match_allocate, if adding new R to this summary exceed the limit, it can return "limit" errno and invalidate the new allocation. Receiving, qmanager can mark the state of this user "limit hit".

At this point, we don't pass user_id into resource so that needs to be done. (probably easy). I will have to augment the match allocate RPC anyway to pass "queue label" and I can do this as part of that. A part that needs some thought is this resource summary information scheme. We want it such that the aggregate resource limit can be specified in various forms (# node or #core or #node and #cores...)

Edit: the initial posting was so poorly worded. Sorry.

dongahn commented 4 years ago

@dongahn: per your question on the coffee call about does Slurm make reservations for jobs that have exceeded a dynamic limit, the answer (AFAICT from digging through the source code), is no, they do not.

In their backfill plugin, they check if any dynamic limits are exceeded here and here, and it's not for another couple of hundred lines later that they attempt any backfilling starting here.

So if we "skip" jobs once a user exceeds their dynamic limit, I believe our behavior will be in line with Slurm's.

Yes, this would be a sane way to handle this. It would be very difficult to reason about the effect of a queuing policy when it is combined with some other "semi" scheduling like limits. And that was my fear.

dongahn commented 4 years ago

Large function, that is...

cmoussa1 commented 4 years ago

IIUC, in terms of the handling of jobs with regard to fairshare, once a resource limit has been reached (allocated_nodes * time_in_seconds is the default way to calculate resource usage I believe), the user's access to a machine doesn't get cut off, but rather, future jobs submitted by that same user will wait to run until after jobs that have been charged to other under-serviced accounts have finished running.

I'm not sure I have a good answer/suggestion as to how we want to handle this come day 1, but I thought I'd at least share what I know about Slurm's approach to handling.

dongahn commented 4 years ago

Thanks @cmoussa1:

I believe you are referring to the following rows.

Limit Category	Multi-Queue	Limit	Mechanism	Handling	Day 1
Dynamic	Multi-Queue Agnostic	Max running jobs per user	fairshare (?)	none (?)	?
		Make aggregate resources per user	fairshare (?)	none (?)	?

With respect to your comment:

future jobs submitted by that same user will wait to run until after jobs that have been charged to other under-serviced accounts have finished running

Isn't this what we want as to how to handle this limit? Are you saying this will be done as part of fairshare? Maybe I'm missing something.

cmoussa1 commented 4 years ago

@dongahn yes, those are the two rows I'm referring to! 🙂 I think since the attributes required could be found in @chu11's job-archive, they could be fed into the fairshare calculations. @chu11 - it's possible to get nodes and an elapsed time for a user's jobs, right?

dongahn commented 4 years ago

@dongahn yes, those are the two rows I'm referring to! 🙂 I think since the attributes required could be found in @chu11's job-archive, they could be fed into the fairshare calculations.

At this point, I am really wondering about the property of fairshare changes.

Does the fairshare of a user change such a way that a user can monopolize the system? This probably is a function of a user's group and jobs that are currently queued up. Under what condition, then a user can monopolize the system per his/her fairshare? Does this condition occur frequently?

Maybe someone can build an analytic model or similar to reason about this space?

cmoussa1 commented 4 years ago

Just to address your question above in some writing, @dongahn: No matter how many jobs a user submits their priorities are always changing based on their usage.

We talked two days ago about trying to design a set of static limits that could be used to generate job priorities. So, I've started looking at some static limits that can be configured to generate an integer priority for a job. So far, this is what I've come up with:

Job Size: the number of nodes/CPUs a job is allocated
Partition: the factor associated with each node partition (batch, debug, etc.)
Nice: a user-controlled factor that allows users to prioritize their own jobs

These three factors would be used to calculate a job priority p:

p = (PriorityWeightJobSize) * job_size_factor +
    (PriorityWeightPartition) * partition_factor -
    nice_factor

Each of the factors are floating point numbers that range from 0.0 to 1.0. The weights are unsigned integers which help determine which factor we want to place more emphasis on. This results in a job priority p, an integer that is greater than 0. The larger the number, the higher the job will be positioned in the queue, and the sooner the job will be scheduled.

Here's a little more explanation on the static factors that would be used in the priority calculation:

Job Size

This correlates to the number of nodes/CPUs the job has requested. This could be configured to favor larger jobs or smaller jobs.

Partition

Each node partition could be assigned an integer priority (e.g. the debug queue has an integer priority of 100, but the batch queue has an integer priority of 500, or vice versa). The larger the number, the greater the job priority will be for jobs that request to run in this partition. This priority value is then normalized to the highest priority of all the partitions to become the partition factor.

Nice

Users can adjust the priority of their own jobs by setting a nice value on their jobs. Positive values negatively impact a job's priority. Only privileged users can specify a negative value. The higher the positive value, the more it negatively impacts their job's priority.

Other factors, like age and fairshare, are dynamic limits, so I did not include them in the equation above. There are other static factors, like a Trackable Resource (TRES for short) factor, where each TRES has its own factor for a job which represents the number of allocated/requested TRES type in a given partition.

There is also the QOS factor, which allows each QOS to be assigned an integer priority. The larger the number, the greater the job priority will be for jobs that request this QOS (similar to the partition factor). IIRC, we weren't sure if we wanted to include both partitions and QOS's. An idea that I have (which is probably naive, but I figured I'd throw it out there), is to define a QOS-configurable attribute within a partition.

For example, in Slurm right now, you could have a list of partitions (debug, batch, all, etc.), and a list of QOS (normal, standby, expedite, etc.). We could instead define a QOS field in each partition, which would allow for users to specify a QOS when they submit jobs, or admins to add a QOS on a user's submitted job in a specific partition. Here's an example to try and visualize what I am thinking:

debug wouldn't have any configurable QOS's in its partition. If a job is submitted, it is assigned its priority and is not allowed to be expedited, placed on standby, anything.

batch could have submitted jobs that could be assigned a QOS either at submit time or while it sits in a queue, to expedite its priority or place it on standby.

Like I mentioned above, this is just me tossing around an idea, and I'm not entirely sure if it's feasible or well thought-out. I don't have any experience with submitting jobs with a QOS. I think this approach allows us to keep just one partition_factor instead of both a partition_factor and a qos_factor.

Hopefully this made some sense and is at least a start for us to narrow down the user policy limits!

SteVwonder commented 4 years ago

Thanks @cmoussa1 for the write up.

Nice: a user-controlled factor that allows users to prioritize their own jobs

Is this nice factor the priority factor that is already implemented in the job-manager, or is it something separate?

I ask because I always thought of the priority in job-manager as a sort of "priority class" where the highest priority jobs get serviced/considered first before lower class jobs are. Similar to how one of the network schedulers is linux works [1]:

When dequeuing for sending to the network device, CBQ decides which of its classes will be allowed to send. It does so with a Weighted Round Robin process in which each class with packets gets a chance to send in turn. The WRR process starts by asking the highest priority classes (lowest numerically - highest semantically) for packets, and will continue to do so until they have no more data to offer, in which case the process repeats for lower priorities.

PS - Not to say that this is the "right way" to think about the nice priority, I just want make sure I'm aligning my mental model with the discussion and the rest of the team's model.

grondo commented 4 years ago

I ask because I always thought of the priority in job-manager as a sort of "priority class"

To me, that sounds like how it works now, since jobs are ordered first by priority, then submit time. (If there is a difference between just sorting by priority,time and the WRR scheme described above, then I don't get it).

Whether it stays that way depends on if the current "submit priority" becomes an input factor in the priority output by a priority plugin, or if a separate priority is generated and the job-manager continues to sort on submit priority first, then secondary priority (of which submit_time would presumably be a factor). (I realize after typing that this is obvious, but it helped to type it up)

It seems there are benefits to either approach.

If we keep the primary (submit) priority as a separate, primary sort key, then things like job hold and expedite could be more easily implemented as a simple adjustment of this one priority.

If the submit priority is just one factor in the final priority, though, then I think it does satisfy the use case for a "nice" value since it is already adjustable by the user (lower only). However, then we'd need a different method for hold/expedite.

If we keep the primary priority, I can't imagine a use case for 32 "priority classes", where the jobs of each class always run before all other jobs of the lower class. This would mean that, in a typical case, if a user submitted a job with a priority 1 lower than the default, it would run after all other jobs on the system...

cmoussa1 commented 4 years ago

If we keep the primary (submit) priority as a separate, primary sort key, then things like job hold and expedite could be more easily implemented as a simple adjustment of this one priority.

Would this mean that a nice factor would be left out of a final priority calculation? If so, as far as static limits go, I think that would leave the following:

p = (PriorityWeightJobSize) * job_size_factor + (PriorityWeightPartition) * partition_factor

Would a primary sort key adjust job priorities after its initial priority p is calculated?

dongahn commented 4 years ago

@cmoussa1:

Thank you for furthering this discussion and sorry I'm coming at this late.

We talked two days ago about trying to design a set of static limits that could be used to generate job priorities.

My apology. I got a bit confused. Do the static limits affect job priority calculation other than "if your job exceed a limit, your job will not be rejected or not be scheduled"?

It seems you are referring to static priority factors?

dongahn commented 4 years ago

Job Size: the number of nodes/CPUs a job is allocated Partition: the factor associated with each node partition (batch, debug, etc.)

Yeah, considering those two properties as the static factors of priority calculation makes sense to me. One nit: we use the notion of multiple "queues" instead of "partitions" as this term conveys the concept of overlapping resource sets a bit better. So my preference is to use "queues" for this as well.

So my preference is to use "queue".

dongahn commented 4 years ago

W/ respect to the formula

p = (PriorityWeightJobSize) * job_size_factor + (PriorityWeightPartition) * partition_factor

Is the proposal to use this as one of the sorting criteria at the job-manager level?

I had to think through, but it may work at that level. Even if the jobs are sorted at job-manager, the sorted order of jobs will be such a way that, when they are enqueued into multiple queues within qmanager, the relative job order within each queue will remain constant.

In my initially mental model on how the jobs flow through multiple queues and are sorted:

enqueud into job-manager (j1, j2, ..., j10) and sorted in submit time and priority class (j2, j1 ..., j10)
then, enqueued across multiple queues (say batch and debug) within qmanager
- debug: j2, j8
- batch: j1, j3... j10
Each queue within qmanager will use p calculation for each of "its" jobs to sort them and then apply its own queuing policies on them (fcfs, easy).

At least with the static priority factors, whether we do that sorting at the job-manager level or at the qmanager level, the job order as seen by each queue within qmanager will be the same as far as we use unlimited mode.

Two issues:

How will our priority calculation plugin get the "partition" (or "queue") information when it is inherently qmanager info.
The job-order invariant property may not hold for general cases when we include dynamic" priority factors. It would be good to pick one or two dynamic factor and see if this nice invariant property can still hold?

dongahn commented 4 years ago

@cmoussa1: somehow this ticket has been morphed into a static priority factor discussion from a policy limit discussion. Perhaps we should move the priority factor discussion piece into a new ticket. I don't mind if this goes to flux-accounting?

cmoussa1 commented 4 years ago

My apology. I got a bit confused. Do the static limits affect job priority calculation other than "if your job exceed a limit, your job will not be rejected or not be scheduled"?

I don't believe they do. I guess the accounting side only contains static limits, not static priority factors. As of now, flux-accounting contains:

shares: number of shares allocated to the user
max jobs: max number of jobs a user can have running simultaneously
max wall time per job: max number of wall time a user's job can occupy

If, on submission, at least one of these static limits is exceeded, then a user's job will not be scheduled until all three limits are satisfied.

One nit: we use the notion of multiple "queues" instead of "partitions" as this term conveys the concept of overlapping resource sets a bit better. So my preference is to use "queues" for this as well.

Understood, my mistake!

Perhaps we should move the priority factor discussion piece into a new ticket. I don't mind if this goes to flux-accounting?

I can move this discussion to flux-framework/flux-accounting #9.

garlick commented 2 years ago

This issue is more broad than the topic implies, but now that we have nailed down

rfc33 static limits configuration (global and per-queue) and enforcement at job ingest
urgency (aka submit priority) is not treated as a priority class by the job manager (hopefully not different in fluxion?), but instead is a factor used by mf_priority to determine the priority, which directly determines the queue order

It seems like these problems are less wide-ranging now we could open narrower scope issues in core, accounting, and sched as needed. (Please reopen if I"m mistaken)

flux-framework / flux-sched

Enforcing user policy limits #638

637 lays out our current plans for implementing multiple partitions/queues and implicit in that is controlling the number of nodes assigned to each partition/queue.

Job Size

Partition

Nice