job-manager: add priority plugin interface

garlick commented 3 years ago

A secondary priority value for jobs was described in #3256.

Presumably we would develop a job manager plugin of some sort to generate updates to this secondary priority based on multiple factors, one of which is the per-user fair share value from flux-accounting.

This issue is about specifying a mechanism for flux-accounting to communicate periodic updates of per-user fair share values to flux-core.

garlick commented 3 years ago

Here is one straw man proposal after thinking a bit about our conversation this afternoon.

Have the job manager keep a "active users" hash, with use counts per active job, and a key-value hash per user. RPCs could be implemented to:

list the current active users
set key=value for a given user
perform a bulk update of key=value for multiple users

The schema of keys could be unknown to the job manager, but a priority plugin, such as one that is fair share aware, could read specific per-user keys and involve them in calculation of a priority update.

When the job manager unloads, it could dump this user hash to the KVS, and reload it when it starts up again.

When flux accounting wants to recalculate its fair share values (say every hour via a cron job) it could

ask job-manager for current active user list
perform bulk update of fairshare=value for those users (future optimization: only those who have changed)

After the update, the plugin could run to recalculate the secondary priority for jobs in the queue.

garlick commented 3 years ago

That was kind of an overly detailed, job-manager centric thought bubble. Sorry about that.

If I could boil it down, I think my main thought above was that the job manager could implement a generic mechanism, such as a key/value store, for caching user data needed by a priority plugin, and that an external program (e.g. run by cron) could push that data into the job manager, limiting its updates to only the current active users. So the external program and an associated priority plugin would be aware of a particular "schema" for the data and control its update frequency. The job manager would be unaware of that schema. It would simply trigger the configured plugin to run when any of its inputs might have changed (possibly handing it a list of users and/or jobs). The plugin's updates to the secondary priority would in turn possibly trigger a move of a job within the queue.

I might still be too bogged down in the details (sorry!)

grondo commented 3 years ago

An alternative would be to cache data within the plugin itself. job-manager plugins could have a init callback which would (similar to shell plugins) allow them to register a service endpoint under the job-manager, only if necessary. Then the external part of the fairshare priority mechanism would send updates directly to this handler. This would allow job-manager plugins to keep state in a manner of their choosing. Other priority plugin types could setup other mechanisms for updating priority, e.g. set a timer_watcher, etc. Plugins could register a blob for the job-manager to store in the KVS when it is unloaded, or they could be required to handle this themselves.

The benefit here is less touching of the job-manager code, and better abstraction for priority plugins. The plugin architecture could even allow multiple plugins to be loaded at once for generically extending job-manager functionality. (i.e. priority plugin is just one type of job-manager plugin)

This approach may also allow more rapid development of the initial multifactor priority plugin since we do not first need to develop a generalized data caching mechanism for the job-manager proper.

Edit: I just mean to throw this idea out there as another idea. Apologies if I'm also getting us bogged down in the details.

garlick commented 3 years ago

That makes sense to me.

A requirements question is: do jobs need to have accurate secondary priorities assigned before they become eligible for scheduling?

grondo commented 3 years ago

Good question for @dongahn, but I would assume so. Then perhaps callback for priority plugin should be made before jobs are inserted into the queue to calculate initial secondary priority using current info.

dongahn commented 3 years ago

I think so. If the instance is configured to use the secondary priorities, a job must be assigned to a value before being scheduled.

dongahn commented 3 years ago

perform bulk update of fairshare=value for those users (future optimization: only those who have changed)

@grondo, @SteVwonder, @cmoussa1 and @milroy discussed this a bit at today's coffee hour.

Currently the fair share value is essentially equal to the global rank of users. So it is likely the fairshares of a majority of users will change for common cases at every invocation of fshare calculation. E.g., if the previously highest-priority job becomes the lowest priority for the current round, this single change will change the rank of every user.

Once we make reasonable progress with unoptimized update protocol, there are some techniques we can look into. I briefly mentioned this yesterday but perhaps the concept of the edit distance can be expanded. Introducing floating point schemes such a way that we minimize the changes to the previously calculated fairshare values can also be fruitful. As is, We're a bit tied to the normalized fair share values between [1.0, 0), which won't work well to support such an augmented scheme.

In flux-accounting, I will open up a ticket to augment libweighted_tree to incur minimal fairshare value changes.

BTW, @SteVwonder correctly noticed that solving this may also allow us to incur minimal changes on subsequent job-manager to fluxion-qmanager updates.

dongahn commented 3 years ago

@garlick and @grondo: For unoptimized update protocol, once the weighted tree walks the tree, it will have a sorted vector of users:

For example, in this simple test code

walk: https://github.com/flux-framework/flux-accounting/blob/master/src/fairness/weighted_tree/test/weighted_tree_test01.cpp#L78

iteration of ordered user vector: https://github.com/flux-framework/flux-accounting/blob/master/src/fairness/weighted_tree/test/weighted_tree_test01.cpp#L84

Perhaps we can agree on the proper format of this order user vector (that goes into flux-core) so that a program can be developed at flux-accounting in the near future? For example, essentially the payload schema for the upcoming RPC or similar?

I guess we may need to decide whether plugin will do this or job-manager will do this to decide on the schema... Or would it matter when the schema is likely a key-value set?

grondo commented 3 years ago

I guess we may need to decide whether plugin will do this or job-manager will do this to decide on the schema... Or would it matter when the schema is likely a key-value set?

Yes, I think this is the first design decision we should make. My opinion would be that we design an interface for a generic plugin architecture for the job-manager (perhaps priority-specific at this point) with an interface for plugins to register callbacks in the job-managers flux_t handle. This would allow the plugin to independently be developed in parallel (and wholly in flux-accounting) with the job-manager plugin interface in flux-core.

In any event, it would seem you could make considerable progress on priority plugin building blocks which could be tested using unit tests. Maybe seeing this development would inform a better design of the priority plugin interface in the job-manager. I still feel a bit in the dark on how the whole process fits together (e.g. job completion data from job-archive goes into accounting database, something calculates global fairshare vector, somehow the output of this vector creates a factor that is used in a multi-factor calculation to augment existing job priority values). It would allow flux-accounting to make good progress if you had some way to drive this process with mock data in flux-account unit and/or system tests.

dongahn commented 3 years ago

In any event, it would seem you could make considerable progress on priority plugin building blocks which could be tested using unit tests.

https://github.com/flux-framework/flux-accounting/pull/65 is the big part of this building blocks including unit tests. When you have a chance, please take a look at the unit tests (https://github.com/flux-framework/flux-accounting/pull/65/commits/746a6ce306ef164907b0c883c8031436ee63fd15) to let us know if it is along the line of what you are thinking about.

How about for me or @cmoussa1 to take the next step to create a mock program that uses this library and outputs a JSON with order vector of users? The program can either pass the output as an output file or similar (how do you do this with python job validator?) or sends the vector through an yet-to-be developed RPC -- in this case it would use flux_open

I still feel a bit in the dark on how the whole process fits together (e.g. job completion data from job-archive goes into accounting database,

This is the block that @cmoussa1 is developing under the guidance of @chu11. He should chime in with where he is.

something calculates global fairshare vector, somehow the output of this vector creates a factor

This is the block (libweighted_tree) that was developed as above.

that is used in a multi-factor calculation to augment existing job priority values).

This is the block that should be (co-)architected into flux-core, I think.

grondo commented 3 years ago

This sounds good to me @dongahn! I will take a look at https://github.com/flux-framework/flux-accounting/pull/65. Thanks!

dongahn commented 3 years ago

flux-framework/flux-accounting#65 is the big part of this building blocks including unit tests. When you have a chance, please take a look at the unit tests (flux-framework/flux-accounting@746a6ce) to let us know if it is along the line of what you are thinking about.

How about for me or @cmoussa1 to take the next step to create a mock program that uses this library and outputs a JSON with order vector of users? The program can either pass the output as an output file or similar (how do you do this with python job validator?) or sends the vector through an yet-to-be developed RPC -- in this case it would use flux_open

@cmoussa1: I can make today's 2PM coffee hour for the first 30 mins. Perhaps we should discuss the initial step there? Now that all of the PRs have been landed for flux-accounting, it would be good to generate what need to be done for the next release cycle.

cmoussa1 commented 3 years ago

I still feel a bit in the dark on how the whole process fits together (e.g. job completion data from job-archive goes into accounting database,

This is the block that @cmoussa1 is developing under the guidance of @chu11. He should chime in with where he is.

Before we shifted to working on first implementations for the weighted tree library, I was working on implementing a job usage calculation that utilizes @chu11's job-archive module (and that eventually makes its way into the flux-accounting database). I think I've made some good progress there, but I am sure there will be good feedback and suggestions to further optimize it. Maybe this week during the flux-accounting meeting I can talk about where I am at and see if this is indeed the right course of action.

@cmoussa1: I can make today's 2PM coffee hour for the first 30 mins. Perhaps we should discuss the initial step there? Now that all of the PRs have been landed for flux-accounting, it would be good to generate what need to be done for the next release cycle.

@dongahn - that sounds like a good plan. Talk to you then.

garlick commented 3 years ago

Let me try to summarize the architecture we discussed in today's 2pm coffee call, or how I am picturing it. Please comment!

The job manager would offer a "priority plugin API". This would be a published API, similar to the job shell plugin API, that would enable out of tree projects to provide job manager plugins that (a) can access various job attributes, and (b) can use that info, and other info perhaps, to set/update, the secondary priority of individual jobs.

The secondary priority would, in turn, be used by the job manager and schedulers to order the queue of alloc requests. Specifically, the order would be determined by (in descending precedence): 1) administrative priority, 2) secondary priority, 3) submit time. Aside: actually let's revisit this. Perhaps there should only be one priority that is calculated by the plugin and takes 1) and 3) as input.

The API will need to include a mechanism for the job manager to ask the plugin for an initial priority priority value for a new job. I think this "request" from the job manager to the plugin would need to occur and be satisfied on transition of a job to SCHED state, since the priority is required to be set before an alloc request is sent to the scheduler.

There would be another API interface that would allow the plugin to asynchronously update the secondary priority of a pending job, much like flux job priority allows the administrative priority to be updated.

Other plugin API interfaces that I think we can project will be necessary:

get job attribute: e.g. submit time, admin priority, job owner, current secondary priority, jobspec (resources section)
notification of jobs that are no longer pending
a mechanism for registering RPC sevices like the job shell, e.g. job-manager.priority-<method>.

We discussed that a priority plugin that uses the above interface to implement a "fair share" specific multi-factor priority calculation could be part of the flux-accounting project. Its priority calculation would take as inputs some of the above data from the job manager plugin API, and some data from flux-accounting, especially the user's fair share factor. This info would be pre-loaded by the plugin on initialization. Furthermore, since the fair share factors evolve over time, the plugin would need to set up a service method to accept updates periodically.

It would be useful to further flesh out the data that a priority plugin might require. We have a start on the factors needed for a fair share based multi-factor calculation in flux-framework/flux-accounting#8 although it is pretty slurm specific and needs to be fluxified.

dongahn commented 3 years ago

Thanks @garlick. This sounds like a great start.

The API will need to include a mechanism for the job manager to ask the plugin for an initial priority priority value for a new job. I think this "request" from the job manager to the plugin would need to occur and be satisfied on transition of a job to SCHED state, since the priority is required to be set before an alloc request is sent to the scheduler.

Perhaps we will need support for no plug-in or dummy plug-in support. For nested instances where plug-ins are not provided, job manager still wants to transition a job to SCHED with no or equal secondary priority.

get job attribute: e.g. submit time, admin priority, job owner, current secondary priority, jobspec (resources section)

It seems we need some thoughts about those job attributes that job-manager may not know immediately. Currently we set things like queue in the attr section of the job spec. Maybe the job spec can be passed to the multi factor priorities to help it harvest those scheduler specific attributes as well.

There would be another API interface that would allow the plugin to asynchronously update the secondary priority of a pending job, much like flux job priority allows the administrative priority to be updated.

Does it mean the multi-factor priority plugin needs to keep track of all of pending jobs? I initially thought that we can avoid this, by designing this asyn interface to operate at the user account level. But then job-manager probably doesn't even want to have the concept of bank accounts/users, so perhaps replicating job list within this plugin is unavoidable...

garlick commented 3 years ago

Good comments!

For nested instances where plug-ins are not provided, job manager still wants to transition a job to SCHED with no or equal secondary priority.

Agreed. I was assuming that the plugin would not be loaded by default and things would work as they do now. The system instance could be explicitly configured to load the fair share plugin. Alternatively (see aside above), we could load a default plugin that calculates a single priority value, taking as input the administrative priority and submit time.

It seems we need some thoughts about those job attributes that job-manager may not know immediately. Currently we set things like queue in the attr section of the job spec. Maybe the job spec can be passed to the multi factor priorities to help it harvest those scheduler specific attributes as well.

Passing the jobspec through may be a good option. I was thinking we would just pass the resources section to avoid the long environment but attributes surely would be needed too. Note that the job manager currently doesn't have the jobpsec in hand for each job so that may need some work.

Does it mean the multi-factor priority plugin needs to keep track of all of pending jobs?

Hmm, maybe, or maybe we could implement a query at the plugin API level for listing all jobs by owner?

dongahn commented 3 years ago

Hmm, maybe, or maybe we could implement a query at the plugin API level for listing all jobs by owner?

I like this idea. But in this case, we may need some good performance requirement for the query to return the list jobs own by the user. BTW, is the assumption here the user account seen by flux-accounting will be the same user ID maintained by flux-core?

garlick commented 3 years ago

we may need some good performance requirement for the query to return the list jobs own by the user.

Excellent point. If this plugin runs in the job manager's thread (what I was proposing) then it should not hold onto it for long or it will negatively impact job throughput. It will get control through its update RPC, when the job manager calls in to get priority for a new job, or when any of its own reactor watchers run. Any API call we offer should be designed to be fast even when the number of pending jobs is large.

When the plugin has a significant batch of work such as processing a new vector of fair share values for multiple users, it may need to interleave its work with letting the job manager run, for example using the prep/check/idle reactor idiom.

BTW, is the assumption here the user account seen by flux-accounting will be the same user ID maintained by flux-core?

Yes I think that is OK for now. At some point when the flux-accounting db is multi-cluster, and multiple sites are using it, we may need to map to an organization's unique ID in some way, but I don't think we need to worry about that right now.

dongahn commented 3 years ago

a mechanism for registering RPC sevices like the job shell, e.g. job-manager.priority-.

Is this so that the multi-factor priority (MFP) plugin client can do an RPC to the MFP plugin server or something else?

dongahn commented 3 years ago

It would be useful to further flesh out the data that a priority plugin might require. We have a start on the factors needed for a fair share based multi-factor calculation in flux-framework/flux-accounting#8 although it is pretty slurm specific and needs to be fluxified.

I'd ask @cmoussa1 to propose a minimal set that we should tackle for the first round. As part of that, we should get the terms right to be more appropriate for Flux.

grondo commented 3 years ago

The API will need to include a mechanism for the job manager to ask the plugin for an initial priority priority value for a new job. I think this "request" from the job manager to the plugin would need to occur and be satisfied on transition of a job to SCHED state, since the priority is required to be set before an alloc request is sent to the scheduler.

At first I was wondering if there was really a need for a separate call for the initial priority. If priority is a function of job parameters + some internal plugin state, then a single get-priority call should work the same for the initial priority vs updates.

However, on second thought, it might make things easier for a plugin developer to have a callback as a job-manager job enters each main job state (i.e. "init", "depend", "sched", "run", "cleanup", "inactive"). This would allow a priority plugin to hook in to the correct place to initialize its internal state (e.g. sometimes a priority plugin would want to create internal state for jobs even if they are in DEPEND, or keep that state while they are in CLEANUP). Any plugin callback after SCHED would not be able to update job data such as the priority, but a sophisticated plugin could use this data to update internal state that influences the priority of pending jobs). At all times, the "inactive" callback could be used to delete any internal job state (such as a count of running jobs for a given user, etc)

This kind of implies that the job-manager would drive the recalculation of priorities on a per-job basis, e.g. the job priority plugin interface would have callbacks that pass in a single job. Perhaps when internal state in a priority plugin has been updated, the plugin can set a flag in the job-manager to request a re-prioritization loop, which would be driven at the job-manager's discretion?

Sorry this may have gotten a little off-topic.

grondo commented 3 years ago

a mechanism for registering RPC sevices like the job shell, e.g. job-manager.priority-.

Is this so that the multi-factor priority (MFP) plugin client can do an RPC to the MFP plugin server or something else?

Yes, AIUI, the idea here is to allow the plugin to accept an update of data from an external source. In this case a provided script or cron job could periodically push updates of one or more factors to the plugin. The service registered by the plugin could also respond to a request for the current data values for debugging or informational purposes.

garlick commented 3 years ago

This kind of implies that the job-manager would drive the recalculation of priorities on a per-job basis, e.g. the job priority plugin interface would have callbacks that pass in a single job. Perhaps when internal state in a priority plugin has been updated, the plugin can set a flag in the job-manager to request a re-prioritization loop, which would be driven at the job-manager's discretion?

Ah I was thinking that the plugin would be required to select which jobs need to be updated. If we wanted to just let the plugin trigger a job manager iteration over all pending jobs when it receives updated factors, that would simplify things quite a bit:

no fast getjobsbyuser() lookup required
responsiveness of the job manager reactor during a bulk update could be managed in the job manager using prep/check/idle or similar, not in the plugin
a really dumb plugin could just implement the single getpriority() callback.

I hadn't thought of doing it that way but now that you suggest it, I really like it especially for a first cut since it will significantly simplify the plugin and its API.

One question: Is it reasonable to require that a getpriority() callback must always return a result immediately, or must we design some sort of async interface that allows the plugin to make RPCs? It would be a lot simpler design if we could assume an immediate response, but that would presume that any externally-sourced data is preloaded in the plugin at initialization (for example the fair share factors for all possible users).

garlick commented 3 years ago

Another way to go might be to add a new PRIORITY job state between DEPEND and SCHED. The job would transition from PRIORITY to SCHED once its priority is established. The priority plugin could watch for a job state transition into PRIORITY (or earlier) which would trigger it to fetch a user's data, if not in cache, and call a plugin API function to set the job's priority.

After that we could assume that a user's data is cached and drive priority updates from the job manager, expecting a getpriority() function to return immediately.

I guess the PRIORITY state is not really necessary and we could have another notification in SCHED state that the job needs its priority set, and hold back the alloc request until its done. However, I kind of like the idea of a new state here as if flux-accounting turns into a center resource, it might occasionally be slow or unavailable, and having jobs stuck in a PRIORITY state would make it pretty obvious what is going on. That also might be useful when the priority plugin is provided by a site, and the "why is flux slow" question could be answered if the time spent in the priority state is shown to be long.

grondo commented 3 years ago

I like the idea of the PRIORITY state. However, I think we should discourage a design where a priority plugin fetches something from a remote service or does a blocking operation in the common case to calculate the initial PRIORITY. The plugin should have enough information from the previous out-of-band update of internal state to calculate an initial priority for any new job.

However, I do like that the PRIORITY state allows more flexibility and would allow a plugin to be developed in this way if desired!

I also agree that priority updates should be instant (no RPCs, no blocking) in a well-behaved plugin. RPCs to update internal state should be handled either out-of-band, or I suppose could be scheduled via a timer watcher (with no blocking).

grondo commented 3 years ago

One reason to still allow a priority plugin to have callbacks even in non-PRIORITY states is that this will allow a plugin to update some of its internal state in-band, e.g. total count of queued jobs for a user, perhaps the number of running jobs for a given user, or some implementations may want to count jobs in CLEANUP, some may not, etc. By having a place to hook into the job-manager in all of these cases, we allow the plugin developer maximum flexibility in the implementation.

garlick commented 3 years ago

Oh yeah I meant to state agreement with the idea that it should be possible through the plugin API to register a callback on all job state transitions. In fact maybe this could just be a subscription to the already-defined job state transition event.

However, I do like that the PRIORITY state allows more flexibility and would allow a plugin to be developed in this way if desired!

It would let the fair share priority plugin be implemented different ways:

initially load fair share factors for all center users, and call setpriority() for a job immediately upon its entry to PRIORITY state. In the rare case that a user is not known at the time of startup, job could remain in PRIORITY state until the next periodic update of fair share factors brings in the user
initially load nothing, and leave jobs in PRIORITY until first periodic update of fair share factors primes the cache
initially load nothing, and upon entry into PRIORITY, fetch the user's fair share factor from flux-accounting, calling setpriority() once response is received.

In the latter case, fair share factors could be timed out of the cache if desired.

grondo commented 3 years ago

In fact maybe this could just be a subscription to the already-defined job state transition event.

I would consider not using an event subscription for this. It could make plugin development less straightforward since it would introduce races, and would require the plugin to parse the job state event payload. You would also need to provide some way for the plugin to fetch the job by id from the job-manager, whereas with a more typical plugin callback method, the job-manager can pass the job directly in to the plugin's callback method.

grondo commented 3 years ago

BTW, what we're moving towards here seems like it could be more powerful than just a job-manager priority plugin. If plugins have a hook into every job state transition, an interface for querying job properties including portions of jobspec, and an initialization hook that allows plugins to register message handlers and watchers on the job-manager's flux handle, then plugins could be written that do a lot more than manage administrative job priorities (just as an example, the job-manager "journal" could have likely been developed as a plugin)

As we design the interface, we could keep this in mind, though it is only a secondary use case. But if we can avoid creating priority-specific interfaces, then we keep the doors open to other job-manager extensions in the future.

garlick commented 3 years ago

As we design the interface, we could keep this in mind, though it is only a secondary use case. But if we can avoid creating priority-specific interfaces, then we keep the doors open to other job-manager extensions in the future.

Sure

(just as an example, the job-manager "journal" could have likely been developed as a plugin)

I was just about to disagree with you and point out what new hooks would be required to do that - then I realized we wouldn't even need the journal if we had done job-info as a job manager plugin. Maybe I shouldn't have said that - @chu11 might kill me.

Edit: Nix that thought! Let's stay focused.

dongahn commented 3 years ago

Great discussions. In terms of making lock-step progress w/ flux-accounting, do we think it is a good idea to do a skeleton RFC for the MFP plugin API? Once we can narrow down on the design details a bit using the RFC skeleton, perhaps flux-accounting can make progress based on that without actually having this API implemented within flux-core. Then, when flux-core team can make progress on this API, we can start our next-level round co-design/-implementation?

garlick commented 3 years ago

My plan is to open two RFC PRs for a start:

RFC 27 Sched Alloc Proto: add protocol for reprioritizing jobs
RFC 21 Job States & Events: add PRIORITY state

We can continue discussion on those details in the PRs.

dongahn commented 3 years ago

@garlick: yeah this is great! Thank you for doing this for us!

SteVwonder commented 3 years ago

With all of our other state transitions, they implicitly represent a "hand-off" of the job from one module "owner" to another. If we are adding a PRIORITY state, does it make sense to add a priority module, and then have flux-accounting provide a plugin to that module? It's definitely not necessary, but I just wanted to mention the pattern that already exists and ask if we want to continue it.

garlick commented 3 years ago

That's an interesting comment. Had to think a bit, but I believe we still are preserving the pattern for the most part. During the PRIORITY state we are waiting for an external entity (in this case a plugin possibly from another project) which might or might not need to phone home to flux-accounting or similar, so it feels appropriate to me for that to be event driven, and for the state to be visible since it will provide some insight if the phone home part is slow or stuck.

We could perhaps arrange things so that the plugin/service is another module that communicates only with messages, or the other option I was thinking about earlier was a python filter like the ingest validator. But it seems like a direct plugin might be easier to reason about and more efficient, otherwise a lot of job data might need to be transferred/cached in the other module/coprocess. Simple function calls turn into RPCs that have to be managed asynchronously, etc..

At least that's the way @grondo and I were leaning. The benefit of going the other way would be some isolation from bugs/slowness, especially in the coprocess case which can't crash the broker even if it segfaults.

SteVwonder commented 3 years ago

But it seems like a direct plugin might be easier to reason about and more efficient, otherwise a lot of job data might need to be transferred/cached in the other module/coprocess. Simple function calls turn into RPCs that have to be managed asynchronously, etc..

Yeah, I agree this is a massive overhead (developer and message-wise) to make it a separate module. This is a more tightly coupled use case than say the dependency module. I think the reasoning is sound for making it a plugin of the job manager, just wanted to toss it out there to make sure we weren't missing anything.

chu11 commented 3 years ago

a subtlety in the above discussion that perhaps I missed (or perhaps hasn't been determined). When a job enters the PRIORITY state, it asks the plugin for the initial queue priority. But what about any priority changes after that?

Does the job-manager call the priority plugin every once in awhile? perhaps at configured intervals
When it calls the priority plugin, does it send 1 job at a time, or all the jobs currently in the queue? I have the feeling it has to be the latter?

grondo commented 3 years ago

But what about any priority changes after that?

AIUI, we were considering two interfaces here:

a simple callback into the plugin which given a job returns a priority. This would be used in place of any default job_update_priority() or similar function in the job manager. In this case the function would be called once per job, whenever the job manager wanted to update a job's priority. A limitation would be that the function to calculate the priority is understood not to block.
a setpriority() call that the plugin could call asynchronously on any job. The job manager would then perhaps notify the priority plugin of job/jobs that need a priority update, and the plugin could either synchronously or asynchronously update the job or job's prioirty.

I would argue for the first case above initially. This approach is much simpler and allows the job-manager full control over when job priorities are updated. The plugin could update its internal state asynchronously (via RPCs, timer watchers, or a service method to accept updates), so that the job_update_priority() call is always just a simple calculation. The job manager plugin API could offer a method to allow the plugin to trigger a reprioritization.

For the second case, the plugin would have control over when job priorities are updated, and thus the job-manager would have to have some slightly sophisticated code to batch priority updates before acting on them, and if it wanted to trigger a priority recalculation for one or more jobs, it would need code to be able to handle this asynchronously.

chu11 commented 3 years ago

The job manager plugin API could offer a method to allow the plugin to trigger a reprioritization.

Sounds good. This was the glue piece that I did not see.

I was thinking of a basic priority plugin that mirrors what the job-manager does right now, where jobs are ordered by admin priority first, t_submit second. If the user changes the admin priority for a job via flux job priority, it could cause many jobs to have their priorities changed to ensure ordering is correct.

grondo commented 3 years ago

If the user changes the admin priority for a job via flux job priority, it could cause many jobs to have their priorities changed to ensure ordering is correct.

Doesn't the job-manager use zlistx_t for the ordered queue? If a job has its priority updated, it should be moved simply with zlistx_reorder() (perhaps that is already done in the job-manager since updating priority is already supported?)

grondo commented 3 years ago

Before any work starts on this, we should probably have proposed API for job-manager plugins and make sure it will cover our basic use cases. We can then consider how and when the job-manager will call each defined plugin callback to make sure the plugins have the required hooks, and the priority updates will work for the intended cases.

chu11 commented 3 years ago

Doesn't the job-manager use zlistx_t for the ordered queue?

Yes

If a job has its priority updated, it should be moved simply with zlistx_reorder() (perhaps that is already done in the job-manager since updating priority is already supported?)

My understanding is that with the priority plugin, queue priorities become the only data used for ordering, t_submit is no longer used in ordering. Possibly bad example: a bunch of jobs are assigned queue priorities of 100, 99, 98, 97, 96. Then due to a admin priority change and the specific t_submit of that job, a job needs to go between the job with queue priority 100 & 99. Don't all of the other jobs have to have their priorities adjusted as well?

grondo commented 3 years ago

Don't all of the other jobs have to have their priorities adjusted as well?

I think you answered your own question. In the simple case the priority value is a function of t_submit and admin priority, not a function of other jobs in the queue. Simply, you calculate a new priority for a job and then insert that job into the queue in the order dictated by that priority.

a bunch of jobs are assigned queue priorities of 100, 99, 98, 97, 96. Then due to a admin priority change and the specific t_submit of that job, a job needs to go between the job with queue priority 100 & 99. Don't all of the other jobs have to have their priorities adjusted as well?

I don't think the priority values of adjacent jobs would ever be sequential, and priority "ties" would be almost astronomically unlikely (t_submit as a factor in the priority calculation should eliminate that possibility). That is why the queue priority is given a huge range (or alternately a floating-point value between 0. and 1.0, though we've chosen not to do that here)

chu11 commented 3 years ago

I don't think the priority values of adjacent jobs would ever be sequential, and priority "ties" would be almost astronomically unlikely (t_submit as a factor in the priority calculation should eliminate that possibility). That is why the queue priority is given a huge range (or alternately a floating-point value between 0. and 1.0, though we've chosen not to do that here)

Ok, perhaps I'm thinking more about the corner cases rather than the average case. Perhaps I need to see how how some other simple-ish schedulers operate, then things would be more obvious to me.

garlick commented 3 years ago

Agree with @grondo here.

I don't think we want to assert that priorities must be unique though. Maybe practically speaking we do need to be prepared to break ties deterministically, like order on priority, then job ID. That would be a good subtlety to point out in an RFC at some point, if it makes sense.

grondo commented 3 years ago

It might have been less confusing if we kept priority as a floating point number between 0. and 1.0?

grondo commented 3 years ago

like order on priority, then job ID.

Good point, I didn't mean that priority has to be unique, just that it is a function of factors that are consider on a per-job basis, not based on other jobs (except where they may influence those factors out-of-band, e.g. total number of queued jobs for a user). And the resulting priority value should have a large enough range to avoid ties becoming common.

BTW, by sorting ties on jobid second, there is a not-so-hidden benefit to submitting your job on the lowest rank possible ;-)

(edit: to go back to @chu11's example above, even if you did try to re-prioritize the existing jobs in the queue, if none of the factors had changed, then the priority value will not change, so it doesn't do any good to reprioritize all jobs when one job is updated.)

dongahn commented 3 years ago

a simple callback into the plugin which given a job returns a priority. This would be used in place of any default job_update_priority() or similar function in the job manager. In this case the function would be called once per job, whenever the job manager wanted to update a job's priority. A limitation would be that the function to calculate the priority is understood not to block.

a setpriority() call that the plugin could call asynchronously on any job. The job manager would then perhaps notify the priority plugin of job/jobs that need a priority update, and the plugin could either synchronously or asynchronously update the job or job's prioirty.

One comment: I think it will be useful to design the plugin API so that the plugin itself doesn't have to keep its own pending job queues. The duplication of states like this can lead to higher complexity,

dongahn commented 3 years ago

It might have been less confusing if we kept priority as a floating point number between 0. and 1.0?

I fear normalizing the priority to 0. - 1.0 can lead to an issue down the load when we optimize to minimize the number of updates (e.g., reusing the old fair share values as much as possible: https://github.com/flux-framework/flux-accounting/issues/69).

A question: do you see the plugin API will pass the primary priority down to the plugin so that it can incorporate that into the resulting priority? Or that is the responsibility of the job manager and the plugin should just concern the secondary priority. It seems the latter is more straightforward.

grondo commented 3 years ago

One comment: I think it will be useful to design the plugin API so that the plugin itself doesn't have to keep its own pending job queues. The duplication of states like this can lead to higher complexity,

Yeah, I had assumed the plugin would not keep its own queue, and would instead keep the minimum state necessary for whatever factors it may be using to compute the priority value of any job. In many cases, the plugin may not need to keep any state if it is only using factors that are part of a job.

flux-framework / flux-core

job-manager: add priority plugin interface #3311