job-manager: add priority plugin interface

garlick commented 3 years ago

A secondary priority value for jobs was described in #3256.

Presumably we would develop a job manager plugin of some sort to generate updates to this secondary priority based on multiple factors, one of which is the per-user fair share value from flux-accounting.

This issue is about specifying a mechanism for flux-accounting to communicate periodic updates of per-user fair share values to flux-core.

garlick commented 3 years ago

A question: do you see the plugin API will pass the primary priority down to the plugin so that it can incorporate that into the resulting priority? Or that is the responsibility of the job manager and the plugin should just concern the secondary priority. It seems the latter is more straightforward.

We're getting away from primary and secondary priority as terms, but yes I think the administrative priority and the previous value of the queue priority would be available to the plugin via some kind of get function.

dongahn commented 3 years ago

BTW, by sorting ties on jobid second, there is a not-so-hidden benefit to submitting your job on the lowest rank possible ;-)

I would argue we use submit time as our tie breaker since at least we can explain this to users more easily and I have to think this will have least bias. However, since our FLUID is loosely ordered by submit time, it may work just as well. But in case we need to explain this to users, we should say "submit time" since it is earlier to grok.

grondo commented 3 years ago

I fear normalizing the priority to 0. - 1.0 can lead to an issue down the load when we optimize to minimize the number of updates (e.g., reusing the old fair share values as much as possible: flux-framework/flux-accounting#69).

Yeah, sorry I was mainly proposing that as a thought experiment as I think it would show the unlikelihood of the corner case example. However, we should point out the distinction between fair share values that the plugin might be using as one of the factors in the calculation of priority. How the fair share table gets updated in the plugin is a different issue, right?

dongahn commented 3 years ago

We're getting away from primary and secondary priority as terms, but yes I think the administrative priority and the previous value of the queue priority would be available to the plugin via some kind of get function.

I think my main question is whether the plugin should be responsible for incorporating the administrative priority of the job into the resulting priority. If this should be the case, we need a pretty disciplined way in the resulting priority. (e.g., using the high order bits of the result for the admin priority and low order bits for secondary priority).

grondo commented 3 years ago

I think my main question is whether the plugin should be responsible for incorporating the administrative priority of the job into the resulting priority.

I think yes! This is one of the important factors of a multi-factor priority right?

If this should be the case, we need a pretty disciplined way in the resulting priority. (e.g., using the high order bits of the result for the admin priority and low order bits for secondary priority).

This part confused me though, what do you mean by secondary priority here?

Would it help if we renamed the "admin priority" something like the "nice" value? (thought that would reverse its meaning)

Then there is just one priority, which is always calculated.

garlick commented 3 years ago

I would argue we use submit time as our tie breaker since at least we can explain this to users more easily

I suggested FLUIDs because they are guaranteed unique. It's possible to still have a tie with t_submit.

dongahn commented 3 years ago

It's possible to still have a tie with t_submit.

Huh. I didn't know. I assumed each job will get to a central location at which point we time stamp. Clearly this assumption was incorrect... (t_submit is time stamped by the local broker in a distributed fashion, or similar?)

garlick commented 3 years ago

t_submit is time stamped by the local broker in a distributed fashion, or similar

It's set from gettimeofday() (wall clock) in the ingest module approximately when it is received. The ingest module runs across all ranks.

dongahn commented 3 years ago

How the fair share table gets updated in the plugin is a different issue, right?

Yes this is a different issue.

dongahn commented 3 years ago

This part confused me though, what do you mean by secondary priority here?

Would it help if we renamed the "admin priority" something like the "nice" value? (thought that would reverse its meaning)

Then there is just one priority, which is always calculated.

We discussed this a bit at our coffee hour. I believe we converged that we don't want to mandate that the priority resulting from the plugin should be capable of strict lexicographical ordering of multiple factors although it can.

garlick commented 3 years ago

Would it help if we renamed the "admin priority" something like the "nice" value? (thought that would reverse its meaning)

Then there is just one priority, which is always calculated.

Not a bad idea, though a bit of extra work. We should probably decide thumbs up or down on this idea before merging #3339 (which is about ready to go if we don't make this change). Probably we should go with a new range of -20 (un-nice) to +20 (nice) like the system one if we rename it to avoid confusion.

What do people think? I don't really have a problem with the current "admin priority" and "queue priority" names, personally.

grondo commented 3 years ago

What do people think? I don't really have a problem with the current "admin priority" and "queue priority" names, personally.

I don't really have a problem with it either, though I do worry about future confusion with multiple things called priority.

I do prefer "priority" as a concept to "nice" value since it is numerically more intuitive. (though our "default" admin priority of 16 is kind of unintuitive to me)

dongahn commented 3 years ago

I don't have a problem as far as everyone can look at the same page about what it is. The thing that assured me during our coffee hour is that, there are other ways for an admin to override the priorities calculated from the plugin to make a job ahead of anything else.

chu11 commented 3 years ago

While looking at https://github.com/flux-framework/flux-core/issues/3327, I realized a potential issue.

When the job-info module is initialized, it reads from the job eventlog to recreate job state. But RFC21 says we shouldn't put the priority event in the eventlog. So there's no way to know the "queue priority" of a job from the eventlog.

No biggie if we're reloading job-info and job-manager at about the same time (i.e. bootstrapping a instance), we normally get that from the journal coming from the job-manager.

But if we're just reloading the job-info module, we may never get a priority event in the journal. Do we need some way to ping the job-manager and say "hey, send out this info in a journal event"? Without the queue priority, job-info may not be able to sort jobs properly. Which maybe isn't the biggest deal?

garlick commented 3 years ago

Yeah as mentioned in #3327, I agree with your assessment and think we should have two separate events. I was thinking that we wouldn't need the event in the eventlog because job-manager should just flip back to PRIORITY from SCHED during a replay, and let the priority plugin recalculate the priority. But a state-changing event must be in the persistent eventlog or our design is broken! I'll submit a change to the RFC. Sorry for the misstep.

chu11 commented 3 years ago

idea - for the alloc event, perhaps we log the "final" queue priority for the job before it was run? so that when on restart job-info will be able to load that data?

Edit: this is assuming the value is something we want to present to users in flux jobs. If not, we don't need it. It would only be needed in job-info for sorting of pending jobs.

garlick commented 3 years ago

I call a design issue pause (on myself)!

If we post the priority event to the eventlog, then when the job manager restarts, the eventlog replay will cause it to skip the PRIORITY state and not recalculate the initial priority. So the queue order would be wrong until the first reprioritization pass.

If we've said the priority plugin can do some initialization that might involve phoning home to flux-accounting the first time it sees a user (and that was one motivation for the PRIORITY state) but NOT after that, how does that work after a restart if the initial priority came from the eventlog?

On a restart, if job-info replays the eventlog then applies the journal, it won't get the most up to date priority unless the last round of queue prioirty updates fits in the journal.

garlick commented 3 years ago

I think one misstep was allowing the SCHED state to regress to PRIORITY state across a job manager restart without a transitioning event. If we added, say, a priority-invalidate event that the job manager posts when it replays the eventlog of a job and finds it in the SCHED state, then a) job manager would contact the priority plugin again to re-establish the priority, and b) external entities like job-info can accurately track the job state.

RFC 21 has these rules

Replaying the job eventlog SHALL accurately reproduce the current job state.
The job state machine SHALL NOT contain cycles, to avoid ambiguity in synchronization.

We are violating both of these rules now. We would not be violating the first one if add priority-invalidate. We could delete the no-cycles rule or soften it to "...SHOULD NOT contain cycles..."

On synchronizing job-info, first I don't think the queue priority should be considered valid when the job is not in SCHED state (in a sense it is not in a queue at that point). I do think, for testing if nothing else, users should be able to fetch the priority value, for example via a custom flux jobs format, but it would be appropriate to print - or similar if the job is not in SCHED state.

Does that make sense?

Edit: oh we still have the problem of getting the latest priority if job-info is reloaded, since the journal might have wrapped. Hmm.

grondo commented 3 years ago

We could delete the no-cycles rule or soften it to "...SHOULD NOT contain cycles..."

Maybe step back and think about why you wanted to avoid cycles originally. If it seems important, then I would keep the rule and figure out how to avoid the cycle. If not, then just delete the no-cycles rule. Stating SHOULD NOT then doing it anyway in the same RFC could be considered a bit odd.

garlick commented 3 years ago

Maybe step back and think about why you wanted to avoid cycles originally. If it seems important, then I would keep the rule and figure out how to avoid the cycle. If not, then just delete the no-cycles rule. Stating SHOULD NOT then doing it anyway in the same RFC could be considered a bit odd.

Yeah that's true, I probably shouldn't have suggested that option.

I do remember one issue we discussed was allowing a cycle from RUN to SCHED state for grow. I was against this as it seemed to me it would make it much harder to reason about and synchronize on things like whether or not the guest namespace existed, or whether a job had resources or shells. I'm not sure the blanket statement was appropriate though. I would lean towards deleting it, and deciding the impact of each proposal to add a cycle on a case by case basis.

I may have overstated the case in order to make the design more understandable at the time (back when there were infinite possibilities!)

I hope I'm not arguing for this now just because I'm not being creative.

grondo commented 3 years ago

That sounds reasonable to me.

It would impossible to avoid cycles if any job state needs to be re-obtained after a job-manager restart. That makes me wonder, would a more generic reload event or similar be more useful than priority-invalidate, or do you think priority is the only job state that would need to be invalidated on reload?

I wonder if in the future we'll want to be able to revert back to DEPEND state for jobs for some reason, e.g. if we allow job updates and a new dependency or minimum job begin-time is applied.

Sorry, I'm probably not helping here.

garlick commented 3 years ago

Oh I view the eventlog replay as a process that restores the current state, not actually re-acquiring those states. For example, during the replay of an inactive job, we don't send an alloc request to the scheduler as it passes through SCHED state.

We actually want the job to transition from SCHED back to PRIORITY on a restart because we want the priority plugin to be contacted in that conext to allow for an asynchronous response. Now maybe that design point could be separately debated. I think the result of the previous discusson on it was that we should allow for the plugin to go get info about a user that it doesn't know about (or at least the possibility of that) before the job becomes schedulable.

Does that make sense and did I understand your point?

I wonder if in the future we'll want to be able to revert back to DEPEND state for jobs for some reason, e.g. if we allow job updates and a new dependency or minimum job begin-time is applied.

Yeah a transition from ALLOC back to DEPEND probably does make sense in that case IMHO. So I guess that's a +1 in favor of dropping the no-cycles rule.

chu11 commented 3 years ago

On synchronizing job-info, first I don't think the queue priority should be considered valid when the job is not in SCHED state (in a sense it is not in a queue at that point). I do think, for testing if nothing else, users should be able to fetch the priority value, for example via a custom flux jobs format, but it would be appropriate to print - or similar if the job is not in SCHED state.

That sounds good. Its definitely not needed except for testing / curiosity. Perhaps queue priority is available on a "best effort" level for non-pending jobs.

Edit: oh we still have the problem of getting the latest priority if job-info is reloaded, since the journal might have wrapped. Hmm.

Just throwing out an idea. Could the job-manager be queried with a set of jobids and the queue priorities could be returned? Or perhaps it simply informs the job-manager to write those queue priorities to the journal and doesn't even have to respond to the RPC?

grondo commented 3 years ago

Sorry, I misspoke, by "state" I meant job data acquired by some part of the system which isn't available in the eventlog, the priority plugin initialization being one example.

On Sat, Nov 14, 2020, 10:18 AM Jim Garlick notifications@github.com wrote:

Oh I view the eventlog replay as a process that restores the current state, not actually re-acquiring those states. For example, during the replay of an inactive job, we don't send an alloc request to the scheduler as it passes through SCHED state.

We actually want the job to transition from SCHED back to PRIORITY on a restart because we want the priority plugin to be contacted in that conext to allow for an asynchronous response. Now maybe that design point could be separately debated. I think the result of the previous discusson on it was that we should allow for the plugin to go get info about a user that it doesn't know about (or at least the possibility of that) before the job becomes schedulable.

Does that make sense and did I understand your point?

I wonder if in the future we'll want to be able to revert back to DEPEND state for jobs for some reason, e.g. if we allow job updates and a new dependency or minimum job begin-time is applied.

Yeah a transition from ALLOC back to DEPEND probably does make sense in that case IMHO. So I guess that's a +1 in favor of dropping the no-cycles rule.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/flux-framework/flux-core/issues/3311#issuecomment-727245823, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFVEUUQSN6MFDFG6RIJUKDSP3CVXANCNFSM4TMALYEQ .

garlick commented 3 years ago

by "state" I meant job data acquired by some part of the system which isn't available in the eventlog, the priority plugin initialization being one example.

You mean like if we had another state like PRIORITY used to acquire other data with the same semantics? I guess then you could use the same reload event to transition out of either state. That seems unlikely to me but I wouldn't be opposed to renaming priority-invalidate to reload or similar just in case. I guess it would mean that we would want to post a reload event to every active job, not just those in SCHED state. Hmm, actually, that could well be a useful thing to log for running jobs too, for diagnosing future issues. OK, I kind of like that!

garlick commented 3 years ago

Just throwing out an idea. Could the job-manager be queried with a set of jobids and the queue priorities could be returned? Or perhaps it simply informs the job-manager to write those queue priorities to the journal and doesn't even have to respond to the RPC?

Yeah I was mulling around ideas like that, but wanted to be sure we don't a) pollute the journal concept, or b) create more ordering issues that increase complexity for journal consumer.

chu11 commented 3 years ago

Yeah I was mulling around ideas like that, but wanted to be sure we don't a) pollute the journal concept, or b) create more ordering issues that increase complexity for journal consumer.

Slightly alternate idea, RPC to the job-manager tells it to replay the entire eventlog journal for a specific jobid.

garlick commented 3 years ago

Slightly alternate idea, RPC to the job-manager tells it to replay the entire eventlog journal for a specific jobid.

Except we don't currently keep any of the events around in the job manager...

I was just wondering if we should get rid of the JOURNAL_ONLY event idea and instead have a streaming RPC that tracks requested non-persistent attributes like priority?

You'd know better than I what complications would arise when trying to consume data from both, when they are only loosely synchronized. I guess you'd have to deal with a priority update before you know about the job from the journal. Maybe if those things were kept in a separate hash in job-info that's accessed on demand (rather than having an RPC in the job-manager to provide them on demand?) Could we go one further and put that "attribute hash" or priority hash or whatever in a reusable libjob container?

I think we've gone off topic for the plugin interface, and this issue is getting quite long. I'll open a new issue.

garlick commented 3 years ago

Summary of points discussed by @grondo and @garlick offline:

let's focus on minimal priority plugin interface now, while planning for generalized job manager plugins in the future
there should be a default priority "builtin plugin"
TOML config can select loaded job manager plugins:
- load a glob of plugin paths: [job-manager] plugins = "glob"
- per plugin config also: [job-amanger.plugin.priority] accounting-db = "uri"
let's reuse <flux/plugin.h> to define plugin interfaces
callback: new job received (plugin may set priority asynchronously)
callback: update priority (plugin must respond immediately, e.g. this is a function call)
plugin may request job manager to update all jobs' priorities; job manager would then iterate over priority update
a job manager handle with accessor to module flux_t will be needed
a handle representing a job will be needed
plugin may store arbitrary data with job manager or job handles

I might have missed a few nuances! Please fill in @grondo

grondo commented 3 years ago

Looks pretty good to me!

callback: new job received (plugin may set priority asynchronously)

callback: update priority (plugin must respond immediately, e.g. this s a function call)

Will a priority plugin also need a callback when a job is inactive? e.g. if it has some internal state related to jobs it will need to know when it can free that state (assuming the state is not directly associated with the job handle, in which case the state will be auto-destroyed).

Therefore, for future extensibility, I'd propose a set of job. callbacks, starting with job.new and job.inactive here. Maybe it would make more sense to call these job event callbacks, which are issued when the job transitions to the named state. Later, a callback for each job state could be added, along with other events like job.update (but that is for the generic job-manager plugin case)

dongahn commented 3 years ago

Here are some of my random notes from today's coffee hour so that @cmoussa1 will have some context for tomorrow:

Each job state transition even will be passed to the priority plugin. Expectation is that the plugin will use the information to extract requisite global metadata data (e.g., how many running jobs a user has and use that to enforce running job limit etc)
The jobspec for each job will be passed to the plugin as well. If the user uses non-default bank, the plugin can find the bank info the jobspec's attributes section. Different banks will likely produce different priorities. (TODO: Ask Ryan Day to see how important it is to support a single use multiple banks configurations at LLNL.)
There needs to be a way for a priority plugin to be able to register an RPC -- this way, the plugin can use it to receive a bulk fairshare update from flux-accounting on each priority recalculation. RPC needs to be highly performing utilizing asynchronous computations as much as possible. You don't want this third-parity RPC to head-of-line-block other critical RPCs of job-manager. We will need a benchmark to measure its performance (scaling the job count to 10K or similar).
A job needs to be able to get its priority updated while it is in either PRIORITY or SCHED state.
We agreed that bank validation etc should happen by the priority plugin. We need a way for the plugin to raise an exception. Initial we said flux_raise_exception but it seems there are some implication using that. So details are TBD.

chu11 commented 3 years ago

Began working on a prototype implementation for #3325, mostly to get my head around libflux/plugin.h b/c I didn't know the API too well. Some design thoughts/questions I hit along the way (dunno if this should be spun off into a different issue or discussion, posting here for now).

callback: update priority (plugin must respond immediately, e.g. this is a function call)

I was imagining a plugin callback that was something like:

int update_priority (int urgency, uint64_t t_submit, unsigned int *priority);

or with our job-manager job handle

int update_priority (int urgency, uint64_t t_submit, job_manager_job_t *job);

But with the plugin handler call:

int flux_plugin_call (flux_plugin_t *p, const char *name,                                                                                   
                      flux_plugin_arg_t *args);

I don't think you can pass a pointer. There are non-optimal ways to do it:

via the void *arg you set when the handler is created you could set the "job-manager handle", the job could be looked up in the job-manager context and set there. But having the user lookup the job seems like more work than we should expect them to do? Especially for the default priority plugin, which is just going to return the same urgency it was given.
via an aux set/get, which isn't the most obvious/expected way to do it
hackish way, pass pointer via {s:I} in the args, I don't think that's a good idea

Almost feel like flux_plugin_f should be:

typedef int (*flux_plugin_f) (flux_plugin_t *p,
                              const char *topic,
                              flux_plugin_arg_t *args,
                              void *handler_data,
                              void *call_data);

handler_data being the "global" arg set at handler creation time and call_data set on each call.

callback: new job received (plugin may set priority asynchronously)

Using the current plugin API and the flux_plugin_call(), wasn't obvious how to do this. I assumed a prep-check reactor in the plugin to occasionally process new jobs and then call some job_manager_set_job_priority (flux_jobid_t id, unsigned int priority)?

Perhaps would be easier to setup a service that the job-manager would send RPCs too?

Both not the easiest API for a plugin writer. But we can add a convenience library similar to libschedutil that would hide gory details.

Extra Note:

during my prototyping, I did learn that it's scary for the plugin to do searches for job in the job hash. B/c if you have something like this:

job = zhashx_first()
while (job) {
   call update job;
   job = zhashx_next();
}

and "update job" does a zhashx_lookup(), that messes up the current pointer in the iteration. I'm not sure of how to design this safely going for the time being, but something to keep in mind.

Extra Extra Note:

b/c the job-manager ctx->active_jobs is created via job_hash_create() with a bunch of special duplicators and what not set, zhashx_dup() is not safe to use with ctx->active_jobs. Learned this the hard way as I tried to hack around the above issue while prototyping.

See: https://github.com/zeromq/czmq/issues/2144

garlick commented 3 years ago

Uh oh, I think @grondo already has a prototype working for this. I'll let him respond to your comments on the plugin.h interface. I'm pretty sure we want the job manager to do all lookups/iterations on the job hash on behalf of the plugin (or unbeknownst to the plugin).

grondo commented 3 years ago

Yeah, sorry @chu11, before the break I spent a little time on a job manager plugin API which would hopefully be sufficient to develop a priority plugin. The intent was to just come up with the plugin-side of the API (e.g. equivalent to flux/shell.h), but that ended up being difficult to do without also making a prototype - though the prototype ended up being quite simple.

First some general comments on the plugin.h API:

Almost feel like flux_plugin_f should be:
typedef int (*flux_plugin_f) (flux_plugin_t *p,
                              const char *topic,
                              flux_plugin_arg_t *args,
                              void *handler_data,
                              void *call_data);
handler_data being the "global" arg set at handler creation time and call_data set on each call.

void * args to plugin callbacks were purposely left out of the API. When raw pointers to internal data are passed as plugin args, it becomes too tempting to pass pointers to structs which might change between different versions of Flux, thus meaning that a plugin will only work with the version of flux-core it was compiled against. We want to avoid that when possible.

There's a way around this though, and that is to share opaque "handles" with the plugin via the flux_plugin_t *p aux container. This is what is done in the shell plugin API:

/*  Get flux_shell_t object from flux plugin handle
 */
flux_shell_t * flux_plugin_get_shell (flux_plugin_t *p);

/*  Return the current task for task_init, task_exec, and task_exit callbacks:
 *
 *  Returns NULL in any other context.
 */
flux_shell_task_t * flux_shell_current_task (flux_shell_t *shell);

This is safer and more efficient since you aren't passing the flux_shell_t * and flux_shell_task_t * as void *arg, and they are only fetched as needed.

callback: new job received (plugin may set priority asynchronously)

Using the current plugin API and the flux_plugin_call(), wasn't obvious how to do this. I assumed a prep-check reactor in the plugin to occasionally process new jobs and then call some job_manager_set_job_priority (flux_jobid_t id, unsigned int priority)?

In the prototype, I simply added a call to the plugin to each new state transition, before event_job_action(). A priority plugin could then subscribe for a state.priority callback, which would indicate a job is entering the PRIORITY state. The plugin would then either set a priority immediately by pushing a priority on the FLUX_PLUGIN_ARG_OUT. If it doesn't set a priority immediately, the job would stay in PRIORITY state until the plugin asynchronously sets a new priority. How this is done exactly is TDB, but I think @garlick agreed with your idea that a new RPC for setting job priority would work (it would be job-manager sending an RPC to itself), or we could add a new function to the plugin API like

int flux_plugin_set_job_priority (flux_plugin_t *p, flux_jobid_t id, unsigned int priority);

Which would trigger the job manager to move this job out of PRIORITY state on the next check/prep watcher or something.

The other open question is how to allow the plugin to request a reprioritization of all jobs. I think we had said the plugin would request reprioritization and then the job manager would call a priority.set or similar callback for each job in its own good time.

The current prototype is here (without any support for asynchronous priority setting in PRIORITY state, nor a way to request a reprioritization of all jobs)

https://github.com/grondo/flux-core/commit/b57858fab159e86759778c979af15d123870e946

The prototype calls job-manager plugins "jobtap" plugins (because flux_job_manager_plugin.h was too long, and also because they "tap" into job state transitions)

Instead of adding a flux_jobtap_job_t * handle with getters for each bit of job information, the jobtap interface passes all the job data as args in the flux_plugin_arg_t, e.g.:

    flux_plugin_arg_t *args = flux_plugin_arg_create ();
    flux_plugin_arg_pack (args,
                          FLUX_PLUGIN_ARG_IN,
                          "{s:O s:I s:i s:i s:i s:I s:f}",
                          "jobspec", job->jobspec_redacted,
                          "id", job->id,
                          "userid", job->userid,
                          "urgency", job->urgency,
                          "state", job->state,
                          "priority", job->priority,
                          "t_submit", job->t_submit);

The prototype includes 3 "builtin" plugins, e.g. the following is the builtin.default plugin:

/*  The current implementation of priority.set just copies
 *   the urgency to the priority.
 */
static int priority_cb (flux_plugin_t *p,
                        const char *topic,
                        flux_plugin_arg_t *args,
                        void *data)
{
    int urgency = 16;
    flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                            "{s:i}",
                            "urgency", &urgency);
    flux_plugin_arg_pack (args, FLUX_PLUGIN_ARG_OUT,
                          "{s:i}",
                          "priority", urgency);
    return 0;
}

int default_priority_plugin_init (flux_plugin_t *p)
{
    if (flux_plugin_set_name (p, "priority") < 0
        || flux_plugin_add_handler (p,
                                    "state.priority",
                                    priority_cb,
                                    NULL) < 0
        || flux_plugin_add_handler (p,
                                    "priority.set",
                                    priority_cb,
                                    NULL) < 0) {
        return -1;
    }
    return 0;
}

and a hold plugin, which sets prioirty=0 on all incoming jobs:

/* Always set priority=0 so all jobs are submitted in held state
 */
static int hold_cb (flux_plugin_t *p,
                        const char *topic,
                        flux_plugin_arg_t *args,
                        void *arg)
{
    flux_plugin_arg_pack (args, FLUX_PLUGIN_ARG_OUT, "{s:i}", "priority", 0);
    return 0;
}

int hold_priority_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "state.priority", hold_cb, NULL);
}

Finally, here's a plugin that subscribes to all jobtap callbacks and dumps all available info for each job:

static int cb (flux_plugin_t *p,
               const char *topic,
               flux_plugin_arg_t *args,
               void *arg)
{
    json_t *resources;
    flux_jobid_t id;
    uint32_t userid;
    int urgency;
    unsigned int priority;
    flux_job_state_t state;
    double t_submit;
    char *s;

    flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                            "{s:{s:o} s:I s:i s:i s:i s:i s:f}",
                            "jobspec", "resources", &resources,
                            "id", &id,
                            "userid", &userid,
                            "urgency", &urgency,
                            "priority", &priority,
                            "state", &state,
                            "t_submit", &t_submit);

    s = json_dumps (resources, JSON_COMPACT);
    fprintf (stderr, "%s: id=%ju uid=%u urg=%d pri=%u state=%d t_submit=%.5f\n",
             topic, id, userid, urgency, priority, state, t_submit);
    fprintf (stderr, "resources: %s\n", s);
    free (s);
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "*", cb, NULL);
}

For testing purposes, the branch allows plugins to be loaded via a new job-manager.jobtap RPC. The builtin "demo" plugins can be loaded as builtin.default, builtin.age, or builtin.hold, or a path to a DSO can be specified. Use the following python script:

import flux
import sys
print(flux.Flux().rpc("job-manager.jobtap", {"load": sys.argv[1]}).get())

The special none argument will unload the current jobtap plugin and disable jobtap (again for testing purposes).

chu11 commented 3 years ago

it becomes too tempting to pass pointers to structs which might change between different versions of Flux, thus meaning that a plugin will only work with the version of flux-core it was compiled against. We want to avoid that when possible.

Yup I agree. I had read "plugin may store arbitrary data with job manager or job handles" in the design notes above as meaning having access to struct job * and such. But it meaning access in an opaque getter/setter manner makes more sense.

The plugin would then either set a priority immediately by pushing a priority on the FLUX_PLUGIN_ARG_OUT.

Ahhh, this was the big subtlety I missed in the plugin API. That the args parameter could be in & out. That's the way a value could be passed back to the caller.

If it doesn't set a priority immediately, the job would stay in PRIORITY state until the plugin asynchronously sets a new priority. How this is done exactly is TDB, but I think @garlick agreed with your idea that a new RPC for setting job priority would work (it would be job-manager sending an RPC to itself), or we could add a new function to the plugin API like

int flux_plugin_set_job_priority (flux_plugin_t *p, flux_jobid_t id, unsigned int priority);

Which would trigger the job manager to move this job out of PRIORITY state on the next check/prep watcher or something.

It seems you reached the same TBD point that I did. I like the idea of the flux_plugin_set_job_priority() like function, but I kept on coming back to trying to make sure the plugin writer would never have to setup their own prep/check to do asynchronous priority calculations. But perhaps it is inevitable that a plugin writer would have to on occasions.

grondo commented 3 years ago

but I kept on coming back to trying to make sure the plugin writer would never have to setup their own prep/check to do asynchronous priority calculations. But perhaps it is inevitable that a plugin writer would have to on occasions.

Yeah, IIUC in most cases a priority plugin would be able to return a priority immediately on entry to the PRIORITY state, so asynchronous exit from this state would be the uncommon case.

The use case I believe we were considering is that a fair share plugin might have to contact a database or other service when, for example, it sees the first job for a given user. In this case, the plugin might send an RPC to a remote service to fetch data for that user, and then would update the job priority in the continuation callback of that RPC, when it has the necessary information.

I don't see a need for the plugin itself to register prep/check or idle watchers in this case. The job manager itself should be responsible for any work that should be done in this manner (if necessary).

chu11 commented 3 years ago

The use case I believe we were considering is that a fair share plugin might have to contact a database or other service when, for example, it sees the first job for a given user. In this case, the plugin might send an RPC to a remote service to fetch data for that user, and then would update the job priority in the continuation callback of that RPC, when it has the necessary information.

I was also thinking of a case where the plugin wanted to "batch" calculate a bunch of priorities. Perhaps the database needs to do some big calculation or something so we're going to wait awhile. Perhaps this is not a common case and we don't have to think about it too much.

garlick commented 3 years ago

As I recall the idea we talked about was to have some way for the plugin to notify the job manager that it wants to iterate over all jobs (or perhaps all jobs in a particular set of states), calling the plugin callback for each job. That would let the job manager do the prep/check/idle, if necessary to avoid becoming unresponsive if, say, it needs to iterate over a million jobs.

chu11 commented 3 years ago

As I recall the idea we talked about was to have some way for the plugin to notify the job manager that it wants to iterate over all jobs (or perhaps all jobs in a particular set of states), calling the plugin callback for each job. That would let the job manager do the prep/check/idle, if necessary to avoid becoming unresponsive if, say, it needs to iterate over a million jobs.

I was thinking more about the case a bajillion new jobs were submitted in short order. So the priority plugin got a bunch of job.new notifications in short order.

garlick commented 3 years ago

Oh I see. Hmm, well I guess that sort of thing would be possible since the plugin has access to the flux_t handle and reactor, but the idea would be to minimize the need to place less burden on the plugin writer.

chu11 commented 3 years ago

Oh I see. Hmm, well I guess that sort of thing would be possible since the plugin has access to the flux_t handle and reactor, but the idea would be to minimize the need to place less burden on the plugin writer.

Its certainly not a common case and probably one we don't need to fret about too much right now. I'm sure my thoughts were just wandering over many ideas over the long weekend :-)

grondo commented 3 years ago

I was thinking more about the case a bajillion new jobs were submitted in short order. So the priority plugin got a bunch of job.new notifications in short order.

The batching could be done in a plugin by having internal state that keeps a list of jobs which need their priority updated if an asynchronous update is in progress.

For example, the first job state.priority callback triggers an RPC to a database service and puts the current job as the first job on a list of pending updates. While the RPC is in progress, new jobs are simply appended to the internal list. Once the RPC continuation callback runs, the priority plugin could iterate over all jobs recalculating their priorities and calling e.g. flux_plugin_set_job_priority(p, jobid, priority) for each job.

Otherwise, if no RPC is in progress and the priority can be calculated immediately (i.e. all factors are currently available), then the plugin would just calculate and immediately return the priority.

Note that a plugin could also schedule a periodic database query by using a timer or periodic watcher on the flux handle. That RPC could be treated the same as above, or once it completes, the plugin could request a priority recalc for all jobs.

garlick commented 3 years ago

Oh duh, that could be very common. Good discussion!

grondo commented 3 years ago

Should have been closed by #3464, though there may be some unimplemented thoughts from this issue we want to push off to new issues. (Feel free to reopen if desired)

flux-framework / flux-core

job-manager: add priority plugin interface #3311