flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
84 stars 39 forks source link

qmanager: Add hello/exception callback support #493

Closed dongahn closed 4 years ago

dongahn commented 4 years ago

This PR fills in hello callback and exception callback as required by job-manager. The hello callback allows qmanager to fill in its running queue as part of load sequence and the exception callback allows for canceling the pending jobs (i.e., jobs with pending alloc requests).

Note that the hello callback is working in progress towards full resilience. More comprehensive solution with respect to resource will require a solution for Issue #470. This isn't planned until later this year.

Resolve #492.

codecov-io commented 4 years ago

Codecov Report

Merging #493 into master will increase coverage by 0.52%. The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #493      +/-   ##
==========================================
+ Coverage   74.94%   75.46%   +0.52%     
==========================================
  Files          60       60              
  Lines        6074     6119      +45     
==========================================
+ Hits         4552     4618      +66     
+ Misses       1522     1501      -21
Impacted Files Coverage Δ
qmanager/policies/base/queue_policy_base.hpp 100% <100%> (ø) :arrow_up:
qmanager/modules/qmanager.cpp 75.72% <70.37%> (+9.06%) :arrow_up:
src/common/libschedutil/hello.c 55.17% <77.77%> (+32.95%) :arrow_up:
qmanager/policies/base/queue_policy_base_impl.hpp 82.35% <86.95%> (+10.18%) :arrow_up:
src/common/libschedutil/alloc.c 66.07% <0%> (+5.35%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ef69e63...777107f. Read the comment docs.

garlick commented 4 years ago

Recovery of userid, t_submit, priority from the job eventlog within the hello phase looks good to me, and so does including the jobid in the free callback. It might be useful to move the eventlog stuff into libschedutil alongside the code that reads in R from the KVS rather than embedding it in qmanager.

19c37e9 is the heart of this PR and doesn't have any commit description, so I would suggest adding a few paragraphs of explanation. Splitting out the changes to libschedutil to their own commit would be helpful in case we backport to flux-core, or to enable standalone testing.

dongahn commented 4 years ago

It might be useful to move the eventlog stuff into libschedutil alongside the code that reads in R from the KVS rather than embedding it in qmanager.

Excellent suggestion! Other schedulers may need this through libschedutil. Will do.

Splitting out the changes to libschedutil to their own commit would be helpful in case we backport to flux-core, or to enable standalone testing.

I thought about this. But qmanager will no longer build at that commit. If having self contained commit is more important than making each commit buildable, I can do this. It seems this is one of those cases.

garlick commented 4 years ago

OK by me then.

Incidentally, my comment about standalone testing applies to the first suggestion (move eventlog code from qmanager to libschedutil), not the second one about splitting the commit. Sorry if that was confusing/silly.

dongahn commented 4 years ago

Maybe, I can move the event code stuff into libschedutil without the API change -- one commit that solely contains schedutil changes alone.

And then the next commit can contain the bulk of the hello callback support and two-line API changes in libschedutil. This way, at least back porting becomes an inch easier...

grondo commented 4 years ago

Should we think about exporting libeventlog from libflux-core.so (or its own dso)? It seems more general purpose and less likely to change than libschedutil.

Also, are the changes here going to be easy to resolve with flux-framework/flux-core#2226?

dongahn commented 4 years ago

Should we think about exporting libeventlog from libflux-core.so (or its own dso)? It seems more general purpose and less likely to change than libschedutil.

I think so -- ultimately. I imagine there would be more external users of libeventlog than schedulers and it does make sense. If this happens, I can latch on to it. (I don't think it is a priority though).

dongahn commented 4 years ago

Also, are the changes here going to be easy to resolve with flux-framework/flux-core#2226?

We will see. But my changes will be pretty isolated with a minimal API change so hopefully this doesn't give too much trouble to @SteVwonder.

garlick commented 4 years ago

Just occurred to me - you can request both R and the eventlog through the job-info module with one RPC (instead of going direct to KVS)...

grondo commented 4 years ago

But doesn't job-info have to do 2 rpcs, so 3 in total and higher latency?

garlick commented 4 years ago

I think job-info fetches keys in parallel so probably the same latency. Yeah, never mind - he's already got code to fetch and parse the eventlog, so I think I'm not being especially helpful here!

garlick commented 4 years ago

Let's just fix the job manager to send the entire queue object in the hello response. I'll propose something in flux-core. Stand by...

dongahn commented 4 years ago

Let's just fix the job manager to send the entire queue object in the hello response. I'll propose something in flux-core. Stand by...

OK. I will hold until I see your changes then. Thanks.

dongahn commented 4 years ago

OK. I will hold until I see your changes then. Thanks.

@garlick: Thank you for the hello protocol changes. Just pushed the changes to make use of them. I believe I addressed all the comments that came up so far. Let me know if these are okay I will squash them before merge.

garlick commented 4 years ago

I think it's OK to squash.

I'll have one more look after you do that but I think the job-manager interface part looks good.

dongahn commented 4 years ago

Thanks @garlick. Squashed and forced a push.

garlick commented 4 years ago

315cce4 (add hello/exception callback) seems to include more change than is described in the commit message, e.g. modifications to the free callback error handling, a reordering of qmanager_new(), and queue policy interface changes. Add to commit message or split out interface changes to their own commit?

LGTM otherwise.

dongahn commented 4 years ago

315cce4 (add hello/exception callback) seems to include more change than is described in the commit message, e.g. modifications to the free callback error handling, a reordering of qmanager_new(), and queue policy interface changes. Add to commit message or split out interface changes to their own commit?

@garlick: I'd be happy to do that. I just split the commit into multiples. To minimize later squash, I took liberty forcing a push. From my perspective, this is ready to be merged.

dongahn commented 4 years ago

Wait... just one more change.

dongahn commented 4 years ago

Done.

dongahn commented 4 years ago

@garlick:

Thank you.

Did you want me to press the button or did you want someone wither a better understanding of the new architecture to do it?

Please go ahead and press the button. If there are any problems with other parts of the code, we can revisit that later. This will help meet our next milestones!

dongahn commented 4 years ago

Thanks!!