flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
84 stars 39 forks source link

qmanager: defer jobs with unmatchable constraints #1188

Closed trws closed 1 week ago

trws commented 3 weeks ago

This is the in-progress PR for the constraint job blocking problem. Full description below, but we're still lacking two things I really want to have:

  1. [ ] A test that will surface this if it comes back, other than looking at timing, this is very hard to make an integration test for. A unit test would be easier, trying to come up with something.
  2. [ ] A fix for the stats collection logic that includes failed matches, both the number of them and times of them since that being missing is part of what helped this hide so well.

problem: Jobs with constraints that can't be matched because nodes are down or drained are currently considered every time we enter the scheduling loop. If they reach the head of the queue, which is likely because we currently only configure one sched queue, they get re-considered over and over despite the fact they can't run, which greatly slows down scheduling, and can cause severe blocking observed up to 20 seconds of delay for a single submission. This change also exposed a bug with the duration calculation for duration=0.0 jobs, which used to be set to the full duration of the graph rather than the remaining duration of the graph.

solution: Add a new "m_blocked" member to the qmanager base which holds jobs which return EBUSY from an alloc_orelse_reserve. This state can only happen when the job is blocked by a constraint requiring a node in an unusable state. The jobs in m_blocked are moved into m_pending (ready to be considered) by the notify callback. Currently they are moved regardless of what status changes occurred (a node being drained causes them to move) but it's a relatively small cost to move them back afterward and simplifies the logic considerably. The duration for 0.0 duration jobs is now set to the remaining time rather than the total time during meta build time.

trws commented 2 weeks ago

I see the coverage failure, and will see what I can do about that, but that's going to be a few days. Looks like the uploader we're using is deprecated, and the actual error is because the uploader is using too old a version of gcov compared to the version of gcc on bookworm, but it only matters on qmanager for some reason. 🤷

grondo commented 2 weeks ago

We switched to codecov-action@v4 in flux-core awihle ago. I can attempt to do an update here as well (along with the other deprecated actions) then we can rebase this one.

trws commented 2 weeks ago

That would be awesome if you could @grondo!

trws commented 2 weeks ago

I think I have a test to use, I just have to incorporate it into sharness. Thankfully the unreservable aspect is an easy test, submit two jobs requiring the same resource with no time limit first one that runs at least until the second is considered, wait for both. If they both succeed it's all good.

The performance regression case I'm not sure how to test in a way that's reliable. Maybe with the updates @milroy is adding for stats, that way we could watch for number of failed matches and see if it's repeatedly trying to schedule when it can't? We may have to revisit that one.

trws commented 2 weeks ago

Ok, added the test I was using locally basically as-is. I didn't see anywhere that it clearly fit, but if anyone knows a spot I can merge it into some other file.

trws commented 2 weeks ago

Ok, I think this is all cleaned up. Here are the highlights:

  1. @milroy was completely right, there was no need for the extra ==ENOENT check or the change to duration logic after the other changes, so I removed those.
  2. rather than trying to fix up the python script shebang, I added an invocation of flux env around the test script so that we always get the flux-side python path added on
  3. formatting nits applied, planning to come back to this and get clang-format and clang-tidy working on this repo so I stop missing things like that
  4. commit adding the test now actually runs the test (slightly important :facepalm:)
milroy commented 2 weeks ago

This looks cleaner and easier to understand; thanks! I also like that you renamed the ov variable. Fluxion has too many inscrutable variables.

I'm wondering if there is a way to include a test for jobs with unsatisfiable constraints being reconsidered many times. You could submit a few jobs that require a down node and then undrain the node. After the stats update PR #1187 gets merged you could then check for the number of failed matches.

trws commented 2 weeks ago

I was thinking much the same. If we can get the stats PR in today it would be much easier to write a deterministic test for it with our current setup. In fact I think my existing test would work, it just needs some extra jobs added and a stats check.


Sent from Workspace ONE Boxerhttps://whatisworkspaceone.com/boxer On May 1, 2024 at 12:21:52 AM PDT, Daniel Milroy @.***> wrote:

This looks cleaner and easier to understand; thanks! I also like that you renamed the ov variable. Fluxion has too many inscrutable variables.

I'm wondering if there is a way to include a test for jobs with unsatisfiable constraints being reconsidered many times. You could submit a few jobs that require a down node and then undrain the node. After the stats update PR #1187https://urldefense.us/v3/__https://github.com/flux-framework/flux-sched/pull/1187__;!!G2kpM7uM-TzIFchu!3ysUsHM_4wNKdWvUhB_nrCia0qqXTkYf2ZDn8JgwDElXd9UmwK_pVHSCXY1ndll2B0KHnw_OSZqJoantred-UCnycl4$ gets merged you could then check for the number of failed matches.

— Reply to this email directly, view it on GitHubhttps://urldefense.us/v3/__https://github.com/flux-framework/flux-sched/pull/1188*issuecomment-2088084833__;Iw!!G2kpM7uM-TzIFchu!3ysUsHM_4wNKdWvUhB_nrCia0qqXTkYf2ZDn8JgwDElXd9UmwK_pVHSCXY1ndll2B0KHnw_OSZqJoantred-72JDVLo$, or unsubscribehttps://urldefense.us/v3/__https://github.com/notifications/unsubscribe-auth/AAFBFNIZUJDBPU5LO24BO3TZACJYRAVCNFSM6AAAAABG3XX4EKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBYGA4DIOBTGM__;!!G2kpM7uM-TzIFchu!3ysUsHM_4wNKdWvUhB_nrCia0qqXTkYf2ZDn8JgwDElXd9UmwK_pVHSCXY1ndll2B0KHnw_OSZqJoantred-c34ixAk$. You are receiving this because you authored the thread.Message ID: @.***>

milroy commented 2 weeks ago

In fact I think my existing test would work, it just needs some extra jobs added and a stats check.

The stats PR got merged. I agree that just adding some extra jobs should be sufficient to test this.

trws commented 2 weeks ago

Ok, the test has been extended. We'll have to see how well it holds. As far as I understand the system, it should reliably hit 10, but it's possible it's partially timing dependent. If we find that it's unreliable we may want to adjust the test to <= something.

trws commented 2 weeks ago

@milroy, any chance I could convince you to take a (hopefully) last pass over here? I think this is about where we need it.

milroy commented 1 week ago

@trws: yes, looking now.

codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

:exclamation: No coverage uploaded for pull request base (master@a4eb20a). Click here to learn what that means.

:exclamation: Current head 29063a7 differs from pull request most recent head 80457fa. Consider uploading reports for the commit 80457fa to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1188 +/- ## ======================================== Coverage ? 74.0% ======================================== Files ? 102 Lines ? 14611 Branches ? 0 ======================================== Hits ? 10822 Misses ? 3789 Partials ? 0 ``` | [Files](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1188?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | | |---|---|---| | [qmanager/modules/qmanager.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1188?src=pr&el=tree&filepath=qmanager%2Fmodules%2Fqmanager.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvbW9kdWxlcy9xbWFuYWdlci5jcHA=) | `73.4% <100.0%> (ø)` | | | [qmanager/policies/base/queue\_policy\_base.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1188?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fbase%2Fqueue_policy_base.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvYmFzZS9xdWV1ZV9wb2xpY3lfYmFzZS5ocHA=) | `74.3% <100.0%> (ø)` | | | [qmanager/policies/queue\_policy\_bf\_base\_impl.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1188?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_bf_base_impl.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2JmX2Jhc2VfaW1wbC5ocHA=) | `89.1% <100.0%> (ø)` | | | [resource/modules/resource\_match.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1188?src=pr&el=tree&filepath=resource%2Fmodules%2Fresource_match.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cmVzb3VyY2UvbW9kdWxlcy9yZXNvdXJjZV9tYXRjaC5jcHA=) | `69.3% <100.0%> (ø)` | | | [resource/traversers/dfu.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1188?src=pr&el=tree&filepath=resource%2Ftraversers%2Fdfu.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cmVzb3VyY2UvdHJhdmVyc2Vycy9kZnUuY3Bw) | `87.9% <100.0%> (ø)` | | | [resource/traversers/dfu\_impl.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1188?src=pr&el=tree&filepath=resource%2Ftraversers%2Fdfu_impl.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cmVzb3VyY2UvdHJhdmVyc2Vycy9kZnVfaW1wbC5ocHA=) | `94.7% <100.0%> (ø)` | |