Closed trws closed 1 week ago
I see the coverage failure, and will see what I can do about that, but that's going to be a few days. Looks like the uploader we're using is deprecated, and the actual error is because the uploader is using too old a version of gcov compared to the version of gcc on bookworm, but it only matters on qmanager for some reason. 🤷
We switched to codecov-action@v4 in flux-core a while ago. I can attempt to do an update here as well (along with the other deprecated actions), then we can rebase this one.
That would be awesome if you could @grondo!
I think I have a test to use, I just have to incorporate it into sharness. Thankfully the unreservable aspect is easy to test: submit two jobs requiring the same resource with no time limit, where the first runs at least until the second is considered, then wait for both. If they both succeed, it's all good.
The performance regression case I'm not sure how to test in a way that's reliable. Maybe with the updates @milroy is adding for stats: that way we could watch the number of failed matches and see if it's repeatedly trying to schedule when it can't. We may have to revisit that one.
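To make the stats idea concrete, here is a toy simulation (all names invented; this is not Fluxion code) of why a failed-match counter would expose the regression: if unmatchable jobs stay in the pending queue, the counter grows on every scheduling loop, whereas if they are parked after the first failure, it stays flat.

```python
# Toy model (hypothetical, not Fluxion code): compare failed-match counts
# when unmatchable jobs are re-considered every loop vs. parked in a
# blocked queue after the first EBUSY-style failure.

def run_sched_loops(jobs_unmatchable, loops, park_blocked):
    pending = list(jobs_unmatchable)
    blocked = []
    failed_matches = 0
    for _ in range(loops):
        still_pending = []
        for job in pending:
            failed_matches += 1  # every attempt fails: required node is down
            if park_blocked:
                blocked.append(job)        # park until a status-change notify
            else:
                still_pending.append(job)  # re-considered next loop
        pending = still_pending
    return failed_matches

# Without parking, 3 unmatchable jobs over 10 loops fail 30 matches;
# with parking, each fails exactly once.
print(run_sched_loops(["j1", "j2", "j3"], 10, park_blocked=False))  # 30
print(run_sched_loops(["j1", "j2", "j3"], 10, park_blocked=True))   # 3
```

A test could then assert that the failed-match stat stays near the number of blocked jobs rather than growing with each scheduling loop.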
Ok, added the test I was using locally basically as-is. I didn't see anywhere it clearly fit, but if anyone knows a spot, I can merge it into another file.
Ok, I think this is all cleaned up. Here are the highlights: I no longer needed the `== ENOENT` check or the change to duration logic after the other changes, so I removed those, and I wrapped `flux env` around the test script so that we always get the flux-side python path added on.

This looks cleaner and easier to understand; thanks! I also like that you renamed the ov variable. Fluxion has too many inscrutable variables.
I'm wondering if there is a way to include a test for jobs with unsatisfiable constraints being reconsidered many times. You could submit a few jobs that require a down node and then undrain the node. After the stats update PR #1187 gets merged you could then check the number of failed matches.
I was thinking much the same. If we can get the stats PR in today it would be much easier to write a deterministic test for it with our current setup. In fact I think my existing test would work, it just needs some extra jobs added and a stats check.
> In fact I think my existing test would work, it just needs some extra jobs added and a stats check.
The stats PR got merged. I agree that just adding some extra jobs should be sufficient to test this.
Ok, the test has been extended. We'll have to see how well it holds. As far as I understand the system, it should reliably hit 10, but it's possible it's partially timing-dependent. If we find it's unreliable, we may want to relax the check to `<=` some threshold.
@milroy, any chance I could convince you to take a (hopefully) last pass over here? I think this is about where we need it.
@trws: yes, looking now.
All modified and coverable lines are covered by tests :white_check_mark:
:exclamation: No coverage uploaded for pull request base (`master@a4eb20a`).
:exclamation: Current head 29063a7 differs from pull request most recent head 80457fa. Consider uploading reports for the commit 80457fa to get more accurate results.
This is the in-progress PR for the constraint job blocking problem. Full description below, but we're still lacking two things I really want to have:
problem: Jobs with constraints that can't be matched because nodes are down or drained are currently considered every time we enter the scheduling loop. If they reach the head of the queue, which is likely because we currently only configure one sched queue, they get re-considered over and over despite the fact that they can't run. This greatly slows down scheduling and can cause severe blocking; we've observed up to 20 seconds of delay for a single submission. This change also exposed a bug in the duration calculation for duration=0.0 jobs, which used to be set to the full duration of the graph rather than the remaining duration of the graph.
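A minimal sketch of the duration=0.0 bug (names and numbers are invented for illustration; this is not the actual Fluxion code): an "unlimited" job's span must be clipped to what is left of the resource graph's lifetime, not the graph's full lifetime, or the span extends past the graph's end.

```python
# Hypothetical sketch: computing the effective duration of a
# duration=0.0 ("unlimited") job against a finite resource graph.

GRAPH_START = 0
GRAPH_END = 3600  # resource graph valid for one hour (illustrative)

def effective_duration(duration, now, buggy=False):
    if duration != 0.0:
        return duration
    if buggy:
        # old behavior: full graph duration, so a job starting at `now`
        # would span until now + 3600, past the graph's end
        return GRAPH_END - GRAPH_START
    # fixed behavior: only the time remaining in the graph
    return GRAPH_END - now

now = 1000
print(effective_duration(0.0, now, buggy=True))  # 3600 (span ends past graph end)
print(effective_duration(0.0, now))              # 2600 (span ends exactly at graph end)
```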
solution: Add a new "m_blocked" member to the qmanager base which holds jobs that return EBUSY from alloc_orelse_reserve. This state can only happen when the job is blocked by a constraint requiring a node in an unusable state. The jobs in m_blocked are moved back into m_pending (ready to be considered) by the notify callback. Currently they are moved regardless of which status change occurred (even a node being drained causes them to move), but it's a relatively small cost to move them back afterward, and it simplifies the logic considerably. The duration for 0.0-duration jobs is now set to the remaining graph time rather than the total time at meta build time.
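The queue movement described above can be sketched as follows (a toy model with invented names, not the real qmanager classes): EBUSY parks a job in the blocked list, and a resource-status notify moves everything back to pending for re-consideration.

```python
# Hypothetical sketch of the m_blocked mechanism (not the real qmanager).

EBUSY = "EBUSY"

class Queue:
    def __init__(self):
        self.m_pending = []  # jobs ready to be considered
        self.m_blocked = []  # jobs blocked by a constraint on an unusable node

    def schedule_loop(self, try_alloc_orelse_reserve):
        still_pending = []
        for job in self.m_pending:
            rc = try_alloc_orelse_reserve(job)
            if rc == EBUSY:
                # constraint requires a node in an unusable state:
                # park the job so it isn't re-considered every loop
                self.m_blocked.append(job)
            elif rc != 0:
                still_pending.append(job)  # other failure: stays pending
            # rc == 0: allocated, removed from the pending queue
        self.m_pending = still_pending

    def notify_status_change(self):
        # any resource status change moves blocked jobs back to pending;
        # still-unmatchable ones return EBUSY again and are re-parked
        self.m_pending.extend(self.m_blocked)
        self.m_blocked.clear()

q = Queue()
q.m_pending = ["needs-down-node", "runnable"]
q.schedule_loop(lambda j: EBUSY if j == "needs-down-node" else 0)
print(q.m_blocked)   # ['needs-down-node']
print(q.m_pending)   # []
q.notify_status_change()
print(q.m_pending)   # ['needs-down-node']
```

The design trade-off noted above shows up in `notify_status_change`: it moves jobs back unconditionally rather than inspecting which status changed, accepting an occasional wasted match attempt in exchange for much simpler logic.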