flux-framework / flux-sched

Fluxion Graph-based Scheduler

Implement rv1exec reader update to facilitate Fluxion reload #1176

Closed (milroy closed this 3 weeks ago)

milroy commented 1 month ago

This PR adds support for update () with the rv1exec/rv1_nosched format. Previously, only JGF was implemented and supported for Fluxion reload. Since RV1 without the scheduling key doesn't contain information about edges, exclusivity, subsystems, sizes, etc., assumptions need to be made about the graph structure. The only supported subsystem is currently containment. Ranks are assumed to exist at the node level, and node-to-child edges are assumed to be marked exclusive. Applying the reader to a resource graph created from another format (e.g., JGF) may result in undesirable behavior due to the mismatch between graph granularities.
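
For illustration, here is a minimal Boost.Graph sketch of those assumptions; the types and field names are illustrative only, not the PR's actual reader code:

    #include <boost/graph/adjacency_list.hpp>
    #include <string>

    // Illustrative only: rv1_nosched carries no edge metadata, so the
    // reader assumes a containment-only hierarchy, places ranks at the
    // node level, and treats node-to-child edges as exclusive.
    struct edge_prop_t {
        std::string subsystem{"containment"};  // only supported subsystem
        bool exclusive{false};
    };
    using graph_t = boost::adjacency_list<boost::vecS, boost::vecS,
                                          boost::directedS,
                                          boost::no_property, edge_prop_t>;

    int main ()
    {
        graph_t g;
        auto node = boost::add_vertex (g);  // ranks assumed at node level
        auto core = boost::add_vertex (g);
        auto [e, ok] = boost::add_edge (node, core, g);
        g[e].exclusive = true;              // node-to-child edge: exclusive
        return ok ? 0 : 1;
    }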

Handling ranks in the unit tests differs from other resource-query-based tests because resource-query simulates the behavior of the scheduler independently of the resource manager and execution system. To enable sharness tests with resource-query, the files need to explicitly map ranks >= 0 to resources.

Note that the implementation in this PR does not update the aggregate or exclusivity filters. That functionality is performed by the traverser here: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/traversers/dfu_impl_update.cpp#L613-L644

~Note that there is one TODO.~

Resolves issue #991.

milroy commented 1 month ago

@grondo or @garlick I don't think the tests I've added are sufficient to test the desired reload behavior. I'm going to devise more, but I'd like your input if you have suggestions.

garlick commented 1 month ago

Fantastic! I'm going to do a little manual testing and see if I can come up with some ideas for automated testing.

So, the expected shortcoming here is that a node-exclusive allocation may not remain exclusive after a scheduler reload? Does that matter? My understanding is that if you ask for, say, one core of a node with 32 cores but set the exclusive flag (or if the scheduler is configured for node exclusivity), you get all 32 cores. So on reload, the scheduler would just see R with the whole node allocated, and it shouldn't matter whether the node was marked exclusive, because the "unused" cores are not really unused as far as the scheduler is concerned.

garlick commented 1 month ago

@grondo and I both gave this a quick sanity check where we submitted a sleep inf job then reloaded the fluxion modules. We found that it works for a simple config (yay!)

sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=378547973585371136)

but not when queues are configured:

jobmanager_hello_cb: ENOENT: map::at: No such file or directory
grondo commented 1 month ago

This same error was noted with rv1 earlier in this comment. Possibly I should have opened a separate issue on it, sorry!

garlick commented 1 month ago

Hmm, my job that failed with queues configured did not set attributes.system.scheduler.queue in R. That appears to be necessary for fluxion to find the right queue:

https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L159

garlick commented 1 month ago

See also:

I wonder why it is important to put the job into the correct queue after it's already been allocated resources? Assuming that's just the container that is available, what if we just always put these jobs in the default queue?

milroy commented 1 month ago

So, the expected shortcoming here is an node-exclusive allocation may not remain exclusive after a scheduler reload?

Actually, this PR should handle that case. I am concerned that there are some edge cases where the exclusivity flags per edge aren't set correctly, and then the traversal here: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/traversers/dfu_impl_update.cpp#L627

doesn't set the filters correctly. The implication is probably just loss of traversal optimization rather than a problem with the allocation.

milroy commented 1 month ago

I wonder why it is important to put the job into the correct queue after it's already been allocated resources? Assuming that's just the container that is available, what if we just always put these jobs in the default queue?

That makes sense to me. Should I add logic to this PR to insert the job in the default queue regardless of what attributes.system.scheduler.queue says, since it's being deprecated?

garlick commented 1 month ago

I guess, if it makes sense to you! Since the default queue name isn't defined when named queues are being used, I just did this and it seems to work.

diff --git a/qmanager/modules/qmanager_callbacks.cpp b/qmanager/modules/qmanager_callbacks.cpp
index ae8b68f7..205e1101 100644
--- a/qmanager/modules/qmanager_callbacks.cpp
+++ b/qmanager/modules/qmanager_callbacks.cpp
@@ -125,10 +125,7 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,

 {
     int rc = 0;
-    json_t *o = NULL;
-    json_error_t err;
     std::string R_out;
-    char *qn_attr = NULL;
     std::string queue_name;
     std::shared_ptr<queue_policy_base_t> queue;
     std::shared_ptr<job_t> running_job = nullptr;
@@ -148,28 +145,8 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,
         goto out;
     }

-    if ( (o = json_loads (R, 0, &err)) == NULL) {
-        rc = -1;
-        errno = EPROTO;
-        flux_log (h, LOG_ERR, "%s: parsing R for job (id=%jd): %s %s@%d:%d",
-                  __FUNCTION__, static_cast<intmax_t> (id),
-                  err.text, err.source, err.line, err.column);
-        goto out;
-    }
-    if ( (rc = json_unpack (o, "{ s?:{s?:{s?:{s?:s}}} }",
-                                   "attributes",
-                                       "system",
-                                           "scheduler",
-                                                "queue", &qn_attr)) < 0) {
-        json_decref (o);
-        errno = EPROTO;
-        flux_log (h, LOG_ERR, "%s: json_unpack for attributes", __FUNCTION__);
-        goto out;
-    }
-
-    queue_name = qn_attr? qn_attr : ctx->opts.get_opt ().get_default_queue_name ();
-    json_decref (o);
-    queue = ctx->queues.at (queue_name);
+    queue_name = ctx->queues.begin()->first;
+    queue = ctx->queues.begin()->second;
     running_job = std::make_shared<job_t> (job_state_kind_t::RUNNING,
                                                    id, uid, calc_priority (prio),
                                                    ts, R);
garlick commented 1 month ago

I'll see if I can work up a test script that covers the basics here. The tests that are failing (before the patch I posted above) are

17/92 Test #17: t1009-recovery-multiqueue.t ...........***Failed   20.74 sec
[snip]
not ok 6 - recovery: works when both modules restart (rv1)
#   
#       flux dmesg -C &&
#       remove_qmanager &&
#       reload_resource match-format=rv1 policy=high &&
#       load_qmanager_sync &&
#       check_requeue ${jobid1} batch &&
#       check_requeue ${jobid2} debug &&
#       test_must_fail flux job wait-event -t 0.5 ${jobid3} start &&
#       test_expect_code 3 flux ion-resource info ${jobid3}
#   

expecting success: 
    flux cancel ${jobid2} &&
    flux job wait-event -t 10 ${jobid4} start

flux-job: wait-event timeout on event 'start'
not ok 7 - recovery: a cancel leads to a job schedule (rv1)
#   
#       flux cancel ${jobid2} &&
#       flux job wait-event -t 10 ${jobid4} start
#   

expecting success: 
    flux cancel ${jobid1} &&
    flux cancel ${jobid3} &&
    flux cancel ${jobid4} &&
    flux job wait-event -t 10 ${jobid4} release

1713212995.938612 release ranks="all" final=true
[snip]
# failed 2 among 10 test(s)
milroy commented 1 month ago

I'll see if I can work up a test script that covers the basics here. The tests that are failing (before the patch I posted above) are

I'm a bit confused. Those tests pass for me before applying the patch above and fail after applying the patch.

garlick commented 1 month ago

Oh, hmm, maybe I wasn't testing the right thing. Will double check.

garlick commented 1 month ago

I retested with your branch and that test is not failing. Sorry about that!

garlick commented 1 month ago

@milroy - I'm out of time for today but I did push a new test script to my rv1-update-jim branch.

I guess that test breakage is mine :-(

milroy commented 1 month ago

@milroy - I'm out of time for today but I did push a new test script to my rv1-update-jim branch.

Thanks for the help @garlick!

milroy commented 1 month ago

jobmanager_hello_cb: ENOENT: map::at: No such file or directory

This problem appears to be due to the default queue name ("default") key not existing in the ctx->queues map. Apparently it never gets added.
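
That would explain the message text: std::map::at throws std::out_of_range when the key is missing, and on libstdc++ the exception's what () string is "map::at", which is what surfaces in the log above. A standalone sketch of the failure:

    #include <cstdio>
    #include <map>
    #include <stdexcept>
    #include <string>

    int main ()
    {
        std::map<std::string, int> queues;   // "default" never inserted
        try {
            (void)queues.at ("default");     // throws std::out_of_range
        } catch (const std::out_of_range &e) {
            std::printf ("%s\n", e.what ()); // prints "map::at" on libstdc++
        }
    }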

milroy commented 1 month ago

It looks like the default name getter is returning garbage:

    queue_it = ctx->queues.find (ctx->opts.get_opt ().get_default_queue_name ());
    if (queue_it != ctx->queues.end ()) {
        queue = queue_it->second;
        queue_name = queue_it->first;
    } else {
        flux_log (h, LOG_ERR, "%s: queue %s not recognized; default is %s ",
                  __FUNCTION__, queue_name,
                  ctx->opts.get_opt ().get_default_queue_name ());
        rc = -1;
        goto out;
    }

And the log output:

Apr 16 01:17:52.765851 sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: queue ???L?? not recognized; default is ??L?? 

The default is set here: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/qmanager/modules/qmanager_opts.hpp#L137

and there doesn't appear to be a way to update the value. The opt getter appears to be corrupting the value somehow: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/src/common/liboptmgr/optmgr.hpp#L71

garlick commented 1 month ago

In case it helps, here is some background on the default queue: it is confusing in this context because queues in fluxion predate queues in flux-core. (I hope my fluxion assertions are correct, as I'm making them from memory.)

In flux-core, when no named queues are configured, there is a single anonymous queue; in that case fluxion instantiates its default queue.

When named queues are configured in flux-core, there is no anonymous queue, and only the named queues are instantiated in fluxion. The default queue is undefined.

To make matters more confusing, flux-core also has a concept of a default queue; there, however, the default queue is applied as a jobspec modification at job ingest time. If no default queue is configured, then jobs that don't specify a queue (when named queues are configured) are rejected at ingest.

garlick commented 1 month ago

The tests that are failing after my patch was added may indicate a problem with the approach of placing allocated jobs in a random (first) queue. In particular this test, where a job in the same queue as a "recovered" job is supposed to be scheduled when the recovered job is canceled:

not ok 7 - recovery: a cancel leads to a job schedule (rv1)
#   
#       flux cancel ${jobid2} &&
#       flux job wait-event -t 10 ${jobid4} start
#   

Possibly the problem is that a queue that has reached a point where nothing in it can be scheduled is not revisited until the allocated jobs are removed from it. Thus placing a job in a random queue on restart may prevent that queue from making progress when the job is freed.

This comment in queue_policy_base.hpp bolsters this somewhat:

    /*! Return true if this queue has become schedulable since
     *  its state had been reset with set_schedulability (false).
     *  "Being schedulable" means one or more job or resource events
     *  have occurred in such a way that the scheduler should run the
     *  scheduling loop for the pending jobs: e.g., a new job was
     *  inserted into the pending job queue or a job was removed from
     *  the running job queue so that its resource was released.
     */
    bool is_schedulable ();
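
Read that way, the restart scenario fits. A sketch of the gating pattern the comment implies (the loop and run_sched_loop () are assumptions here, not qmanager's actual code): a job requeued into the wrong queue never flips the schedulability flag on the queue that actually needs to run.

    // Hypothetical sketch: only queues flagged schedulable get a pass.
    for (auto &kv : ctx->queues) {
        auto &queue = kv.second;
        if (!queue->is_schedulable ())
            continue;                     // no job/resource event; skip
        run_sched_loop (queue);           // hypothetical helper
        queue->set_schedulability (false);
    }
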
garlick commented 1 month ago

Since "hello" is not on a critical performance path, maybe the simplest thing would be to look up the jobspec to obtain the queue name. I'll experiment with that.

garlick commented 1 month ago

BTW I think the garbage queue name from flux_log() may be due to the fact that it is a std::string and needs the c_str() method called on it before being passed to flux_log().
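
Since flux_log () is printf-style, a std::string handed to a %s conversion is read as raw bytes. A minimal sketch of the corrected call from the snippet above, assuming get_default_queue_name () returns a std::string:

    // Convert both std::string arguments before the varargs call:
    flux_log (h, LOG_ERR, "%s: queue %s not recognized; default is %s",
              __FUNCTION__, queue_name.c_str (),
              ctx->opts.get_opt ().get_default_queue_name ().c_str ());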

garlick commented 1 month ago

I think I've got a more workable solution where the queue name is picked up from the jobspec, however, my new test is failing in interesting ways that I need to understand. I'll post an update soon once I've figured that out.

garlick commented 1 month ago

OK, those tests are passing now. Feel free to cherry-pick these commits from my branch (instead of the originals):

caec1a4f501ea7882d83916bcd9d3c5cfc9b248b - qmanager: look up job queue name in KVS

2e2ae23d06bb18a999a36c8ebda0a8de241ece9b - testsuite: add test for reloading with rv1_nosched

garlick commented 1 month ago

I just realized several error paths lead to success in that hello callback, which could cause double booking of resources. So I've got a couple more commits:

bfc1617f89961902a1efa33b9c60f3ad26b161ad qmanager: fix error handling during hello

7f11e191c645fb8ac1cecfee9c34bb7a1237cedd qmanager: improve unknown queue hello error

The first one at least ought to be included in this PR since it's somewhat serious.

milroy commented 1 month ago

BTW I think the garbage queue name from flux_log() may be due ot the fact that it is a std:string and needs the c_str() method called on it before being passed to flux_log().

You're right. I should have seen that.

milroy commented 1 month ago

In case it helps, here is some background on the default queue - it is confusing in this context because queues in fluxion predate queues in flux-core. (I hope my fluxion assertions are correct as I'm making them from memory)

The background does help; thanks for the additional details. Now I understand what was going on with the missing "default" queue.

garlick commented 1 month ago

NP! Did you want me to push those commits to your branch and then you can review them a little more conveniently?

milroy commented 1 month ago

I just realized several error paths lead to success in that hello callback, which could cause double booking of resources.

Yikes, thanks for catching this! For the sake of clarity, prior to this PR there's one error path that can lead to a spurious success: https://github.com/flux-framework/flux-sched/blob/8895beafd287c063b498c442481210ee7640eb71/qmanager/modules/qmanager_callbacks.cpp#L141-L148

milroy commented 1 month ago

NP! Did you want me to push those commits to your branch and then you can review them a little more conveniently?

No need! I cherry-picked them into my fork and pushed here. Now that I think of it, though, it might be better to update https://github.com/flux-framework/flux-sched/commit/bfc1617f89961902a1efa33b9c60f3ad26b161ad to fix the spurious success return before the PR. (So update the initial value of rc to be -1 and set rc = 0 before out and return.)
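
Concretely, something like the following shape (a sketch of the suggestion; step_one () and step_two () are hypothetical stand-ins for the hello callback's real steps):

    int step_one ();
    int step_two ();

    int hello_sketch ()
    {
        int rc = -1;           // pessimistic: any early goto reports failure
        if (step_one () < 0)
            goto out;
        if (step_two () < 0)
            goto out;
        rc = 0;                // reached only when every step succeeded
    out:
        return rc;
    }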

garlick commented 1 month ago

Thanks! And agreed, I overstated the case on the spurious successes (and that one is pretty unlikely).

milroy commented 1 month ago

It may not be likely, but it's possible. That means it will surface someday if it hasn't already!

Moving on to the duplicate vertex failures: when I uncomment the check for duplicate vertices in the code added by this PR, the following tests fail:

The following tests FAILED:
     30 - t1022-property-constraints.t (Timeout)
     31 - t1023-multiqueue-constraints.t (Timeout)
     46 - t3012-resource-properties.t (Failed)
     84 - t4011-match-duration.t (Timeout)

(The Valgrind test is also likely failing but hasn't hit the timeout in my running check.) t3012 is failing because, for some reason, it checks whether getting and setting properties succeeds with duplicate nodes (e.g., from this R):

{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 0.0, "expiration": 0.0, "nodelist": ["fluke,fluke"]}}

I'm planning to remove that check because it doesn't make sense to me.

The other tests fail because they appear to be adding duplicate hostnames with the same rank:

sched-fluxion-resource.err[0]: grow_resource_db_rv1exec: db.load: unpack_rank: found duplicate vertex in graph for docker-desktop on rank: 1.
milroy commented 1 month ago

Test t1023-multiqueue-constraints.t is failing because --test-size=N (in test_under_flux in sharness.sh) generates N nodes with the same hostname. I am almost certain the other tests fail for the same reason.

I tried using --test-hosts="fake[0-3]", but that gets overridden:

milroy1@docker-desktop:/usr/src$ flux start --test-size=4 --test-hosts="fake[0-3]"
milroy1@docker-desktop:/usr/src$ flux kvs get resource.R
{"version":1,"execution":{"R_lite":[{"rank":"0-3","children":{"core":"0-7"}}],"starttime":0.0,"expiration":0.0,"nodelist":["docker-desktop,docker-desktop,docker-desktop,docker-desktop"]}}

Does anyone know why that might be happening? Alternatively, is there a way to automatically generate N unique hostnames based on --test-size=N?

garlick commented 1 month ago

That is expected. What changed to make those tests fail? (They are passing in CI here.)

Multiple brokers per host is a supported configuration.

milroy commented 1 month ago

That is expected - what changed to make those tests fail (they are passing in CI here)

I enabled this check during the graph initialization: https://github.com/flux-framework/flux-sched/blob/14d191ea32b565ca70b8b126bc6c4716d93299e5/resource/readers/resource_reader_rv1exec.cpp#L248-L258

Multiple brokers per host is a supported configuration.

The assumption the current version of this PR makes is that the mapping of rank to host is one-to-one. Since that assumption isn't valid, I can't rely on the by_rank graph metadata to infer exclusivity and check for duplicate vertices. I'll have to think of a different way to do that.

garlick commented 1 month ago

Sorry, I should have gotten that from what you had said earlier. Yeah, there will be an N:1 mapping of ranks to hosts when multiple brokers run on the same host.

In those situations, each broker rank is treated as a distinct node even though the same cores and GPUs will be seen (and enumerated incorrectly as unique resources) by all co-located brokers. It's pretty useful for testing despite the resource overcommit that will occur.

milroy commented 1 month ago

Sorry I should have gotten that from what you had said earlier.

I don't think what I said was detailed enough to conclude that I was assuming a 1:1 rank:host mapping.

In those situations, each broker rank is treated as a distinct node even though the same cores and GPUs will be seen (and enumerated incorrectly as unique resources) by all co-located brokers.

I appreciate the additional info about the behavior and the use cases. I'll get to work on updating this PR to improve find_vertex and the exclusivity detection to account for N:1 mappings.

milroy commented 4 weeks ago

I fixed a case in find_vertex where the uninitialized vtx was getting returned when not found in the graph. Since an uninitialized vtx_t is not equal to boost::graph_traits<resource_graph_t>::null_vertex (), the check identified spurious duplicate vertices when initializing the graph.
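
In other words (a sketch of the failure mode, with vtx_t and resource_graph_t as used in fluxion):

    // A default-constructed vertex descriptor holds an indeterminate
    // value, so the "not found" comparison below could never be trusted;
    // initializing vtx to null_vertex () makes it meaningful.
    vtx_t null_vtx = boost::graph_traits<resource_graph_t>::null_vertex ();
    vtx_t vtx = null_vtx;  // previously left uninitialized
    // ... search the graph, assigning vtx only on a genuine match ...
    if (vtx == null_vtx) {
        // truly not found: safe to add the vertex without reporting
        // a duplicate
    }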

The existing exclusivity detection works for N:1 mapping of ranks to hosts, so this PR is ready for review.

trws commented 3 weeks ago

Reviewing now, hopefully I can get through a pass before I get to the airport.

milroy commented 3 weeks ago

Thanks for the review and feedback @trws! It may be worth taking a look at the changes I made to address your comment on early exits.

milroy commented 3 weeks ago

I updated the early exit code sections to be more aligned with how Fluxion currently handles conditional exits. For example:

    const auto &vtx_iter = m.by_path.find (path);
    // Not found; return null_vertex
    if (vtx_iter == m.by_path.end ())
        return null_vtx;
    // Found in by_path
    for (vtx_t v : vtx_iter->second) {
        if (g[v].rank == rank
                && g[v].id == id
                && g[v].size == size
                && g[v].type == type) {
            return v;

Versus:

    const auto &vtx_iter = m.by_path.find (path);
    // Not found; return null_vertex
    if (vtx_iter == m.by_path.end ()) {
        return null_vtx;
    } else {
        // Found in by_path
        for (vtx_t v : vtx_iter->second) {
            if (g[v].rank == rank
            [. . .]

Do you have a preference between these?

trws commented 3 weeks ago

Do you have a preference between these?

The former. It's a style I'm pretty sure I learned from @garlick and @grondo in flux-core. We have a lot of code with many potential error-propagation conditions; if we use the second style, then every time we check one we have to add a nesting level, whereas the former leaves everything flat. To me it makes the code much more readable, because you can easily tell the difference between a condition that's an early exit to propagate an error or unknown condition, and code where there are actually two different code paths during normal execution.

Avoiding nesting isn't a hard-and-fast rule, but this case is about as close as it gets for me. If the then branch of a condition ends in a return, a goto cleanup, or similar, it shouldn't have an else, because execution is identical whether it nests or not.

milroy commented 3 weeks ago

To me it makes the code much more readable, because you can easily tell the difference between a condition that's an early exit to propagate an error or unknown condition and code where there are actually two different code-paths during normal execution.

That makes sense and I agree, but I thought I'd check.

trws commented 3 weeks ago

Bounced the go test runner; looks like it failed to fetch the image for some reason. Feel free to set MWP, @milroy. Great stuff!

milroy commented 3 weeks ago

Thanks for the detailed and helpful feedback @trws and @garlick! Setting MWP.