Closed: milroy closed this pull request 3 weeks ago
@grondo or @garlick I don't think the tests I've added are sufficient to test the desired reload behavior. I'm going to devise more, but I'd like your input if you have suggestions.
Fantastic! I'm going to do a little manual testing and see if I can come up with some ideas for automated testing.
So, the expected shortcoming here is that a node-exclusive allocation may not remain exclusive after a scheduler reload? Does that matter? My understanding is that if you ask for, say, one core of a node with 32 cores but set the exclusive flag (or if the scheduler is configured for node exclusivity), you get all 32 cores. So on reload, the scheduler would just see R with the whole node allocated, and it shouldn't matter whether the node was marked exclusive or not, because the "unused" cores are not really unused as far as the scheduler is concerned.
@grondo and I both gave this a quick sanity check where we submitted a sleep inf job, then reloaded the fluxion modules. We found that it works for a simple config (yay!):
sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=378547973585371136)
but not when queues are configured:
jobmanager_hello_cb: ENOENT: map::at: No such file or directory
This same error was noted with rv1 earlier in this comment. Possibly I should have opened a separate issue on it, sorry!
Hmm, my job that failed with queues configured did not set attributes.system.scheduler.queue in R. That appears to be necessary for fluxion to find the right queue:
See also:
I wonder why it is important to put the job into the correct queue after it's already been allocated resources? Assuming that's just the container that is available, what if we just always put these jobs in the default queue?
So, the expected shortcoming here is that a node-exclusive allocation may not remain exclusive after a scheduler reload?
Actually, this PR should handle that case. I am concerned that there are some edge cases where the exclusivity flags per edge aren't set correctly, and then the traversal here: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/traversers/dfu_impl_update.cpp#L627
doesn't set the filters correctly. The implication is probably just loss of traversal optimization rather than a problem with the allocation.
I wonder why it is important to put the job into the correct queue after it's already been allocated resources? Assuming that's just the container that is available, what if we just always put these jobs in the default queue?
That makes sense to me. Should I add logic to this PR to insert the job in the default queue regardless of what attributes.system.scheduler.queue says, since it's being deprecated?
I guess if it makes sense to you! Since the default queue name isn't defined when named queues are being used, I just did this and it seems to work:
diff --git a/qmanager/modules/qmanager_callbacks.cpp b/qmanager/modules/qmanager_callbacks.cpp
index ae8b68f7..205e1101 100644
--- a/qmanager/modules/qmanager_callbacks.cpp
+++ b/qmanager/modules/qmanager_callbacks.cpp
@@ -125,10 +125,7 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,
{
int rc = 0;
- json_t *o = NULL;
- json_error_t err;
std::string R_out;
- char *qn_attr = NULL;
std::string queue_name;
std::shared_ptr<queue_policy_base_t> queue;
std::shared_ptr<job_t> running_job = nullptr;
@@ -148,28 +145,8 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,
goto out;
}
- if ( (o = json_loads (R, 0, &err)) == NULL) {
- rc = -1;
- errno = EPROTO;
- flux_log (h, LOG_ERR, "%s: parsing R for job (id=%jd): %s %s@%d:%d",
- __FUNCTION__, static_cast<intmax_t> (id),
- err.text, err.source, err.line, err.column);
- goto out;
- }
- if ( (rc = json_unpack (o, "{ s?:{s?:{s?:{s?:s}}} }",
- "attributes",
- "system",
- "scheduler",
- "queue", &qn_attr)) < 0) {
- json_decref (o);
- errno = EPROTO;
- flux_log (h, LOG_ERR, "%s: json_unpack for attributes", __FUNCTION__);
- goto out;
- }
-
- queue_name = qn_attr? qn_attr : ctx->opts.get_opt ().get_default_queue_name ();
- json_decref (o);
- queue = ctx->queues.at (queue_name);
+ queue_name = ctx->queues.begin()->first;
+ queue = ctx->queues.begin()->second;
running_job = std::make_shared<job_t> (job_state_kind_t::RUNNING,
id, uid, calc_priority (prio),
ts, R);
I'll see if I can work up a test script that covers the basics here. The tests that are failing (before the patch I posted above) are
17/92 Test #17: t1009-recovery-multiqueue.t ...........***Failed 20.74 sec
[snip]
not ok 6 - recovery: works when both modules restart (rv1)
#
# flux dmesg -C &&
# remove_qmanager &&
# reload_resource match-format=rv1 policy=high &&
# load_qmanager_sync &&
# check_requeue ${jobid1} batch &&
# check_requeue ${jobid2} debug &&
# test_must_fail flux job wait-event -t 0.5 ${jobid3} start &&
# test_expect_code 3 flux ion-resource info ${jobid3}
#
expecting success:
flux cancel ${jobid2} &&
flux job wait-event -t 10 ${jobid4} start
flux-job: wait-event timeout on event 'start'
not ok 7 - recovery: a cancel leads to a job schedule (rv1)
#
# flux cancel ${jobid2} &&
# flux job wait-event -t 10 ${jobid4} start
#
expecting success:
flux cancel ${jobid1} &&
flux cancel ${jobid3} &&
flux cancel ${jobid4} &&
flux job wait-event -t 10 ${jobid4} release
1713212995.938612 release ranks="all" final=true
[snip]
# failed 2 among 10 test(s)
I'll see if I can work up a test script that covers the basics here. The tests that are failing (before the patch I posted above) are
I'm a bit confused. Those tests pass for me before applying the patch above and fail after applying the patch.
Oh, hmm, maybe I wasn't testing the right thing. Will double check.
I retested with your branch and that test is not failing. Sorry about that!
@milroy - I'm out of time for today but I did push a new test script to my rv1-update-jim branch.
I guess that test breakage is mine :-(
@milroy - I'm out of time for today but I did push a new test script to my rv1-update-jim branch.
Thanks for the help @garlick!
jobmanager_hello_cb: ENOENT: map::at: No such file or directory
This problem appears to be due to the default queue name ("default") key not existing in the ctx->queues map. Apparently it never gets added.
It looks like the default name getter is returning garbage:
queue_it = ctx->queues.find (ctx->opts.get_opt ().get_default_queue_name ());
if (queue_it != ctx->queues.end ()) {
queue = queue_it->second;
queue_name = queue_it->first;
} else {
flux_log (h, LOG_ERR, "%s: queue %s not recognized; default is %s ",
__FUNCTION__, queue_name,
ctx->opts.get_opt ().get_default_queue_name ());
rc = -1;
goto out;
}
And the log output:
Apr 16 01:17:52.765851 sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: queue ???L?? not recognized; default is ??L??
The default is set here: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/qmanager/modules/qmanager_opts.hpp#L137
and there doesn't appear to be a way to update the value. The opt getter appears to be corrupting the value somehow: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/src/common/liboptmgr/optmgr.hpp#L71
In case it helps, here is some background on the default queue - it is confusing in this context because queues in fluxion predate queues in flux-core. (I hope my fluxion assertions are correct as I'm making them from memory)
In flux-core, when no named queues are configured, there is a single anonymous queue. The default queue is instantiated in fluxion when no named queues are configured in flux core.
When named queues are configured in flux-core, there is no anonymous queue, and only the named queues are instantiated in fluxion. The default queue is undefined.
To make matters more confusing, flux-core has a concept of a default queue. However in this context, the default queue is a jobspec modification at job ingest time. If a default queue is not configured then jobs that don't specify a queue when named queues are configured are rejected at ingest.
The tests that are failing after my patch was added may indicate a problem with the approach of placing allocated jobs in a random (first) queue, particularly this test, where a job in the same queue as a job that was "recovered" is supposed to be scheduled when the recovered job is canceled:
not ok 7 - recovery: a cancel leads to a job schedule (rv1)
#
# flux cancel ${jobid2} &&
# flux job wait-event -t 10 ${jobid4} start
#
Possibly the problem is that a queue that has reached a point where nothing in it can be scheduled is not revisited until the allocated jobs are removed from it. Thus placing a job in a random queue on restart may prevent that queue from making progress when the job is freed.
This comment in queue_policy_base.hpp bolsters this somewhat:
/*! Return true if this queue has become schedulable since
* its state had been reset with set_schedulability (false).
* "Being schedulable" means one or more job or resource events
* have occurred such a way that the scheduler should run the
* scheduling loop for the pending jobs: e.g., a new job was
* inserted into the pending job queue or a job was removed from
* the running job queue so that its resource was released.
*/
bool is_schedulable ();
Since "hello" is not on a critical performance path, maybe the simplest thing would be to look up the jobspec to obtain the queue name. I'll experiment with that.
BTW I think the garbage queue name from flux_log() may be due to the fact that it is a std::string and needs the c_str() method called on it before being passed to flux_log().
I think I've got a more workable solution where the queue name is picked up from the jobspec, however, my new test is failing in interesting ways that I need to understand. I'll post an update soon once I've figured that out.
OK, those tests are passing now. Feel free to cherry pick these commits from my branch (instead of the originals)
caec1a4f501ea7882d83916bcd9d3c5cfc9b248b - qmanager: look up job queue name in KVS
2e2ae23d06bb18a999a36c8ebda0a8de241ece9b - testsuite: add test for reloading with rv1_nosched
I just realized several error paths lead to success in that hello callback, which could cause double booking of resources. So I've got a couple more commits:
bfc1617f89961902a1efa33b9c60f3ad26b161ad qmanager: fix error handling during hello
7f11e191c645fb8ac1cecfee9c34bb7a1237cedd manager: improve unknown queue hello error
The first one at least ought to be included in this PR since it's somewhat serious.
BTW I think the garbage queue name from flux_log() may be due ot the fact that it is a std:string and needs the c_str() method called on it before being passed to flux_log().
You're right. I should have seen that.
In case it helps, here is some background on the default queue - it is confusing in this context because queues in fluxion predate queues in flux-core. (I hope my fluxion assertions are correct as I'm making them from memory)
The background does help - thanks for the additional details. Now I understand what was going on with the missing "default" queue.
NP! Did you want me to push those commits to your branch and then you can review them a little more conveniently?
I just realized several error paths lead to success in that hello callback, which could cause double booking of resources.
Yikes, thanks for catching this! For the sake of clarity, prior to this PR there's one error path that can lead to a spurious success: https://github.com/flux-framework/flux-sched/blob/8895beafd287c063b498c442481210ee7640eb71/qmanager/modules/qmanager_callbacks.cpp#L141-L148
NP! Did you want me to push those commits to your branch and then you can review them a little more conveniently?
No need! I cherry picked them into my fork and pushed here. Now that I think of it, though, it might be better to update https://github.com/flux-framework/flux-sched/commit/bfc1617f89961902a1efa33b9c60f3ad26b161ad to fix the spurious success return before the PR. (So update the initial value of rc
to be -1
and set rc = 0
before out
and return
.
Thanks! And agreed, I overstated the case on the spurious successes (and that one is pretty unlikely).
It may not be likely, but it's possible. That means it will surface someday if it hasn't already!
Moving on to the duplicate vertex failures. When I uncomment the check for duplicate vertices in the code added by this PR the following tests fail:
The following tests FAILED:
30 - t1022-property-constraints.t (Timeout)
31 - t1023-multiqueue-constraints.t (Timeout)
46 - t3012-resource-properties.t (Failed)
84 - t4011-match-duration.t (Timeout)
(The Valgrind test is also likely failing but hasn't hit the timeout in my running check.) t3012 is failing because for some reason it's checking whether getting and setting properties succeeds with duplicate nodes (e.g., from this R):
{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 0.0, "expiration": 0.0, "nodelist": ["fluke,fluke"]}}
I'm planning to remove that check because it doesn't make sense to me.
The other tests fail because they appear to be adding duplicate hostnames with the same rank:
sched-fluxion-resource.err[0]: grow_resource_db_rv1exec: db.load: unpack_rank: found duplicate vertex in graph for docker-desktop on rank: 1.
Test t1023-multiqueue-constraints.t is failing because --test-size=N (in test_under_flux in sharness.sh) generates N nodes with the same hostname. I am almost certain the other tests fail for the same reason.
I tried using --test-hosts="fake[0-3]", but that gets overridden:
milroy1@docker-desktop:/usr/src$ flux start --test-size=4 --test-hosts="fake[0-3]"
milroy1@docker-desktop:/usr/src$ flux kvs get resource.R
{"version":1,"execution":{"R_lite":[{"rank":"0-3","children":{"core":"0-7"}}],"starttime":0.0,"expiration":0.0,"nodelist":["docker-desktop,docker-desktop,docker-desktop,docker-desktop"]}}
Does anyone know why that might be happening? Alternatively, is there a way to automatically generate N unique hostnames based on --test-size=N?
That is expected - what changed to make those tests fail? (They are passing in CI here.)
Multiple brokers per host is a supported configuration.
That is expected - what changed to make those tests fail? (They are passing in CI here.)
I enabled this check during the graph initialization: https://github.com/flux-framework/flux-sched/blob/14d191ea32b565ca70b8b126bc6c4716d93299e5/resource/readers/resource_reader_rv1exec.cpp#L248-L258
Multiple brokers per host is a supported configuration.
The assumption the current version of this PR makes is that the mapping of rank to host is one-to-one. Since that assumption isn't valid, I can't rely on the by_rank graph metadata to infer exclusivity and check for duplicate vertices. I'll have to think of a different way to do that.
Sorry, I should have gotten that from what you had said earlier. Yeah, there will be an N:1 mapping of ranks to hosts when flux start --test-size=N is used (for N > 1). In those situations, each broker rank is treated as a distinct node even though the same cores and GPUs will be seen (and enumerated incorrectly as unique resources) by all co-located brokers. It's pretty useful for testing despite the resource overcommit that will occur.
Sorry, I should have gotten that from what you had said earlier.
I don't think what I said was detailed enough to conclude that I was assuming a 1:1 rank:host mapping.
In those situations, each broker rank is treated as a distinct node even though the same cores and GPUs will be seen (and enumerated incorrectly as unique resources) by all co-located brokers.
I appreciate the additional info about the behavior and the use cases. I'll get to work on updating this PR to improve find_vertex and the exclusivity detection to account for N:1 mappings.
I fixed a case in find_vertex where the uninitialized vtx was getting returned when it was not found in the graph. Since an uninitialized vtx_t != boost::graph_traits<resource_graph_t>::null_vertex (), that caused the check to identify duplicate vertices when initializing the graph.
The existing exclusivity detection works for N:1 mapping of ranks to hosts, so this PR is ready for review.
Reviewing now, hopefully I can get through a pass before I get to the airport.
Thanks for the review and feedback @trws! It may be worth taking a look at the changes I made to address your comment on early exits.
I updated the early exit code sections to be more aligned with how Fluxion currently handles conditional exits. For example:
const auto &vtx_iter = m.by_path.find (path);
// Not found; return null_vertex
if (vtx_iter == m.by_path.end ())
return null_vtx;
// Found in by_path
for (vtx_t v : vtx_iter->second) {
if (g[v].rank == rank
&& g[v].id == id
&& g[v].size == size
&& g[v].type == type) {
return v;
Versus:
const auto &vtx_iter = m.by_path.find (path);
// Not found; return null_vertex
if (vtx_iter == m.by_path.end ()) {
return null_vtx;
} else {
// Found in by_path
for (vtx_t v : vtx_iter->second) {
if (g[v].rank == rank
[. . .]
Do you have a preference between these?
Do you have a preference between these?
The former. It's a style I'm pretty sure I learned from @garlick and @grondo in flux-core. We have a lot of code with many potential error-propagation conditions; if we use the second style, then every time we check one we have to increase a nesting level, whereas the former leaves everything flat. To me it makes the code much more readable, because you can easily tell the difference between a condition that's an early exit to propagate an error or unknown condition, and code where there are actually two different code paths during normal execution.
Avoiding nesting isn't a hard and fast rule, but this case is about as close as it gets for me. If the then branch of the condition ends in either a return or a goto cleanup or similar, it shouldn't have an else, because the execution is identical whether it nests or not.
To me it makes the code much more readable, because you can easily tell the difference between a condition that's an early exit to propagate an error or unknown condition and code where there are actually two different code-paths during normal execution.
That makes sense and I agree, but I thought I'd check.
Bounced the go test runner, looks like it failed to fetch the image for some reason. Feel free to set MWP @milroy, great stuff!
Thanks for the detailed and helpful feedback @trws and @garlick! Setting MWP.
This PR adds support for update () with the rv1exec/rv1_nosched format. Previously, only JGF was implemented and supported for Fluxion reload. Since RV1 without the scheduling key doesn't contain information about edges, exclusivity, subsystems, sizes, etc., assumptions need to be made about the graph structure. The only supported subsystem is currently containment. Ranks are assumed to exist at the node level, and node-to-child edges are assumed to be marked exclusive. Applying the reader to a resource graph created from another format (e.g., JGF) may result in undesirable behavior due to the mismatch between graph granularities.
Handling ranks in the unit tests is different from other resource-query-based tests. That is due to the fact that resource-query simulates the behavior of the scheduler independent of the resource manager and execution system. To enable sharness tests with resource-query, the files need to explicitly map ranks >= 0 to resources.
Note that the implementation in this PR does not update the aggregate or exclusivity filters. That functionality is performed by the traverser here: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/traversers/dfu_impl_update.cpp#L613-L644
~Note that there is one TODO:~ Using the find_vertex () function when creating vertices results in errors in existing unit tests. I'm reasonably sure this is due to the fact that only one rank exists (-1) for the unit tests, but I need to double-check.
Resolves issue #991.