Closed: grondo closed this 1 week ago
I verified that this case (reloading `sched-fluxion-resource` with `match-format=rv1_nosched`) is missing from the Fluxion testsuite. Strangely, many other cases are tested, so I wonder if the omission was deliberate? Anyway, adding this test to the testsuite demonstrates the issue:
```diff
diff --git a/t/t1007-recovery-full.t b/t/t1007-recovery-full.t
index 5b2de636..0923eeaa 100755
--- a/t/t1007-recovery-full.t
+++ b/t/t1007-recovery-full.t
@@ -150,6 +150,18 @@ test_expect_success 'recovery: qmanager restarts (rv1_nosched->rv1_nosched)' '
 	test_expect_code 3 flux ion-resource info ${jobid5}
 '
+test_expect_success 'recovery: both modules restart (rv1_nosched->rv1_nosched)' '
+	reload_resource match-format=rv1_nosched \
+	policy=high &&
+	reload_qmanager &&
+	flux module stats sched-fluxion-qmanager &&
+	flux module stats sched-fluxion-resource &&
+	flux ion-resource info ${jobid1} | grep "ALLOCATED" &&
+	flux ion-resource info ${jobid2} | grep "ALLOCATED" &&
+	flux ion-resource info ${jobid3} | grep "ALLOCATED" &&
+	flux ion-resource info ${jobid4} | grep "ALLOCATED"
+'
+
 test_expect_success 'recovery: a cancel leads to a job schedule (rv1_nosched)' '
 	flux job cancel ${jobid1} &&
 	flux job wait-event -t 60 ${jobid5} start
```
I started to look into fixing this issue, but got lost fairly quickly. It seems like we need a way to convert an R object without the `scheduling` key into a JGF representation so it can be passed to the `run()` function to update the graph. It seems like this should be possible using the `rv1exec` reader and then emitting JGF, but I'm unsure of the exact mechanism for that.

Perhaps one of the developers more familiar with Fluxion (@jameshcorbett, @milroy, or @trws) could take a crack at it or offer some advice. Thanks!
Reloading the `sched-fluxion-qmanager` module with a running job is sufficient to reproduce this:

```
2023-01-26T17:33:25.033690Z sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed
2023-01-26T17:33:25.033973Z job-manager.debug[0]: scheduler: hello
2023-01-26T17:33:25.038243Z sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
2023-01-26T17:33:25.038288Z sched-fluxion-qmanager.err[0]: hello: error loading R for id=131481139893239808: No such file or directory
2023-01-26T17:33:25.038351Z sched-fluxion-qmanager.err[0]: handshake_jobmanager: schedutil_hello: No such file or directory
2023-01-26T17:33:25.038363Z sched-fluxion-qmanager.err[0]: handshake: handshake_jobmanager: No such file or directory
2023-01-26T17:33:25.038373Z sched-fluxion-qmanager.err[0]: mod_start: handshake: No such file or directory
2023-01-26T17:33:25.040393Z sched-fluxion-qmanager.debug[0]: service_unregister
2023-01-26T17:33:25.040490Z sched-fluxion-qmanager.crit[0]: module exiting abnormally
```
A test could just leave resource running.
I started looking into this issue. While it is true that Fluxion features the `rv1_nosched` match format and the `rv1exec` reader, Fluxion doesn't support recovery with the `rv1_nosched` writer. In particular, the documentation states:

> By omitting the scheduling key, "rv1_nosched" will result in higher scheduling performance. However, this format will not contain sufficient information to reconstruct the state of sched-fluxion-resource on module reload (as required for system instance failure recovery).
After looking at `rv1_nosched` output, I agree with the documentation. `R_lite` and `nodelist` do not provide unique identifiers to find and update the corresponding resource graph vertices. Is it possible to use `rv1` for the system instance instead, or is the performance too low?
I could be wrong, but I think the issue is the amount of data added to `R` in the `scheduling` key, and the fact that job data, including `R`, will remain in the KVS for a long time at the system instance level, which could contribute to content store growth.
Can you expand on your statement that `R_lite` does not have the right identifiers to find and update existing graph vertices correctly, for those of us who do not know the internals of the graph implementation very well? To the casual observer, it would seem that since Fluxion can read and construct a graph from `R_lite`, it should also be able to reconstruct the necessary information to find those same vertices in an existing graph created from the same format.
The intention appears to have been to use the `rv1` writer for the system instance:

```toml
# system-instance will use full-up rv1 writer
# so that R will contain scheduling key needed
# for failure recovery.
match-format = "rv1"
```
It may be possible to refactor Fluxion to change writers at runtime. If I understand correctly, then under certain restart conditions Fluxion could switch writers from `rv1_nosched` to `rv1`.
> Can you expand on your statement that `R_lite` does not have the right identifiers to find and update existing graph vertices correctly, for those of us who do not know the internals of the graph implementation very well?
Yes, here's the writer output for a job match with the `rv1_nosched` writer:

```json
{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "35"}}], "nodelist": ["node1"], "starttime": 0, "expiration": 3600}}
```
In contrast, JGF provides each resource's unique IDs and graph paths, which allow for full resolution of the vertex:

```json
{"graph": {"nodes": [{"id": "79", "metadata": {"type": "core", "basename": "core", "name": "core35", "id": 35, "uniq_id": 79, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1/core35"}}}, {"id": "7", "metadata": {"type": "socket", "basename": "socket", "name": "socket1", "id": 1, "uniq_id": 7, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1"}}}, {"id": "3", "metadata": {"type": "node", "basename": "node", "name": "node1", "id": 1, "uniq_id": 3, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1"}}}, {"id": "1", "metadata": {"type": "rack", "basename": "rack", "name": "rack0", "id": 0, "uniq_id": 1, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "tiny", "name": "tiny0", "id": 0, "uniq_id": 0, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0"}}}], "edges": [{"source": "7", "target": "79", "metadata": {"name": {"containment": "contains"}}}, {"source": "3", "target": "7", "metadata": {"name": {"containment": "contains"}}}, {"source": "1", "target": "3", "metadata": {"name": {"containment": "contains"}}}, {"source": "0", "target": "1", "metadata": {"name": {"containment": "contains"}}}]}}
```
> that since Fluxion can read and construct a graph from `R_lite`
I didn't know that, and I'm struggling to understand how that happens. Fluxion only supports `grug`, `hwloc`, `jgf`, and `rv1exec` as graph load formats, and of those only `hwloc`, `rv1exec`, and `jgf` for creating the resource graph via `populate_resource_db_acquire`.
> I didn't know that, and I'm struggling to understand how that happens.
Sorry, I meant `rv1exec`.
> Yes, here's the writer output for a job match with the `rv1_nosched` writer:
I'm assuming that `{"rank": "-1"}` is an error?
I think `{"rank": "-1"}` is just an artifact of running this in `resource-query` vs. a running Flux instance.
> Fluxion could switch writers from `rv1_nosched` to `rv1`
What I have in mind is fairly complicated and may not work in the end. It would consist of dumping the resources of running jobs to the KVS via the `rv1` writer when Flux stops. Let me know if there's interest and we can discuss it at a coffee time.
That might work for a planned and orderly shutdown, but we also have to support a restart after a broker crash, in which case this mechanism could not be used.
Could we devise something to put into the `scheduling` key of an `rv1exec` object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them? Naively, would a mapping between `rank` and `uniq_id` work? (I'm actually guessing not, because if that were enough, then Fluxion could create this mapping at runtime, since broker ranks are already unique. However, I guess it cannot hurt to ask.)
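For illustration only, here is a sketch of the kind of compact hint being proposed. The `rank_to_uniq` schema and `lookup_uniq_id` helper are invented for this example (no such key exists in RV1 or any RFC), with `uniq_id` values borrowed from the JGF output earlier in the thread:

```python
# Hypothetical compact "scheduling" hint: per rank, map each resource type's
# logical id to the graph vertex uniq_id. Invented schema for illustration.
hint = {
    "rank_to_uniq": {
        "-1": {"core": {"35": 79}, "socket": {"1": 7}, "node": {"1": 3}},
    }
}

def lookup_uniq_id(hint, rank, rtype, logical_id):
    """Resolve a vertex uniq_id from (rank, resource type, logical id)."""
    return hint["rank_to_uniq"][str(rank)][rtype][str(logical_id)]

print(lookup_uniq_id(hint, -1, "core", 35))  # 79
```

A hint like this stays small (one integer per allocated resource), which is why it might preserve most of the performance benefit of omitting the full JGF.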
If that isn't sufficient, I'd be somewhat interested to know why not, because it seems like the kind of problem we could solve with an index if it's not currently feasible. Do we currently require the full path keys?
> but we also have to support a restart after a broker crash, in which case this mechanism could not be used.
Good point.
> Could we devise something to put into the `scheduling` key of an `rv1exec` object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them?
Yes, I think so. I'll refresh my memory on the graph `update` functions and how they use the global lookups to devise some minimum schema.
This came up again because Fluxion's inability to be reloaded with running jobs is preventing us from reconfiguring resources (e.g. adding or modifying queues) without downtime on production clusters.
As an experiment, I tried setting `match-format = "rv1"` and then seeing if the Fluxion modules can be reloaded with running jobs, but I still get the error:

```
[  +9.794686] job-manager[0]: scheduler: hello
[  +9.797168] sched-fluxion-qmanager[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
[  +9.797231] sched-fluxion-qmanager[0]: raising fatal exception on running job id=63504667937603584
```
The job in question does seem to have JGF in R:

```
$ flux job info 63504667937603584 R | jq .scheduling | head
{
  "graph": {
    "nodes": [
      {
        "id": "17",
        "metadata": {
          "type": "core",
          "basename": "core",
          "name": "core0",
          "id": 0,
```
It is possible I performed the experiment incorrectly, so it would be good if someone else could verify this behavior. If confirmed, we'll need a plan to address this issue in the near term (i.e. this issue just became high priority).
The only line that should be able to produce that error is this one: https://github.com/flux-framework/flux-sched/blob/c65026c3cd27b8a314e686457ae4579acb3d1509/qmanager/modules/qmanager_callbacks.cpp#L172

I haven't had time to go in and experiment with it yet, but this makes me think that the map of queues doesn't get repopulated on a restart; we might not even be getting to the graph.
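For readers unfamiliar with the error text: in C++, `std::map::at` throws `std::out_of_range` when the key is absent, which qmanager surfaces as `ENOENT: map::at`. A minimal Python stand-in for the suspected failure mode (the empty queue map is a hypothesis here, not confirmed behavior):

```python
# Stand-in for qmanager's C++ queue map; the hypothesis is that after a
# module reload it is left empty, so the hello callback's lookup of a
# running job's queue name fails before the resource graph is consulted.
queues = {}  # hypothetically not repopulated after restart

def find_queue(queues, name):
    try:
        return queues[name]  # analogous to queues.at(name) in C++
    except KeyError as exc:
        # surfaced by qmanager as "ENOENT: map::at: No such file or directory"
        raise FileNotFoundError(f"map::at: no such queue {name!r}") from exc
```

With a populated map the lookup succeeds; against the empty map it raises, matching the log lines earlier in the thread.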
Ah, ok, that's a good point and something (hopefully) easy to fix as a start?
I did verify that with no queues enabled and `match-format = "rv1"`, Fluxion can be reloaded with running jobs.
I'll open a separate issue.
Yeah, I don't think that will be all that bad. I'm a bit confused how that can happen, since everything is called in the right order, but if we have a reasonable reproducing test case this should be relatively straightforward. Glancing at it I'm guessing it's something like the list of queues in the initial config got cleared or some generated name changed or... 🤷 If we add in a bit of context to figure out which queue is missing and which queues exist it should fall out quickly.
Just following up on the meeting today:

fluxion gets the initial Rv1 object from the `resource.acquire` RPC, described in RFC 28. The Rv1 object is returned in the first response payload. This is designed so resources are automatically returned when fluxion unloads and can be re-acquired when it loads again. https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1282
Fluxion then uses the `job-manager.hello` RPC to acquire a list of running jobs. That one is described in RFC 27. Although the Rv1 object allocated to each job is not provided in the response payload, the `libschedutil` helper library from flux-core fetches it from the KVS and provides it to the scheduler in the hello callback. Thus the wall clock expiration time for each allocated Rv1 is available. https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/qmanager/modules/qmanager_callbacks.cpp#L123
So my probably naive question is: what required information is missing to allow fluxion to restart with running jobs and `rv1_nosched`?
In our meeting today it was asserted that the Rv1-to-graph uuid mapping would need to be preserved (in a file or the KVS) across a restart in order to map the resources of running jobs to the resource graph. Afterward I'm once again wondering why. The uuids are purely internal to fluxion: fluxion won't receive a uuid from a running job that was allocated by a past instance of fluxion and need to map it to resources. It will receive rv1 with hostnames/ranks and gpu/core logical indices. I don't see how having the old uuid mapping would help.
What am I missing?
I ended up deciding the vertex `uniq_id` (I mistakenly called it a UUID) wasn't necessary to identify the underlying graph vertex. I actually think it is possible to reconstruct the vertices' information from RV1 with hostnames and the logical indices of a rank's children, provided that the original resource graph was also generated by the `rv1exec` reader.
However, edge data (such as exclusivity tracking) can't be reconstructed without a fully specified format like JGF. Here's an example of what the JGF reader does to update the edges: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/resource/readers/resource_reader_jgf.cpp#L1048. That said, I don't know whether leaving those parameters unset will result in undesirable scheduling behavior or only in a loss of optimizations.
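A sketch of the asymmetry, with the path layout assumed from the JGF example earlier in the thread (the helper below is hypothetical, not Fluxion code): vertex identity can be recomputed from the topology, but per-vertex and per-edge metadata such as exclusivity has no counterpart in `R_lite`.

```python
# If the graph was built by the rv1exec reader from the same topology, a
# vertex's containment path can be recomputed from its ancestry and logical
# id (path layout assumed from the JGF example above):
def containment_path(cluster, rack, host, socket, rtype, logical_id):
    return f"/{cluster}/{rack}/{host}/{socket}/{rtype}{logical_id}"

path = containment_path("tiny0", "rack0", "node1", "socket1", "core", 35)
print(path)  # /tiny0/rack0/node1/socket1/core35

# But nothing in R_lite encodes fields like the JGF vertex's
# "exclusive": true flag or the containment edge metadata, so those must
# either be defaulted or carried in a side channel.
```

This is why recovery from `R_lite` can locate vertices (given identical graph construction) yet still lose the edge and exclusivity state that JGF records explicitly.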
Thanks for that explanation! Well, I think having the rv1 reader, even a naive one, would be an excellent near term step since it would let the scheduler be unloaded and reconfigured or even updated without affecting running jobs. When the scheduler is misbehaving, that could be really handy.
Edit: we could always augment that with some side band storage or whatever if it turns out to be required.
The Flux system instance needed to be restarted on tioga recently and there were two active jobs in CLEANUP state. This caused Fluxion to fail to restart with the following errors:
It appears that `parse_R()` in `resource/modules/resource_match.cpp` always requires a `scheduling` key be set with JGF, but this will not be the case for any job when `sched-fluxion-resource.match-format = "rv1_nosched"`. Since Fluxion supports an `rv1exec` reader, `parse_R()` should fall back to this reader when there is no `scheduling` key in R for a job.
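The proposed fallback could look something like this sketch (Python pseudocode standing in for the C++ `parse_R()`; the reader names are Fluxion's, the dispatch logic is the suggestion above):

```python
import json

def pick_reader(R_json):
    """Choose a resource reader for a job's R object on module reload."""
    R = json.loads(R_json)
    if "scheduling" in R:
        return "jgf"      # full graph present: update vertices via JGF
    return "rv1exec"      # no scheduling key: fall back to the rv1exec reader

print(pick_reader('{"version": 1, "scheduling": {"graph": {}}}'))  # jgf
print(pick_reader('{"version": 1, "execution": {}}'))              # rv1exec
```

With this dispatch in place, jobs written by the `rv1_nosched` writer would still be recoverable, modulo the edge/exclusivity caveats discussed above.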