flux-framework / flux-sched

Fluxion Graph-based Scheduler

Fluxion can't restart running jobs with match-format `rv1_nosched` #991

Closed: grondo closed this issue 1 week ago

grondo commented 1 year ago

The Flux system instance on tioga recently needed to be restarted while two active jobs were in CLEANUP state. Fluxion then failed to restart with the following errors:

[  +0.001134] job-manager[0]: scheduler: hello
[  +0.002056] sched-fluxion-resource[0]: parse_R: no scheduling key in R
[  +0.002077] sched-fluxion-resource[0]: run_update: parsing R: No such file or directory
[  +0.002084] sched-fluxion-resource[0]: update_request_cb: update failed (id=138630396509161472): No such file or directory
[  +0.002316] sched-fluxion-qmanager[0]: jobmanager_hello_cb: reconstruct (id=138630396509161472 queue=default): No such file or directory
[  +0.002330] sched-fluxion-qmanager[0]: hello: error loading R for id=138630396509161472: No such file or directory
[  +0.002347] sched-fluxion-qmanager[0]: handshake_jobmanager: schedutil_hello: No such file or directory
[  +0.002352] sched-fluxion-qmanager[0]: handshake: handshake_jobmanager: No such file or directory
[  +0.002362] sched-fluxion-qmanager[0]: mod_start: handshake: No such file or directory
[  +0.002728] sched-fluxion-qmanager[0]: service_unregister
[  +0.002759] sched-fluxion-qmanager[0]: module exiting abnormally

It appears that parse_R() in resource/modules/resource_match.cpp always requires that a scheduling key containing JGF be present in R, but that will not be the case for any job when sched-fluxion-resource.match-format = "rv1_nosched".

Since Fluxion supports an rv1exec reader, parse_R() should fall back to this reader when there is no scheduling key in R for a job.
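For illustration, here is a minimal, self-contained sketch of that fallback logic. The reader interface and names below (reader_base, jgf_reader, rv1exec_reader, select_reader) are invented stand-ins for Fluxion's real readers in resource/readers/, and this is not the actual parse_R() code; only the jansson calls are real API.

// Sketch only: choose a reader based on whether R carries a "scheduling"
// (JGF) key, falling back to an rv1exec-style reader when it does not.
// Build with: g++ -std=c++14 sketch.cpp $(pkg-config --cflags --libs jansson)
#include <memory>
#include <string>
#include <iostream>
#include <jansson.h>

// Hypothetical stand-in for a Fluxion resource reader interface.
struct reader_base {
    virtual ~reader_base () = default;
    virtual int update_graph (const std::string &R) = 0;
};

struct jgf_reader : reader_base {
    int update_graph (const std::string &) override {
        // would parse the scheduling.graph (JGF) section of R
        return 0;
    }
};

struct rv1exec_reader : reader_base {
    int update_graph (const std::string &) override {
        // would parse execution.R_lite and the nodelist instead
        return 0;
    }
};

static std::unique_ptr<reader_base> select_reader (const std::string &R)
{
    json_error_t err;
    json_t *o = json_loads (R.c_str (), 0, &err);
    if (!o)
        return nullptr;                       // malformed R
    bool has_sched = json_object_get (o, "scheduling") != nullptr;
    json_decref (o);
    if (has_sched)
        return std::make_unique<jgf_reader> ();
    // No scheduling key (match-format = "rv1_nosched"): fall back to rv1exec.
    return std::make_unique<rv1exec_reader> ();
}

int main ()
{
    std::string R = R"({"version":1,"execution":{"R_lite":[]}})";
    auto reader = select_reader (R);
    if (!reader) {
        std::cerr << "invalid R\n";
        return 1;
    }
    return reader->update_graph (R);
}

In Fluxion itself the selection would presumably happen inside parse_R()/run_update(), reusing the existing rv1exec reader rather than the stand-ins above.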

grondo commented 1 year ago

I verified that this case (reloading sched-fluxion-resource with match-format=rv1_nosched) is missing from the Fluxion testsuite. Strangely, many similar cases are tested, so I wonder if the omission was intentional?

Anyway, adding this test to the testsuite demonstrates the issue:

diff --git a/t/t1007-recovery-full.t b/t/t1007-recovery-full.t
index 5b2de636..0923eeaa 100755
--- a/t/t1007-recovery-full.t
+++ b/t/t1007-recovery-full.t
@@ -150,6 +150,18 @@ test_expect_success 'recovery: qmanager restarts (rv1_nosched->rv1_nosched)' '
     test_expect_code 3 flux ion-resource info ${jobid5}
 '

+test_expect_success 'recovery: both modules restart (rv1_nosched->rv1_nosched)' '
+    reload_resource match-format=rv1_nosched \
+    policy=high &&
+    reload_qmanager &&
+    flux module stats sched-fluxion-qmanager &&
+    flux module stats sched-fluxion-resource &&
+    flux ion-resource info ${jobid1} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid2} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid3} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid4} | grep "ALLOCATED"
+'
+
 test_expect_success 'recovery: a cancel leads to a job schedule (rv1_nosched)' '
     flux job cancel ${jobid1} &&
     flux job wait-event -t 60 ${jobid5} start

grondo commented 1 year ago

I started to look into fixing this issue but got lost fairly quickly. It seems we need a way to convert an R object without the scheduling key into a JGF representation so it can be passed to the run() function to update the graph. This should be possible using the rv1exec reader and then emitting JGF, but I'm unsure of the exact mechanism for that.

Perhaps one of the developers more familiar with Fluxion (@jameshcorbett, @milroy, or @trws) could take a crack at it or offer some advice. Thanks!

garlick commented 1 year ago

Reloading the sched-fluxion-qmanager module with a running job is sufficient to reproduce this:

2023-01-26T17:33:25.033690Z sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed
2023-01-26T17:33:25.033973Z job-manager.debug[0]: scheduler: hello
2023-01-26T17:33:25.038243Z sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
2023-01-26T17:33:25.038288Z sched-fluxion-qmanager.err[0]: hello: error loading R for id=131481139893239808: No such file or directory
2023-01-26T17:33:25.038351Z sched-fluxion-qmanager.err[0]: handshake_jobmanager: schedutil_hello: No such file or directory
2023-01-26T17:33:25.038363Z sched-fluxion-qmanager.err[0]: handshake: handshake_jobmanager: No such file or directory
2023-01-26T17:33:25.038373Z sched-fluxion-qmanager.err[0]: mod_start: handshake: No such file or directory
2023-01-26T17:33:25.040393Z sched-fluxion-qmanager.debug[0]: service_unregister
2023-01-26T17:33:25.040490Z sched-fluxion-qmanager.crit[0]: module exiting abnormally

A test could simply leave sched-fluxion-resource loaded and reload only sched-fluxion-qmanager.

milroy commented 1 year ago

I started looking into this issue. While it is true that Fluxion features the rv1_nosched match format and the rv1exec reader, Fluxion doesn't support recovery with the rv1_nosched writer. In particular, the documentation states:

By omitting the scheduling key, "rv1_nosched" will result in higher scheduling performance. However, this format will not contain sufficient information to reconstruct the state of sched-fluxion-resource on module reload (as required for system instance failure recovery).

After looking at rv1_nosched output, I agree with the documentation. R_lite and nodelist do not provide unique identifiers to find and update the corresponding resource graph vertices. Is it possible to use rv1 for the system instance instead, or is the performance too low?

grondo commented 1 year ago

I could be wrong, but I think the issue is the amount of data added to R in the scheduling key, and the fact that job data, including R, will remain in the KVS for a long time at the system instance level, which could contribute to content store growth.

Can you expand on your statement that R_lite does not have the right identifiers to find and update existing graph vertices correctly, for those of us who do not know the internals of the graph implementation very well? To the casual observer, it would seem that since Fluxion can read and construct a graph from R_lite, it should also be able to reconstruct the information necessary to find those same vertices in an existing graph created from the same format.

milroy commented 1 year ago

The intention appears to have been to use the rv1 writer for the system instance:

# system-instance will use full-up rv1 writer
# so that R will contain scheduling key needed
# for failure recovery.
match-format = "rv1"

It may be possible to refactor Fluxion to change writers at runtime. If I understand correctly, then when restarted under certain conditions Fluxion could switch writers from rv1_nosched to rv1.

milroy commented 1 year ago

Can you expand on your statement that R_lite does not have the right identifiers to find and update existing graph vertices correctly, for those of us who do not know the internals of the graph implementation very well?

Yes, here's the writer output for a job match with the rv1_nosched writer:

{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "35"}}], "nodelist": ["node1"], "starttime": 0, "expiration": 3600}}

In contrast, JGF provides each resource's unique ID and graph path, which allow full resolution of the corresponding vertex:

{"graph": {"nodes": [{"id": "79", "metadata": {"type": "core", "basename": "core", "name": "core35", "id": 35, "uniq_id": 79, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1/core35"}}}, {"id": "7", "metadata": {"type": "socket", "basename": "socket", "name": "socket1", "id": 1, "uniq_id": 7, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1"}}}, {"id": "3", "metadata": {"type": "node", "basename": "node", "name": "node1", "id": 1, "uniq_id": 3, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1"}}}, {"id": "1", "metadata": {"type": "rack", "basename": "rack", "name": "rack0", "id": 0, "uniq_id": 1, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "tiny", "name": "tiny0", "id": 0, "uniq_id": 0, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0"}}}], "edges": [{"source": "7", "target": "79", "metadata": {"name": {"containment": "contains"}}}, {"source": "3", "target": "7", "metadata": {"name": {"containment": "contains"}}}, {"source": "1", "target": "3", "metadata": {"name": {"containment": "contains"}}}, {"source": "0", "target": "1", "metadata": {"name": {"containment": "contains"}}}]}}

that since Fluxion can read and construct a graph from R_lite

I didn't know that, and I'm struggling to understand how that happens. Fluxion only supports grug, hwloc, jgf, and rv1exec as graph load formats, and of those, only hwloc, rv1exec, and jgf can be used to create the resource graph via populate_resource_db_acquire().

grondo commented 1 year ago

I didn't know that, and I'm struggling to understand how that happens.

Sorry, I meant rv1exec

grondo commented 1 year ago

Yes, here's the writer output for a job match with the rv1_nosched writer:

I'm assuming that {"rank": "-1"} is an error?

milroy commented 1 year ago

I think {"rank": "-1"} is just an artifact of running this in resource-query vs a running Flux instance.

milroy commented 1 year ago

Fluxion could switch writers from rv1_nosched to rv1

What I have in mind is fairly complicated and may not work in the end. It would consist of dumping the resources of running jobs to the KVS via the rv1 writer when Flux stops. Let me know if there's interest, and we can discuss it at a coffee time.

grondo commented 1 year ago

That might work for a planned and orderly shutdown, but we also have to support a restart after a broker crash, in which case this mechanism could not be used.

Could we devise something to put into the scheduling key of an rv1exec object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them? Naively, would a mapping between rank and unique_id work? (I'm actually guessing not, because if that were enough, Fluxion could create the mapping at runtime since broker ranks are already unique; however, it can't hurt to ask.)

trws commented 1 year ago

If that isn’t sufficient I’d be somewhat interested to know why not, because it seems like the kind of problem we could solve with an index if it’s not currently feasible. Do we currently require the full path keys?

milroy commented 1 year ago

but we also have to support a restart after a broker crash, in which case this mechanism could not be used.

Good point.

Could we devise something to put into the scheduling key of an rv1exec object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them?

Yes, I think so. I'll refresh my memory on the graph update functions and how they use the global lookups to devise some minimum schema.

grondo commented 11 months ago

This came up again because the inability to reload Fluxion with running jobs is preventing us from reconfiguring resources (e.g. adding or modifying queues) without downtime on production clusters.

As an experiment, I tried setting match-format = "rv1" and then checking whether the Fluxion modules could be reloaded with running jobs, but I still get this error:

[  +9.794686] job-manager[0]: scheduler: hello
[  +9.797168] sched-fluxion-qmanager[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
[  +9.797231] sched-fluxion-qmanager[0]: raising fatal exception on running job id=63504667937603584

The job in question does seem to have JGF in R:

$ flux job info 63504667937603584 R | jq .scheduling | head
{
  "graph": {
    "nodes": [
      {
        "id": "17",
        "metadata": {
          "type": "core",
          "basename": "core",
          "name": "core0",
          "id": 0,

grondo commented 11 months ago

It is possible I performed the experiment incorrectly, so it would be good if someone else could verify this behavior. If it holds, we'll need a plan to address this issue in the near term (i.e., this issue just got high priority).

trws commented 11 months ago

The only line that should be able to produce that error is this one: https://github.com/flux-framework/flux-sched/blob/c65026c3cd27b8a314e686457ae4579acb3d1509/qmanager/modules/qmanager_callbacks.cpp#L172

I haven't had time to go in and experiment with it yet, but this makes me think that the map of queues doesn't get repopulated on a restart; we might not even be getting to the graph.

grondo commented 11 months ago

Ah, ok, that's a good point and something (hopefully) easy to fix as a start?

I did verify that with no queues enabled and match-format = "rv1", Fluxion can be reloaded with running jobs.

I'll open a separate issue.

trws commented 11 months ago

Yeah, I don't think that will be all that bad. I'm a bit confused about how that can happen, since everything is called in the right order, but if we have a reasonable reproducing test case this should be relatively straightforward. Glancing at it, I'm guessing it's something like the list of queues in the initial config getting cleared, or some generated name changing, or... 🤷 If we add a bit of context to figure out which queue is missing and which queues exist (as sketched below), it should fall out quickly.
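For what it's worth, the "add a bit of context" idea could look roughly like the sketch below: replace the bare map::at() (which throws std::out_of_range and surfaces as the opaque "map::at: No such file or directory" error in the logs) with a checked lookup that reports the missing queue name and the queues that do exist. The queue_policy type, queue_map_t alias, and find_queue helper are illustrative stand-ins, not the actual qmanager code.

// Sketch only: a checked queue lookup with diagnostic context.
#include <map>
#include <memory>
#include <string>
#include <sstream>
#include <iostream>

struct queue_policy { /* stand-in for Fluxion's per-queue policy object */ };
using queue_map_t = std::map<std::string, std::shared_ptr<queue_policy>>;

static std::shared_ptr<queue_policy>
find_queue (const queue_map_t &queues, const std::string &name)
{
    auto it = queues.find (name);
    if (it != queues.end ())
        return it->second;

    // Not found: report which queue was requested and which queues exist.
    std::ostringstream known;
    bool first = true;
    for (const auto &kv : queues) {
        known << (first ? "" : ", ") << kv.first;
        first = false;
    }
    std::cerr << "hello: job refers to unknown queue '" << name
              << "' (known queues: "
              << (queues.empty () ? std::string ("none") : known.str ())
              << ")\n";
    return nullptr;
}

int main ()
{
    queue_map_t queues = { { "batch", std::make_shared<queue_policy> () } };
    // Simulates the restart case where a job's queue is no longer configured.
    return find_queue (queues, "default") ? 0 : 1;
}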

garlick commented 4 months ago

Just following up on the meeting today:

fluxion gets the initial Rv1 object from the resource.acquire RPC, described in RFC 28. The Rv1 object is returned in the first response payload. This is designed so resources are automatically returned when fluxion unloads and can be re-acquired when it loads again. https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/resource/modules/resource_match.cpp#L1282

Fluxion then uses the job-manager.hello RPC to acquire a list of running jobs. That one is described in RFC 27. Although the Rv1 object allocated to each job is not provided in the response payload, the libschedutil helper library from flux-core fetches it from the kvs and provides it to the scheduler in the hello callback. Thus the wall clock expiration time for each allocated Rv1 is available. https://github.com/flux-framework/flux-sched/blob/fe872c8dc056934e4073b5fb2932335bb69ca73a/qmanager/modules/qmanager_callbacks.cpp#L123
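As a point of reference, here is a minimal sketch of the first step described above: obtaining the initial Rv1 object from the streaming resource.acquire RPC (RFC 28) using the flux-core C API. This is not Fluxion's actual acquisition code, error handling is abbreviated, and only one consumer may hold the acquisition at a time, so running it while a scheduler is loaded would fail.

// Sketch only: fetch the first resource.acquire response, whose payload
// carries the full resource set under the "resources" key per RFC 28.
#include <cstdio>
#include <cstdlib>
#include <cerrno>
#include <flux/core.h>
#include <jansson.h>

int main ()
{
    flux_t *h = flux_open (NULL, 0);
    if (!h)
        return 1;

    // resource.acquire is a streaming RPC serviced on rank 0.
    flux_future_t *f = flux_rpc (h, "resource.acquire", NULL, 0,
                                 FLUX_RPC_STREAMING);
    json_t *resources = NULL;   // borrowed reference owned by the future
    if (!f || flux_rpc_get_unpack (f, "{s:o}", "resources", &resources) < 0) {
        fprintf (stderr, "resource.acquire: %s\n",
                 f ? future_strerror (f, errno) : "request failed");
        flux_future_destroy (f);
        flux_close (h);
        return 1;
    }

    // A scheduler would hand this Rv1 object to its resource reader to
    // populate the graph; here we just print it.
    char *s = json_dumps (resources, JSON_COMPACT);
    printf ("initial R: %s\n", s ? s : "(null)");
    free (s);

    flux_future_destroy (f);
    flux_close (h);
    return 0;
}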

So my probably naive question is: what required information is missing to allow fluxion to restart with running jobs and rv1_nosched?

garlick commented 3 weeks ago

In our meeting today it was asserted that the Rv1-to-graph-uuid mapping would need to be preserved (in a file or the KVS) across a restart in order to map the resources of running jobs to the resource graph. Afterward I'm once again wondering why. The uuids are purely internal to fluxion: fluxion won't receive a uuid from a running job that was allocated by a past instance of fluxion and need to map it to resources. It will receive rv1 with hostnames/ranks and gpu/core logical indices. I don't see how having the old uuid mapping would help.

What am I missing?

milroy commented 3 weeks ago

I ended up deciding the vertex uniq_id (I mistakenly called it a UUID) wasn't necessary to identify the underlying graph vertex. I actually think it is possible to reconstruct the vertices' information from RV1 with hostnames and the logical indices of a rank's children, provided that the original resource graph was also generated by the rv1exec reader.

However, edge data (such as exclusivity tracking) can't be reconstructed without a fully specified format like JGF. Here's an example of what the JGF reader does to update the edges: https://github.com/flux-framework/flux-sched/blob/c8e03f8416b3d7803c6108de5ea8f15292728568/resource/readers/resource_reader_jgf.cpp#L1048. That said, I don't know whether leaving those parameters unset will result in undesirable scheduling behavior or only in the loss of some optimizations.
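To make the first point concrete, here is a rough sketch of the kind of lookup such a reconstruction would rely on: resolving what R_lite provides (a broker rank plus a resource type and logical index) to a vertex by matching the same fields the rv1exec reader populates when it builds the graph. The vertex struct and find_allocated helper below are simplified stand-ins for Fluxion's boost-graph schema, and, per the caveat above, edge metadata such as exclusivity tracking is not represented at all.

// Sketch only: locate an allocated vertex from rv1exec-style data.
#include <cstdint>
#include <optional>
#include <string>
#include <vector>
#include <iostream>

struct vertex {              // stand-in for a resource graph vertex
    std::string type;        // "node", "core", "gpu", ...
    int64_t id;              // logical id, e.g. core35 -> 35
    int64_t rank;            // broker rank that owns this resource
    std::string path;        // containment path, e.g. /tiny0/rack0/node1/socket1/core35
};

// Find the vertex for an R_lite child, given only rank, type, and logical id.
static std::optional<vertex>
find_allocated (const std::vector<vertex> &graph,
                int64_t rank, const std::string &type, int64_t id)
{
    for (const auto &v : graph) {
        if (v.rank == rank && v.type == type && v.id == id)
            return v;
    }
    return std::nullopt;     // e.g. the graph was built from a different source
}

int main ()
{
    std::vector<vertex> graph = {
        { "node", 1, 0, "/tiny0/rack0/node1" },
        { "core", 35, 0, "/tiny0/rack0/node1/socket1/core35" },
    };
    // An R_lite entry {"rank": "0", "children": {"core": "35"}} resolves to:
    if (auto v = find_allocated (graph, 0, "core", 35))
        std::cout << "allocated vertex: " << v->path << "\n";
    return 0;
}

If the existing graph was built by a different reader (say hwloc rather than rv1exec), the types and logical ids may not line up, which is why the "provided that the original resource graph was also generated by the rv1exec reader" condition matters.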

garlick commented 3 weeks ago

Thanks for that explanation! Well, I think having the rv1 reader, even a naive one, would be an excellent near-term step, since it would let the scheduler be unloaded and reconfigured, or even updated, without affecting running jobs. When the scheduler is misbehaving, that could be really handy.

Edit: we could always augment that with some side band storage or whatever if it turns out to be required.