flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Suggestion: in flux jobs, show jobs with nodes still in housekeeping #6248

Open kkier opened 1 month ago

kkier commented 1 month ago

There's a corner case we keep running into (which I guess makes it a very wide corner):

I realize this is covered by also checking flux resource list for nodes marked as allocated, so the change would be more of a QoL improvement for users and admins than anything else. But since it's my life we're talking about, naturally I'm all for it.

Possible implementation - new job state between CLEANUP and INACTIVE.

garlick commented 1 month ago

One thing we talked about when housekeeping was being designed was representing housekeeping as a separate system job running as the flux user. I started to implement that but found it was more challenging than expected, and we needed something to staunch the bleeding on el cap. Maybe reviving that could address this issue without introducing unnecessary coupling between the original job and housekeeping.

grondo commented 1 month ago

I haven't looked into how difficult this would be, but maybe flux resource status could be augmented to show nodes that are currently in housekeeping? That might be more straightforward than trying to represent housekeeping as a job, and we do already support the ephemeral torpid state in flux resource status as a precedent.

I can't decide which is the "right" approach here, though. Representing the housekeeping workload as a job is an attractive option, but since it isn't actually a job, so many "job" things wouldn't work that it seems like it could cause more trouble down the road...

garlick commented 1 month ago

Hmm yeah, that is partly what made the job idea hard. Going all the way and making it a real job with all the trimmings seems like overkill.

grondo commented 1 month ago

It is trivial to add a new housekeeping state to flux resource status, e.g.:

$ src/cmd/flux resource status
       STATE UP NNODES NODELIST
       avail  ✔    101 corona[171,173-186,188-194,196-207,213-214,219,221-230,232-250,252-253,255-259,261-269,272-275,277-278,280,282-285,287-290,292-294,296]
     exclude  ✔      4 corona[81-82,211-212]
    exclude*  ✗      1 corona260
housekeeping  ✔      2 corona[189-190]
    drained*  ✗      7 corona[172,187,231,254,279,291,295]
     drained  ✔     12 corona[195,215-218,220,251,270-271,276,281,286]

Note that the avail state in flux resource status doesn't mean "available to jobs now". It means the nodes that are not excluded, drained, or (in v0.66.0 and later) currently marked as torpid. So the avail node set does include the housekeeping node set, unless we think that should change?
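As a rough sketch of that definition (this is not flux-core code; the function name `avail_nodes` and the sample node sets are made up for illustration), avail can be thought of as plain set subtraction, which is why nodes in housekeeping stay in it:

```python
def avail_nodes(all_nodes, excluded, drained, torpid):
    """Return the nodes that are not excluded, drained, or torpid.

    Hypothetical helper: it only mirrors the definition of "avail"
    described above, not how flux resource status is implemented.
    """
    return set(all_nodes) - set(excluded) - set(drained) - set(torpid)


# Illustrative data only: a small slice of nodes, one of them drained.
all_nodes = {f"corona{i}" for i in range(186, 193)}  # corona186..corona192
excluded = set()
drained = {"corona187"}
torpid = set()
housekeeping = {"corona189", "corona190"}

avail = avail_nodes(all_nodes, excluded, drained, torpid)

# Housekeeping is not subtracted, so those nodes are still counted as avail.
print(sorted(avail))
print(housekeeping <= avail)
```

Under this reading, taking housekeeping nodes out of avail would just mean subtracting one more set.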

Also, for some reason, flux resource status doesn't currently support listing allocated nodes, even though it has the information. If we enable that, then flux resource status -s all could be used to display all states, including any allocated nodes (which include those in housekeeping).

At this point, I'm not sure whether the above is helpful. It would be easy to include, perhaps leaving both housekeeping and allocated out of the default output. To get full details you would then need to run flux resource status -s all and be cognizant of the fact that some of the sets may overlap.
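To make the overlap concern concrete, here is a minimal sketch of how a reader of that output could check which state sets share nodes. It is an assumption-laden illustration: `expand` is a deliberately simplified hostlist parser (no zero-padding, one bracket group only), not a real hostlist library, and the state sets are small excerpts of the example output above.

```python
import re
from itertools import combinations


def expand(hostlist):
    """Expand a simple 'prefix[1,3-5]' or 'prefix7' hostlist into a set of names.

    Simplified for illustration: one bracket group, no zero-padded indices.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([\d,\-]+)\]", hostlist)
    if not m:
        return {hostlist}
    prefix, ranges = m.groups()
    hosts = set()
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hosts.update(f"{prefix}{i}" for i in range(int(lo), int(hi or lo) + 1))
    return hosts


def overlaps(states):
    """Map each pair of state names to the nodes they have in common."""
    result = {}
    for (a, nodes_a), (b, nodes_b) in combinations(states.items(), 2):
        common = nodes_a & nodes_b
        if common:
            result[(a, b)] = common
    return result


# Excerpts of the example listing above (avail is truncated for brevity).
states = {
    "avail": expand("corona[188-194]"),
    "housekeeping": expand("corona[189-190]"),
    "drained": expand("corona[195,215-218]"),
}

shared = overlaps(states)
for (a, b), nodes in shared.items():
    print(f"{a} and {b} share: {sorted(nodes)}")
```

With these excerpts, the only overlapping pair is avail and housekeeping, which matches the point that a node can appear in more than one set.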

I'm open to any feedback here.