Open kkier opened 1 month ago
One thing we talked about when housekeeping was being designed was representing housekeeping as a separate system job running as the flux user. I started to implement that but found it was more challenging than expected, and we needed something to staunch the bleeding on el cap. Maybe reviving that could address this issue without introducing unnecessary coupling between the original job and housekeeping.
I haven't looked into how difficult this would be, but maybe `flux resource status` could be augmented to show nodes that are currently in housekeeping? That might be more straightforward than trying to represent housekeeping as a job, and we do already support the ephemeral torpid state in `flux resource status` as a precedent.
I can't decide which is the "right" approach here, though. Representing the housekeeping workload as a job is an attractive option, but since it isn't a job, so many "job" things wouldn't work that it seems like it could cause more trouble down the road...
Hmm yeah, that is partly what made the job idea hard. Going all the way and making it a real job with all the trimmings seems like overkill.
It is trivial to add a new `housekeeping` state to `flux resource status`, e.g.:
    $ src/cmd/flux resource status
    STATE         UP  NNODES  NODELIST
    avail         ✔      101  corona[171,173-186,188-194,196-207,213-214,219,221-230,232-250,252-253,255-259,261-269,272-275,277-278,280,282-285,287-290,292-294,296]
    exclude       ✔        4  corona[81-82,211-212]
    exclude*      ✗        1  corona260
    housekeeping  ✔        2  corona[189-190]
    drained*      ✗        7  corona[172,187,231,254,279,291,295]
    drained       ✔       12  corona[195,215-218,220,251,270-271,276,281,286]
Note that the `avail` state in `flux resource status` doesn't mean "available to jobs now", but rather the nodes that are not excluded, drained, or (in v0.66.0 and later) currently marked as torpid. So the `avail` node set does include the `housekeeping` node set, unless we think that should change?
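To make the overlap concrete, here is a rough Python sketch that expands the two nodelists from the output above and checks that the `housekeeping` set sits inside `avail`. The `expand_hostlist` helper is hypothetical and only handles the simple `prefix[ranges]` form shown here; it is not the Flux hostlist implementation.

```python
import re


def expand_hostlist(hostlist: str) -> set[str]:
    """Expand 'corona[1-3,5]' or 'corona260' into a set of host names.

    Hypothetical helper for illustration: handles a single prefix with one
    bracketed group of plain (non zero-padded) indices, nothing fancier.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", hostlist)
    if match is None:
        # No brackets: treat the whole string as one host name.
        return {hostlist}
    prefix, ranges = match.groups()
    hosts = set()
    for part in ranges.split(","):
        first, _, last = part.partition("-")
        last = last or first  # a single index such as "171"
        hosts.update(f"{prefix}{n}" for n in range(int(first), int(last) + 1))
    return hosts


# Nodelists copied from the sample `flux resource status` output above.
avail = expand_hostlist(
    "corona[171,173-186,188-194,196-207,213-214,219,221-230,232-250,"
    "252-253,255-259,261-269,272-275,277-278,280,282-285,287-290,"
    "292-294,296]"
)
housekeeping = expand_hostlist("corona[189-190]")

# Both housekeeping nodes are also counted in avail.
print(housekeeping <= avail)
print(sorted(housekeeping & avail))
```

The expanded `avail` set has 101 entries, matching the NNODES column, and both housekeeping nodes fall inside the `188-194` range.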
Also, for some reason, `flux resource status` doesn't currently support listing `allocated` nodes, even though it has the information. If we enable that, then `flux resource status -s all` could be used to display all states including any `allocated` nodes (which includes those in `housekeeping`).
At this point, I'm not sure if the above is helpful or not. It is easy to include, and we could leave both `housekeeping` and `allocated` out of the default output. To get full details you would then run `flux resource status -s all` and be cognizant of the fact that some of the sets may overlap.
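As a sketch of what "being cognizant of the overlap" might look like for someone scripting against that output, the snippet below takes per-state node sets and reports which states share nodes. The node names and the `overlapping_states` helper are made up for illustration, and it assumes allocated nodes are always a subset of `avail`, which may not hold in every case.

```python
from itertools import combinations


def overlapping_states(states: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return the nodes shared by each pair of states, skipping disjoint pairs."""
    overlaps = {}
    for a, b in combinations(sorted(states), 2):
        common = states[a] & states[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical node sets, shaped like what `-s all` might report.
states = {
    "avail": {"n1", "n2", "n3", "n4", "n5"},
    "allocated": {"n2", "n3"},   # includes nodes in housekeeping
    "housekeeping": {"n3"},
    "drained": {"n6"},
}

overlaps = overlapping_states(states)
for pair, nodes in overlaps.items():
    print(pair, sorted(nodes))

# Nodes a new job could plausibly use right now, under the assumption above.
free_now = states["avail"] - states["allocated"]
print(sorted(free_now))
```

Summing NNODES across states would double count `n2` and `n3` here, which is the kind of surprise the default output would avoid by leaving those two states out.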
I'm open to any feedback here.
There's a corner case we keep running into (which I guess makes it a very wide corner):

- `flux resource status` shows N nodes available
- `flux jobs -A` shows no running jobs

I realize this is covered by also checking `flux resource list` for nodes marked as allocated, so the change would be more of a QoL improvement for users and admins than anything else. But since it's my life we're talking about, naturally I'm all for it.

Possible implementation: new job state between CLEANUP and INACTIVE.