Open garlick opened 4 months ago
This just came up on tuo. Housekeeping couldn't reach a subset of nodes, so those nodes were immediately returned to the scheduler. The scheduler was then reloaded while housekeeping was still running on the remainder, and the hello protocol hit
<jobid> will be terminated because partial release is not supported by RFC 27 hello
which caused the housekeeping units to get a SIGTERM. Then all those nodes were drained.
We need to fix this.
Problem: housekeeping cancels in-progress work if the scheduler is reloaded with a housekeeping job running that has already released some of its resources.
Following up on an offline discussion with @trws, @milroy, et al. yesterday:
RFC 27 defines the per-job hello payload as just e.g.
libschedutil fetches R for job id from the KVS and passes it to the scheduler hello callback. Hence
Could we amend RFC 27 to add an optional idset field to the above payload that indicates a mask to apply to R to get the currently allocated resource set? If missing, assume the entire R is allocated?
I think this is one of the options discussed, but I do not remember what you had to say about it, @milroy.
Remember reload with running jobs should be relatively rare, so the cost of this is not really critical IMHO.
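To make the proposed mask semantics concrete, here is a rough sketch of how a consumer of the hello payload might apply an optional idset to R. Everything here is an assumption for illustration only: the key name `idset`, the choice to interpret the mask as the broker ranks that are still allocated, and the helper names are hypothetical, and the idset parser handles only the simple `0-3,7` form. It is not how libschedutil or any scheduler does it today.

```python
def parse_idset(s):
    """Parse a simple idset string such as "0-3,7" into a set of ints."""
    ids = set()
    for part in s.split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def encode_idset(ids):
    """Encode a set of ints as a compact idset string such as "0-3,7"."""
    ordered = sorted(ids)
    out = []
    i = 0
    while i < len(ordered):
        j = i
        # Extend the run while the next id is consecutive.
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[j] + 1:
            j += 1
        out.append(str(ordered[i]) if i == j else f"{ordered[i]}-{ordered[j]}")
        i = j + 1
    return ",".join(out)


def allocated_r_lite(r_lite, hello):
    """Return the R_lite entries that are still allocated.

    `hello` is the per-job hello payload as a dict. The "idset" key is a
    hypothetical name for the proposed optional field; it is assumed to
    list the ranks of R that are still allocated. If the key is missing,
    the entire R is treated as allocated.
    """
    mask = hello.get("idset")
    if mask is None:
        return r_lite
    keep = parse_idset(mask)
    result = []
    for entry in r_lite:
        ranks = parse_idset(entry["rank"]) & keep
        if ranks:
            # Keep the entry's children, narrowing only its rank idset.
            result.append({**entry, "rank": encode_idset(ranks)})
    return result
```

With an R_lite covering ranks 0-4 and a mask of `2-4`, only the entries for ranks 2-4 would be handed to the scheduler as allocated; with no mask, behavior is unchanged from today.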