flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

housekeeping: need a way to indicate partial release in RFC 27 hello protocol #6089

Open garlick opened 4 months ago

garlick commented 4 months ago

Problem: housekeeping cancels in-progress work if the scheduler is reloaded while a housekeeping job that has already released some of its resources is still running.

Following up on an offline discussion with @trws, @milroy, et al. yesterday:

RFC 27 defines the per job hello payload as just e.g.

{
  "id": 1552593348,
  "priority": 43444,
  "userid": 5588,
  "t_submit": 1552593348.073045
}

libschedutil fetches R for job id from the KVS and passes it to the scheduler hello callback.

Hence the question: could we amend RFC 27 to add an optional idset field to the above payload that indicates a mask to apply to R to get the currently allocated resource set? If the field is missing, the entire R would be assumed to be allocated.
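To make the idea concrete, here is a rough sketch of how a consumer of the hello payload might apply such a mask. Everything in it is an assumption for illustration: the field name `ranks` is hypothetical (no name has been agreed on), the mask is assumed to be an idset of broker ranks that are still allocated, R is assumed to be Rv1 with `execution.R_lite` entries keyed by `rank`, and the idset helpers are simplified stand-ins rather than the libidset API.

```python
import copy


def parse_idset(s):
    """Decode a simplified idset string such as "0-3,7" into a set of ints."""
    ids = set()
    for part in filter(None, s.split(",")):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def encode_idset(ids):
    """Encode a set of ints as a simplified idset string, collapsing runs."""
    ranges = []
    for i in sorted(ids):
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1][1] = i
        else:
            ranges.append([i, i])
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def currently_allocated(hello, R):
    """Return a copy of R restricted to the ranks named in the mask.

    "ranks" is a hypothetical field name. If it is absent, the entire R
    is treated as allocated, as proposed above.
    """
    out = copy.deepcopy(R)
    mask = hello.get("ranks")
    if mask is None:
        return out
    keep = parse_idset(mask)
    r_lite = []
    for entry in out["execution"]["R_lite"]:
        ranks = parse_idset(entry["rank"]) & keep
        if ranks:
            r_lite.append({**entry, "rank": encode_idset(ranks)})
    # Note: other parts of R (e.g. a nodelist) are not trimmed in this sketch.
    out["execution"]["R_lite"] = r_lite
    return out


# Example: a job allocated ranks 0-3 that has already released ranks 0-1.
R = {"version": 1,
     "execution": {"R_lite": [{"rank": "0-3", "children": {"core": "0-7"}}]}}
hello = {"id": 1552593348, "priority": 43444, "userid": 5588,
         "t_submit": 1552593348.073045, "ranks": "2-3"}
remaining = currently_allocated(hello, R)
```

With the mask present, `remaining` describes only ranks 2-3. With it absent, the scheduler would see the full R, which matches today's behavior.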

I think this is one of the options discussed but I do not remember what you had to say about it @milroy.

Remember reload with running jobs should be relatively rare, so the cost of this is not really critical IMHO.

garlick commented 4 days ago

This just came up on tuo. Housekeeping couldn't reach a subset of nodes so they were immediately returned to the scheduler. Then the scheduler was reloaded while housekeeping was running on the remainder. Then the hello protocol hit

<jobid> will be terminated because partial release is not supported by RFC 27 hello

which caused the housekeeping units to get a SIGTERM. Then all those nodes were drained.

We need to fix this.