flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

DOWN vertices not reconsidered after coming back UP #1180

Open jameshcorbett opened 1 month ago

jameshcorbett commented 1 month ago

On rzadams I marked a rabbit vertex as DOWN and submitted a job that required a compute node on the same rack as that rabbit. The job, as expected, was stuck in SCHED. I then marked the rabbit vertex as UP, but the job remained stuck in SCHED. A new job went through fine.

It seems to me that the original job should have had its resource request reconsidered at some point?

milroy commented 4 weeks ago

Can you provide some of the job details and resource requests? Did the job have constraints? Also, which resource reader is Fluxion using on rzadams?

milroy commented 4 weeks ago

> Also, which resource reader is Fluxion using on rzadams?

After a bit more thought, it has to be JGF given the use of rabbits.

milroy commented 4 weeks ago

A better question is what is the configured match policy?

jameshcorbett commented 4 weeks ago

I'll check the match policy, but I'll also see if I can reproduce locally.

trws commented 2 weeks ago

Having just gone through this, my best bet is that the resource somehow never got marked as UP in resource. We definitely reconsider jobs when resources change state. In fact, we reconsider all jobs even when a resource is set DOWN, so this is probably either a failure to propagate that state or an issue with the matching.
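
To make the expected behavior concrete, here is a toy Python model of "reconsider pending jobs on every resource state change". It is only a sketch of the behavior described in this thread, not Fluxion's implementation: `ToyScheduler`, its methods, and the vertex names `rabbit0` and `compute0` are all hypothetical, and it ignores allocation, exclusivity, and match policy entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical toy model; not Fluxion's actual data structures."""

    status: dict = field(default_factory=dict)    # vertex name -> "UP" or "DOWN"
    pending: list = field(default_factory=list)   # (job_id, required vertices)
    running: list = field(default_factory=list)   # job ids that matched

    def submit(self, job_id, required):
        """Queue a job, then immediately try to match it."""
        self.pending.append((job_id, frozenset(required)))
        self.reconsider()

    def set_status(self, vertex, new_status):
        """Record a vertex state change and re-run matching for pending jobs."""
        self.status[vertex] = new_status
        # Any state change (UP or DOWN) triggers another pass over the queue.
        self.reconsider()

    def reconsider(self):
        """Start every pending job whose required vertices are all UP."""
        still_pending = []
        for job_id, required in self.pending:
            if all(self.status.get(v) == "UP" for v in required):
                self.running.append(job_id)
            else:
                still_pending.append((job_id, required))
        self.pending = still_pending


# Scenario from the report: the rabbit is DOWN, so the job waits.
sched = ToyScheduler(status={"rabbit0": "DOWN", "compute0": "UP"})
sched.submit("job1", {"rabbit0", "compute0"})
assert sched.running == []

# Marking the rabbit UP should cause the waiting job to be reconsidered.
sched.set_status("rabbit0", "UP")
assert sched.running == ["job1"]
```

In this model the waiting job starts as soon as the state change is recorded. If the real system behaves differently, the two suspects named above map onto the two steps of `set_status`: either the new status never reaches the scheduler's view of the graph, or the matching pass runs but still rejects the job.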