flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

job-manager possibly sends alloc requests after jobs have been canceled #6051

Open grondo opened 1 week ago

grondo commented 1 week ago

In flux-framework/flux-sched#1222, @trws observed:

the job-manager processes all the cancels, but keeps sending all the no-longer-valid alloc requests anyway, for something like 10 minutes, before we start getting cancels to remove them.

One proposed theory is:

I wonder if the alloc requests have all already been sent (I think Fluxion uses the unlimited alloc limit, so no alloc requests are "held back") and are sitting in the message queue of the qmanager? We could perhaps check if the behavior is duplicated with sched-simple.

This issue is open to investigate the situation to ensure the job-manager isn't doing something wrong here.
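
To make the queueing theory above concrete, here is a toy model in Python. It is not flux-core or Fluxion code: the function name, the message tuples, and the single FIFO queue are all assumptions made for illustration. It only shows that if every alloc request is sent up front (unlimited alloc limit) and the scheduler handles its queue strictly in order, then cancels sent afterwards sit behind all of the now-stale alloc requests.

```python
from collections import deque


def stale_allocs_before_first_cancel(num_jobs):
    """Hypothetical model of a scheduler's FIFO message queue.

    Returns how many alloc requests the scheduler handles before it
    reaches the first cancel, assuming all allocs were sent first.
    """
    queue = deque()

    # Unlimited alloc limit: no request is held back, so every alloc
    # request is already in the scheduler's queue.
    for job_id in range(num_jobs):
        queue.append(("alloc", job_id))

    # All jobs are then canceled; the cancels queue up behind the allocs.
    for job_id in range(num_jobs):
        queue.append(("cancel", job_id))

    stale = 0
    while queue:
        kind, _job_id = queue.popleft()
        if kind == "cancel":
            break
        stale += 1  # alloc request for a job that was already canceled
    return stale
```

In this model the number of stale alloc requests handled before the first cancel equals the number of jobs, which would look from the scheduler's side like the job-manager "still sending" requests even though it sent them all earlier. Whether this matches the real behavior is exactly what the sched-simple comparison would help establish.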

garlick commented 1 week ago

It may be useful to add more stats to flux module stats job-manager, such as the number of pending sched.alloc and sched.cancel requests, where a pending sched.cancel is defined as a sched.alloc request that is still pending after a cancel request has been sent for it.

I don't see anything that prevents multiple sched.cancel requests from being sent for the same pending alloc request, although I'm not sure in what situation that would occur.
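
A minimal sketch of the bookkeeping this would need, written in Python rather than the job-manager's C. The class and method names are hypothetical and do not correspond to anything in flux-core. It tracks the two suggested counters and also shows one possible guard against sending a second cancel for the same pending alloc request.

```python
class AllocTracker:
    """Hypothetical bookkeeping, not the job-manager's actual data structure."""

    def __init__(self):
        self.pending_alloc = set()  # job ids with an outstanding sched.alloc
        self.cancel_sent = set()    # subset for which sched.cancel was sent

    def send_alloc(self, job_id):
        self.pending_alloc.add(job_id)

    def send_cancel(self, job_id):
        """Return True if a cancel should be sent, False if it would be
        a duplicate or there is no pending alloc request to cancel."""
        if job_id not in self.pending_alloc or job_id in self.cancel_sent:
            return False
        self.cancel_sent.add(job_id)
        return True

    def alloc_response(self, job_id):
        # Any final response to the alloc request retires both records.
        self.pending_alloc.discard(job_id)
        self.cancel_sent.discard(job_id)

    def stats(self):
        return {
            "pending_alloc": len(self.pending_alloc),
            "pending_cancel": len(self.cancel_sent),
        }
```

With counters like these, a large and slowly shrinking pending_cancel value would suggest the requests are waiting in the scheduler's queue, rather than the job-manager sending new alloc requests after the cancels.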