flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Flux stuck during shutdown, `flux queue status -v` shows many jobs running #6344

Open grondo opened 5 days ago

grondo commented 5 days ago

On tuolumne, `flux shutdown` was stuck in `flux queue idle`. There are no running jobs known to job-list, but `flux queue status -v` shows 80 running jobs:

# flux queue status -v | grep running
80 running jobs
# flux jobs --stats-only
0 running, 
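
The mismatch above can be checked mechanically. This is a minimal Python sketch, assuming both commands print a count in the form `<N> running` as in the output shown; the helper names `running_count` and `counts_disagree` are hypothetical and not part of Flux.

```python
import re


def running_count(text):
    """Return the first '<N> running' count found in text, or None."""
    match = re.search(r"(\d+)\s+running", text)
    return int(match.group(1)) if match else None


def counts_disagree(queue_status_text, jobs_stats_text):
    """True when both outputs report a running count and the counts differ."""
    queue_count = running_count(queue_status_text)
    list_count = running_count(jobs_stats_text)
    if queue_count is None or list_count is None:
        return False  # nothing to compare
    return queue_count != list_count


# Sample lines copied from the output above (assumed format).
queue_status_line = "80 running jobs"
jobs_stats_line = "0 running, "
mismatch = counts_disagree(queue_status_line, jobs_stats_line)
```

In practice the two strings would come from capturing the output of `flux queue status -v` and `flux jobs --stats-only`; the sketch only covers the comparison.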

One thing that I notice in the logs is that nodes were shut down while housekeeping was running.

I have no idea if that is related.

I can't seem to get any more information out of the system, so I'm just going to kill off the `flux queue idle` process and let the system shut down.

grondo commented 5 days ago

Example of housekeeping errors:

job-manager.err[0]: housekeeping: tuolumneXXX (rank XXX) fALfUWKpRdZ: No route to host
job-manager.err[0]: housekeeping: tuolumneYYY (rank YYY) fALfUW3WZXm: No route to host
job-manager.err[0]: housekeeping: tuolumneZZZ (rank ZZZ) fALfQAq4Fvo: No route to host

garlick commented 5 days ago

Wow, that is really strange. It's almost as if the job manager's running job count is just wrong. Looking through that code, it's hard to see how it could be, at least not without the job-list count also being wrong, since both counts are driven by job events.
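
To illustrate the reasoning, here is a simplified Python sketch of two counters fed by a job event stream. It is not the job-manager or job-list code: the event names `start` and `finish` and the function `count_running` are hypothetical stand-ins. It only shows that two consumers of the same events agree, and that a count is left too high if one consumer never sees the event that ends a job.

```python
def count_running(events):
    """Replay (job_id, event_name) pairs and count jobs still running."""
    running = set()
    for job_id, name in events:
        if name == "start":        # hypothetical event: job begins running
            running.add(job_id)
        elif name == "finish":     # hypothetical event: job stops running
            running.discard(job_id)
    return len(running)


events = [
    ("A", "start"),
    ("B", "start"),
    ("A", "finish"),
    ("B", "finish"),
]

# A consumer that sees every event ends with nothing running.
full_view = count_running(events)

# A consumer that never sees B's final event still counts B as running.
lossy_view = count_running([e for e in events if e != ("B", "finish")])
```

Under this model, a nonzero count in one place and zero in the other would mean the two consumers did not process the same sequence of events, or handled one of them differently.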

The housekeeping errors are probably to be expected and shouldn't be related to the running job count since housekeeping starts when the job transitions to INACTIVE.

grondo commented 5 days ago

> The housekeeping errors are probably to be expected and shouldn't be related to the running job count since housekeeping starts when the job transitions to INACTIVE.

Ok, I only mentioned that since it was the only error I saw in the logs.