Open grondo opened 5 days ago
Example of housekeeping errors:
job-manager.err[0]: housekeeping: tuolumneXXX (rank XXX) fALfUWKpRdZ: No route to host
job-manager.err[0]: housekeeping: tuolumneYYY (rank YYY) fALfUW3WZXm: No route to host
job-manager.err[0]: housekeeping: tuolumneZZZ (rank ZZZ) fALfQAq4Fvo: No route to host
Wow that is really strange. It's almost as if the job manager's running job count is just wrong. Looking through that code, it's hard to see how it could be, at least not without the job-list count also being wrong since both counts are driven by job events.
The housekeeping errors are probably to be expected and shouldn't be related to the running job count since housekeeping starts when the job transitions to INACTIVE.
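A toy sketch of the reasoning above, not Flux code: if two consumers derive a running-job count from the same stream of job state events, they can only disagree if one of them misses or mishandles an event. The class name, job IDs, and the two-state model (`RUN`, `INACTIVE`) are simplifying assumptions made for illustration.

```python
class EventDrivenCount:
    """Hypothetical consumer that tracks running jobs purely from state events."""

    def __init__(self):
        self.running = set()

    def on_event(self, jobid, state):
        # Simplified model: a job counts as running from RUN until INACTIVE.
        if state == "RUN":
            self.running.add(jobid)
        elif state == "INACTIVE":
            self.running.discard(jobid)

    def count(self):
        return len(self.running)


# Made-up event stream: two jobs start, then both finish.
events = [
    ("job1", "RUN"),
    ("job2", "RUN"),
    ("job1", "INACTIVE"),
    ("job2", "INACTIVE"),
]

# Two consumers fed the identical stream end with identical counts.
manager_view = EventDrivenCount()
list_view = EventDrivenCount()
for jobid, state in events:
    manager_view.on_event(jobid, state)
    list_view.on_event(jobid, state)
agree = (manager_view.count(), list_view.count())

# A consumer that misses one INACTIVE transition keeps a stale running job.
lossy_view = EventDrivenCount()
for jobid, state in events:
    if (jobid, state) == ("job2", "INACTIVE"):
        continue  # simulate a missed event
    lossy_view.on_event(jobid, state)
```

In this model a stuck nonzero count needs a lost or mishandled transition on one side only, which is why a mismatch between the two counts is surprising.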
Ok, I only mentioned that since it was the only error I saw in the logs.
On tuolumne, `flux shutdown` was stuck in `flux queue idle`. There are no running jobs known to `job-list`, but `flux queue status -v` shows 80 running jobs.
One thing that I notice in the logs is that nodes were shut down while housekeeping was running. I have no idea if that is related.
I can't seem to get any more information out of the system, so I'm just going to kill off the `flux queue idle` process and let the system shut down.