Closed jchodera closed 2 years ago
Our LSF JOB_TERMINATE_INTERVAL
is set to 60 seconds, suggesting we're running into one of three issues here:
SIGINT
), orSIGINT
I will dig into the manager code, but from what I know, the managers do not directly deal with files - that is more the domain of QCEngine and the individual harnesses.
But that just moves your question. I will look at qcengine as well. Do you know which quantum program was running?
@bennybp I can take this on if you like; already investigating the issue with a local manager. It does appear that for e.g. psi4
optimizations a SIGTERM sent to the manager fails to kill the optimization.
I'm chasing the logic down through the manager code to see if the signal isn't being passed through to the process pool correctly. That is my working hypothesis for where the ball is being dropped.
@dotsdl : Would be great if you can investigate this and see if we can gracefully handle SIGINT
to trigger graceful shutdown and cleanup of temporary files across QCFractal -> QCEngine -> psi4.
@jchodera I'm on it. Aiming to identify where the signal isn't being passed downward, followed by a solution PR.
I believe I've identified issue. In ExecutorAdaptor.close, all futures in the queue are cancelled. However, according to the concurrent.futures docs, cancelling does nothing for tasks that are already executing. This makes sense, and is also consistent with the behavior we are seeing. For long-running tasks such as optimizations, this means the manager is kept alive until the optimization finishes.
I see at least two reasonable solutions to this.
concurrent.futures.ProcessPoolExecutor
for multiprocessing.pool.Pool
. concurrent.futures
doesn't expose a way to kill workers with inflight tasks; multiprocessing
does.These are not mutually-exclusive. I've desired (2) for some time, and could implement this as a fallback if a process pool causes issues and we are only using one worker at a time anyway, like we do on e.g. Lilac in most cases.
We're having some trouble where jobs killed by LSF are not cleaning up their temporary files. I notice that this method can register a signal handler and gracefully clean up when a signal is received.
According to this page, the LSF job manager first sends a SIGINT, waits a bit, then sends a SIGKILL, and finally a SIGTERM.
Does QCFractal already gracefully intercept SIGINT and clean up its temporary files (e.g. psi4 tempfiles) and exit? Or would this behavior have to be added?
cc: @dotsdl