MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License
144 stars 47 forks source link

Are temporary files automatically deleted when SIGINT received? #699

Closed jchodera closed 2 years ago

jchodera commented 2 years ago

We're having some trouble where jobs killed by LSF are not cleaning up their temporary files. I notice that this method can register a signal handler and gracefully clean up when a signal is received.

According to this page, the LSF job manager first sends a SIGINT, waits a bit, then sends a SIGKILL, and finally a SIGTERM.

Does QCFractal already gracefully intercept SIGINT and clean up its temporary files (e.g. psi4 tempfiles) and exit? Or would this behavior have to be added?

cc: @dotsdl

jchodera commented 2 years ago

Our LSF JOB_TERMINATE_INTERVAL is set to 60 seconds, suggesting we're running into one of three issues here:

  1. We're running out of time for file cleanup, and need to see if this should be increased
  2. We're intercepting the wrong signal (we should terminate gracefully on SIGINT), or
  3. We're not actually cleaning up when we intercept SIGINT
bennybp commented 2 years ago

I will dig into the manager code, but from what I know, the managers do not directly deal with files - that is more the domain of QCEngine and the individual harnesses.

But that just moves your question. I will look at qcengine as well. Do you know which quantum program was running?

dotsdl commented 2 years ago

@bennybp I can take this on if you like; already investigating the issue with a local manager. It does appear that for e.g. psi4 optimizations a SIGTERM sent to the manager fails to kill the optimization.

I'm chasing the logic down through the manager code to see if the signal isn't being passed through to the process pool correctly. That is my working hypothesis for where the ball is being dropped.

jchodera commented 2 years ago

@dotsdl : Would be great if you can investigate this and see if we can gracefully handle SIGINT to trigger graceful shutdown and cleanup of temporary files across QCFractal -> QCEngine -> psi4.

dotsdl commented 2 years ago

@jchodera I'm on it. Aiming to identify where the signal isn't being passed downward, followed by a solution PR.

dotsdl commented 2 years ago

I believe I've identified issue. In ExecutorAdaptor.close, all futures in the queue are cancelled. However, according to the concurrent.futures docs, cancelling does nothing for tasks that are already executing. This makes sense, and is also consistent with the behavior we are seeing. For long-running tasks such as optimizations, this means the manager is kept alive until the optimization finishes.

I see at least two reasonable solutions to this.

  1. Swap out concurrent.futures.ProcessPoolExecutor for multiprocessing.pool.Pool. concurrent.futures doesn't expose a way to kill workers with inflight tasks; multiprocessing does.
  2. Create an in-process, serial execution QueueAdapter that doesn't delegate to any separate processes. Use this for e.g. Lilac.

These are not mutually-exclusive. I've desired (2) for some time, and could implement this as a fallback if a process pool causes issues and we are only using one worker at a time anyway, like we do on e.g. Lilac in most cases.