MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License
144 stars 47 forks source link

Replace use of `concurrent.futures` with `multiprocessing` in `ExecutorAdaptor` for hard-termination of tasks #700

Closed dotsdl closed 2 years ago

dotsdl commented 2 years ago

Closes #699.

Swaps out concurrent.futures.ProcessPoolExecutor for multiprocessing.Pool.

We needed a way to hard-terminate processes without waiting for them to complete. multiprocessing gives us this.

Changelog description

ExecutorAdaptor now uses multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor, and hard-terminates tasks on manager shutdown.

Status

dotsdl commented 2 years ago

One note: for the case of optimizations, which can be long-running tasks that we sought to address here, a gradient calculation with e.g. psi4 will not immediately terminate upon exit of the manager. This is due to the fact that killing the processes in the pool does not kill any child processes, such as those started via subprocess.Popen.

I don't see a clean way to propagate the SIGTERM that qcengine.compute_procedure receives in the process pool down through Psi4Harness.compute and into the subprocess wait block to terminate the child process. However, the current solution at least eliminates the manager immediately so that file cleanup can occur shortly after. The remaining child process is orphaned and will be handled by PID 1 on the host.

dotsdl commented 2 years ago

Edit: I do have a solution that can cleanly terminate the underlying subprocess.Popen. Will formulate as a separate PR here and to QCEngine.

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging 4d196603cd534adc0b8c8f7579aaee24abf46bf7 into 00d989919419b8ad7893631e7b5224f605310aab - view on LGTM.com

new alerts:

dotsdl commented 2 years ago

From live testing with a local qcfractal-manager, we only need the changes introduced in this PR to propagate KeyboardInterrupt into the process pool workers that execute calls to QCEngine compute and compute_procedure. Sending a SIGINT, SIGHUP, or SIGTERM to the manager cleanly kills the manager and child psi4 calls down through QCEngine with a process pool, and all scratch files are cleaned up.

@bennybp we are getting CI failures from psql with all environments. It is not a fundamental issue with the conda-forge postgresql package; installing it alone with e.g.:

conda create -n postgres -c conda-forge postgresql python=3.7

gives a psql --version call that works. Is there a dependency in the CI envs that could be causing the error we're seeing in the CI (reproduced locally):

psql: symbol lookup error: /home/david/.conda/envs/qcfractal-base/bin/../lib/libreadline.so.8: undefined symbol: tputs

?

dotsdl commented 2 years ago

Or alternatively, should we be restricting the postgresql version in these environments to some max version?

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging 3162eb764abaf5da918ea4437ad3668e4d1f57cf into 00d989919419b8ad7893631e7b5224f605310aab - view on LGTM.com

new alerts:

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging 13f8878d0b5e823a11665050a843494e97f6011a into 00d989919419b8ad7893631e7b5224f605310aab - view on LGTM.com

new alerts:

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging 5fe7d544025f7fbac9c2619317796a725d1d607f into 00d989919419b8ad7893631e7b5224f605310aab - view on LGTM.com

new alerts:

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging d24cc4e9ee28f67b221d80af574fc56dc7f4356f into 00d989919419b8ad7893631e7b5224f605310aab - view on LGTM.com

new alerts:

dotsdl commented 2 years ago

Not sure what the root cause of the CI failures are now. Superficially they appear unrelated to manager execution, but not certain.

codecov[bot] commented 2 years ago

Codecov Report

Merging #700 (6e77453) into master (8b96203) will increase coverage by 0.00%. The diff coverage is 92.15%.

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging 17a3bcfb2a0bd6b0e2774393aa07ba60aef5d294 into 00d989919419b8ad7893631e7b5224f605310aab - view on LGTM.com

new alerts:

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging 94739c168f3ec4820f642a5d73835cf4e3232b7a into 00d989919419b8ad7893631e7b5224f605310aab - view on LGTM.com

new alerts:

dotsdl commented 2 years ago

@bennybp looks like we're good to go here! Thanks again for all your help!

dotsdl commented 2 years ago

@bennybp can you review and merge? If we get this and #701 in we should be able to cut a release that we can roll out to production managers.

lgtm-com[bot] commented 2 years ago

This pull request introduces 1 alert when merging 6e774536404da7258ecf26c921338f42a89713fa into 8b96203ce0e28cf6322a599af0ef87c5b534659a - view on LGTM.com

new alerts:

dotsdl commented 2 years ago

Looks like all tests still pass with both QCElemental 0.24 and QCEngine 0.22!

dotsdl commented 2 years ago

Awesome, thank you @bennybp!