classner / pymp

Easy, OpenMP style multiprocessing for Python on Unix.
MIT License

Child processes die without error #8

Closed susannasiebert closed 5 years ago

susannasiebert commented 5 years ago

We've been using pymp to implement nested parallelization for one of our long-running subroutines. The code can be found here: https://github.com/griffithlab/pVACtools/blob/master/lib/pipeline.py#L335. Unfortunately, we've been running into a couple of problems: 1) One of the child processes encounters an error, but the main process does not die; instead, it hangs indefinitely. 2) Child processes disappear, and the main process hangs indefinitely.
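For context, the nested pattern looks roughly like this (a sketch, not the pVACtools code itself; the loop bounds and values are made up for illustration, and it assumes pymp's documented pymp.config.nested switch and pymp.shared.list container):

```python
collected = []
try:
    import pymp
    pymp.config.nested = True      # nested parallel regions are off by default
    shared = pymp.shared.list()    # fork-safe result container (Manager-backed)
    with pymp.Parallel(2) as outer:
        for i in outer.range(4):
            # A second parallel region inside the first one.
            with pymp.Parallel(2) as inner:
                for j in inner.range(4):
                    shared.append(i * 10 + j)
    collected = sorted(shared)
except ImportError:
    # Serial equivalent when pymp is unavailable.
    collected = sorted(i * 10 + j for i in range(4) for j in range(4))
```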

mmclella   9919  0.0  0.0   4192   756 ?        Ss   Mar12   0:01 /usr/bin/tini -- bash run_pvacvector_vaccine_test_lowest_v1-4-0d-v17_b500_33mer.sh
mmclella  10028  0.0  0.0  11316  3004 ?        S    Mar12   0:00  \_ bash run_pvacvector_vaccine_test_lowest_v1-4-0d-v17_b500_33mer.sh
mmclella  10044  0.2  0.1 11351820 628152 ?     Sl   Mar12   2:34      \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory /opt/ie
mmclella  10992  0.0  0.0  44520 10820 ?        S    Mar12   0:00          \_ /opt/conda/bin/python -c from multiprocessing.semaphore_tracker import main;main(6)
mmclella  11055  0.0  0.0 9006028 140852 ?      Sl   Mar12   0:06          \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory /op
mmclella  16691  0.1  0.1 11331536 563368 ?     Sl   Mar12   1:43          \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory /op
mmclella  57962  0.0  0.1 11321892 528464 ?     S    Mar12   0:04          |   \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory
mmclella  58024  0.0  0.0      0     0 ?        Z    Mar12   0:05          |       \_ [pvacvector] <defunct>
mmclella  58038  0.0  0.0      0     0 ?        Z    Mar12   0:00          |       \_ [pvacvector] <defunct>
mmclella  21258  0.0  0.0 10392616 481448 ?     Sl   Mar12   0:06          \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory /op
mmclella  21305  0.0  0.0 10406620 521644 ?     S    Mar12   0:04          \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory /op
mmclella  21334  0.0  0.0      0     0 ?        Z    Mar12   0:04          |   \_ [pvacvector] <defunct>
mmclella  21337  0.0  0.0      0     0 ?        Z    Mar12   0:00          |   \_ [pvacvector] <defunct>
mmclella  36657  0.0  0.0 10421580 510900 ?     Sl   Mar12   0:06          \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory /op
mmclella   5567  0.0  0.1 11351820 537380 ?     Sl   Mar12   0:05          \_ /opt/conda/bin/python /opt/conda/bin/pvacvector run -m lowest -k -t 16 -b 500 -e 8,9,10,11 --iedb-install-directory /op

Unfortunately, these situations seem to be non-deterministic, so I haven't been able to construct a test case that reliably reproduces the problem. Could you take a look at our implementation and see whether I did something incorrectly, or do you have any other suggestions on how to debug this?

classner commented 5 years ago

Hi Susanna,

I'm glad you find the library useful! Hmm, this will be hard to reproduce and track down. Child processes dying silently is a bad situation. I tried to convey error messages to the parent process as clearly and directly as possible, but processes that die outright are tough to handle. Is there any way you can prevent the child processes from dying in the first place and have them return gracefully instead?
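One way to follow that advice (a minimal sketch; the safe_call and risky names are made up for illustration and are not part of pymp) is to catch all exceptions inside the worker and return an error record, so the process exits normally and the parent can inspect the failures afterwards:

```python
def safe_call(func, *args, **kwargs):
    """Run func, turning any exception into an error record so the
    worker returns normally instead of dying."""
    try:
        return ("ok", func(*args, **kwargs))
    except Exception as exc:
        # The parent can scan for these records after the parallel section.
        return ("error", repr(exc))

def risky(x):
    if x < 0:
        raise ValueError("negative input")
    return x * 2

# Inside a parallel loop you would call safe_call(risky, item) per item.
results = [safe_call(risky, x) for x in (3, -1, 5)]
```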

As for debugging tips, you can pass if_ to a parallel section to temporarily disable parallel processing. I have found this very helpful on many occasions because you then get stack traces for the errors.
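A minimal sketch of that tip (the DEBUG flag and process function are made up for illustration): with if_=False the section runs serially in the parent process, so an uncaught exception surfaces as an ordinary traceback instead of killing a forked child.

```python
DEBUG = True  # set to False to re-enable multiprocessing

def process(i):
    return i * i

results = {}
try:
    import pymp
    # if_=False makes the "parallel" section run serially in this process,
    # so the plain dict works and exceptions produce normal stack traces.
    with pymp.Parallel(4, if_=not DEBUG) as p:
        for i in p.range(8):
            results[i] = process(i)
except ImportError:
    # pymp not available here; the equivalent serial loop.
    for i in range(8):
        results[i] = process(i)
```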

classner commented 5 years ago

I can only guess that the script is stuck here: https://github.com/classner/pymp/blob/master/pymp/__init__.py#L128, waiting for the subprocess to finish. However, waitpid should be reliable even when a process 'dies' instead of stopping properly. If you want to dig deeper, use the logging module and set the log level for the pymp module to DEBUG.
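That could look like the following (a sketch assuming pymp emits its messages through the standard module-level logger named "pymp"):

```python
import logging

# Send all log records, including DEBUG, to stderr with the process name,
# which helps tell parent and child output apart.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(processName)s %(name)s: %(message)s",
)
# If other libraries become too noisy at DEBUG, raise the root level back
# up and keep only pymp's logger verbose.
logging.getLogger("pymp").setLevel(logging.DEBUG)
```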

susannasiebert commented 5 years ago

Thank you for the pointers. We're still not quite sure what was happening, but we were able to resolve the issue by switching from nested multiprocessing to single-level multiprocessing.

classner commented 5 years ago

Glad that you could resolve it! Let me know should you have further questions or insights on this.