genome-nexus / genome-nexus-vep

Java spring boot wrapper around VEP
https://www.genomenexus.org
MIT License
3 stars 14 forks source link

Managing failed VEP CLI child processes #6

Closed sheridancbio closed 4 years ago

sheridancbio commented 4 years ago

We know that when a VEP wrapper-VEP CLI process fails, the CLI doesn't properly clean child processes. We need a way to kill these processes before resources are exhausted.

sheridancbio commented 4 years ago

@averyniceday and @mandawilson did testing with me to clarify this issue.

The vep command line tool packaged in to the built docker image will fork a number of child processes while running. When VepRunner creates the initial process running the vep command line tool, it enters a loop where it waits for a short period of time for the process to exit. After each iteration it compares the current time to a calculated future time at which the process should be killed if it has not yet completed.

The code will kill the process by a java call to Process.destroy(). We observed that as the process runs, child processes complete and will be reforked up to the fork count limit. The completed processes accumulate as unix zombie processes. This is the picture about one minute into a normal run:

$ ps axo pid,ppid,pgid,sid,state,cmd | cat PID PPID PGID SID S CMD 1 0 1 1 S java -jar vep_wrapper.jar 61 0 61 61 S /bin/bash 78 1 1 1 Z [perl] <defunct> 79 1 1 1 Z [perl] <defunct> 80 1 1 1 Z [perl] <defunct> 81 1 1 1 Z [perl] <defunct> 82 1 1 1 Z [perl] <defunct> 84 1 1 1 Z [perl] <defunct> 85 1 1 1 Z [perl] <defunct> 88 1 1 1 Z [perl] <defunct> 89 1 1 1 Z [perl] <defunct> 90 1 1 1 Z [perl] <defunct> 91 1 1 1 Z [perl] <defunct> 112 1 1 1 Z [perl] <defunct> 113 1 1 1 Z [perl] <defunct> 114 1 1 1 Z [perl] <defunct> 115 1 1 1 Z [perl] <defunct> 116 1 1 1 Z [perl] <defunct> 119 1 1 1 Z [perl] <defunct> 120 1 1 1 Z [perl] <defunct> 121 1 1 1 Z [perl] <defunct> 122 1 1 1 Z [perl] <defunct> 123 1 1 1 Z [perl] <defunct> 124 1 1 1 Z [perl] <defunct> 125 1 1 1 Z [perl] <defunct> 126 1 1 1 Z [perl] <defunct> 129 1 1 1 Z [perl] <defunct> 130 1 1 1 Z [perl] <defunct> 149 1 1 1 S perl /opt/vep/src/ensembl-vep/vep --cache --offline --everything --hgvsg --xref_refseq --assembly GRCh37 --format region --fork 4 --fasta /opt/vep/.vep/homo_sapiens/98_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --json -i /opt/vep/.vep/tmp/vep_input-8187024126048823470.txt -o STDOUT --no_stats 155 149 1 1 R perl /opt/vep/src/ensembl-vep/vep --cache --offline --everything --hgvsg --xref_refseq --assembly GRCh37 --format region --fork 4 --fasta /opt/vep/.vep/homo_sapiens/98_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --json -i /opt/vep/.vep/tmp/vep_input-8187024126048823470.txt -o STDOUT --no_stats 156 149 1 1 R perl /opt/vep/src/ensembl-vep/vep --cache --offline --everything --hgvsg --xref_refseq --assembly GRCh37 --format region --fork 4 --fasta /opt/vep/.vep/homo_sapiens/98_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --json -i /opt/vep/.vep/tmp/vep_input-8187024126048823470.txt -o STDOUT --no_stats 157 149 1 1 R perl /opt/vep/src/ensembl-vep/vep --cache --offline --everything --hgvsg --xref_refseq --assembly GRCh37 --format region --fork 4 --fasta /opt/vep/.vep/homo_sapiens/98_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --json -i /opt/vep/.vep/tmp/vep_input-8187024126048823470.txt -o STDOUT --no_stats 158 149 1 1 R perl /opt/vep/src/ensembl-vep/vep --cache --offline --everything --hgvsg --xref_refseq --assembly GRCh37 --format region --fork 4 --fasta /opt/vep/.vep/homo_sapiens/98_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --json -i /opt/vep/.vep/tmp/vep_input-8187024126048823470.txt -o STDOUT --no_stats 159 149 1 1 R perl /opt/vep/src/ensembl-vep/vep --cache --offline --everything --hgvsg --xref_refseq --assembly GRCh37 --format region --fork 4 --fasta /opt/vep/.vep/homo_sapiens/98_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --json -i /opt/vep/.vep/tmp/vep_input-8187024126048823470.txt -o STDOUT --no_stats 162 61 162 61 R ps axo pid,ppid,pgid,sid,state,cmd 163 61 162 61 S cat

The zombie processes have been detached from the launched perl process and are now children of process 1 (which on the running docker container is the java jvm process). These are not automatically cleaned up by the JVM as they would be by the normal unix init process.

Additionally, when the launched perl process is killed with Process.destroy(), that process (pid #149 above) is indeed eliminated, but the forked child processes (pid #155, #156, #157, #158, #159) remain on the system in a Sleep/wait state. When their parent (#149) is destroyed, these child processes become children to process #1.

sheridancbio commented 4 years ago

One improvement we discovered was to change from using Process.destroy() to using Process.destroyForcibly(). Using that call, the zombies of completed forked child processes are cleaned up, but the running child processes remain on the system in the (Interruptible)Sleep state "S".

We discovered that java9 includes support for more detailed processes control on unix systems, such as obtaining PID numbers. However, in order to use those features we would have needed to update our development and deployment platforms to use openjdk-9-jre or openjdk-11-jre. We experimented briefly with this but decided to not go down the path of a general platform migration.

Instead, we found a suggestion that java reflection could allow the determination of process id on a unix platform. That was developed and the current strategy we are pursuing is to identify the PID of the launched vep process and to track it as a desired process, and then to periodically scan the list of running processes using the "ps" utility and to forcibly kill all processes

sheridancbio commented 4 years ago

The code strategy for our docker container deployment is now largely in place.

Processes which are running are tracked in a list of launched processes by SystemProcessManager. This allows distinguishing the launched sleeping processes from sleeping child processes of launched but later destroyed VEP process threads (causing the children to be reassigned as children of the JVM process).

The system determines what abilities are available on the current system and only attempt to do orphaned child process killing and zombie reaping if the needed capabilities are available:

It is observed that when the final orphan child process of a launched VEP process is killed, the stdout stream of the VEP command line tool will finally be closed by the system. So the forceful interruption of the StreamProcessor threads has been simplified to simply wait until the stream processor threads have completed processing and exited.

sheridancbio commented 4 years ago

Several small bugs have been fixed. One new bug has been discovered during testing today. This code in the section which removes (normally) completed processes from the launchedProcesses collection has an inherent bug:

for (Process p : launchedProcesses) { if (!p.isAlive()) { launchedProcesses.remove(p); }

Modifying a collection in this way, within a loop which is iterating over the collection, generates this exception: Exception in thread "ReaperDaemon" java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445) at java.util.HashMap$KeyIterator.next(HashMap.java:1469) at org.genomenexus.vep_wrapper.SystemProcessManager.run(SystemProcessManager.java:145)

This will be corrected.

sheridancbio commented 4 years ago

Correction for above bug completed with an iterator.

Also added request for shutdown in order to exit the reaper daemon thread.

Documentation has been completed.