adamkewley / jobson

A platform for transforming command-line applications into a job service.
Apache License 2.0
256 stars 20 forks source link

Job keep running #67

Closed Russel88 closed 3 years ago

Russel88 commented 3 years ago

Hi Adam

I have a weird problem that I hope you can help me solve. I have a script that runs fine when I invoke it through the command-line, but when the exact same script with same arguments is invoked by jobson it stalls, that is, it just keeps saying "Running" forever.

The part of the script that stalls is some perl commands that retrieves information from a website.

Now to the weird part. The output from the problematic code block is directed to a file. That file is empty while jobson says "Running", but as soon as I abort the job, the expected output is written to the file, but the rest of the script is of course not run.

Do you have any idea what could be causing this, and is it even jobson related? I have tried putting the problematic code in a script by it self and invoking that in a sub shell to no avail.

Any help is much appreciated!

Regards, Jakob

adamkewley commented 3 years ago

Hm, bizzare.

Ultimately, Jobson's just running a subprocess with the same arguments as your command-line. It doesn't really do much more than that (as in, there isn't any bizarre process interception going on in Jobson). No tracing, no custom signal handlers (apart from what's typical for a child process), etc.

jobson's execution environment does differ from an interactive shell (e.g. bash or zsh) in a few ways though:

For the file being empty until you abort the process, that's not necessarily Jobson-related. When you click abort in the UI (or API) then, server-side, the jobson server is sending a SIGTERM (or SIGINT - it's been a while) to the process, followed by waiting for some maximum amount of time, followed by sending a SIGKILL if the process still hasn't died.

Some applications buffer their output internally before writing to a file and will only flush once the buffer is full (e.g. 4096 bytes of data were written to it) or when the application is terminating from a signal (e.g. platforms use something like atexit to install a function that flushes all buffers just before the application dies). So I imagine your application is:

This seems to indicate that your process is at least terminating gracefully, because a SIGKILL (which jobson would send eventually, if the process was truly deadlocked) would nuke the process from orbit and give it no chance to flush buffers.

adamkewley commented 3 years ago

@Russel88 do you have a copy of this script, or a basic example that I can try locally?

Russel88 commented 3 years ago

Hi Adam

Thank you for that detailed answer! I will test it some more.

It is a one-liner, which uses an external command. MYURL=$(esearch -db assembly -query "${1}" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | sed 's/\(GCA_.*\)$/\1\/\1_genomic.fna.gz/')

${1} would then be the argument for the script, and the output should be an URL. The input could for example be "GCA_000240185"

So if I run it like bash script.sh GCA_000240185 with an echo $MYURL after the above one-liner the output is: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_genomic.fna.gz

The external programs can be installed via conda: https://anaconda.org/bioconda/entrez-direct or it can be downloaded here: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.tar.gz

adamkewley commented 3 years ago

Note to self: entrez-direct direct install from NCBI (or whoever) without conda seems to explode, going to need to use the conda version:

esearch -db assembly -query "GCA_000240185"
<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>MCID_5fd7b087142cfa3f8e55a34e</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>2</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

# then esummary
esearch -db assembly -query "GCA_000240185" | esummary

ERROR: No -pattern in command-line arguments

ERROR: No -pattern in command-line arguments
adamkewley commented 3 years ago

Reproduced Russel's command fine on a clean docker container /w conda.

Nothing about the command seems unusual and, as far as Jobson's concerned, the assignment syntax (x=$(cmd)) is going to manifest as bash spawning subprocesses (which jobson doesn't see because all it sees is a black box top-level process). My only guess at this point is that bash does something unusual w.r.t. signals as subprocesses are spawned/die, or that this particular script outputs data in a weird way (shouldn't affect jobson), or that the piping causes issues (e.g. a SIGPIPE becomes important).

Installed jobson into the container with:

wget https://github.com/adamkewley/jobson/releases/download/1.0.10/jobson_1.0.10_all.deb
apt-get install -yf ./jobson_1.0.10_all.deb

Used jobson generate spec russel88 to generate spec:

name: Generated job spec

description: >
  A job spec generated by jobson. The admin should've changed this.

expectedInputs: []

execution:

  application: bash

  arguments:  # Optional
  - "-c"
  - 'esearch -db assembly -query "GCA_000240185" | esummary |  xtract -pattern DocumentSummary -element FtpPath_GenBank | sed "s/\(GCA_.*\)$/\1\/\1_genomic.fna.gz/"'

Used jobson generate request russel88 > req.json to generate blank request:

{
  "spec" : "russel88",
  "name" : "Dolorem veritatis id neque occaecati molestiae hic voluptatem vero.",
  "inputs" : { }
}

Then ran it using jobson run req.json, which deadlocks, as @Russel88 reports. The deadlocking process looks like this:

    1 root      20   0   18640   3596   3092 S   0.0  0.0   0:00.11 bash                         
 8197 root      20   0   18376   3088   2836 S   0.0  0.0   0:00.00  `- jobson                   
 8198 root      20   0 7723184 170772  28076 S   0.0  1.0   0:04.17      `- java                 
 8226 root      20   0   18376   3124   2872 S   0.0  0.0   0:00.00          `- bash             
 8229 root      20   0   78664  30368   7988 S   0.0  0.2   0:00.13              `- perl         
 8231 root      20   0   78696  30456   8076 S   0.0  0.2   0:00.13              `- perl         
 8232 root      20   0  705948  10536   2152 S   0.0  0.1   0:00.00              `- xtract       
 8233 root      20   0   13560   1116   1012 S   0.0  0.0   0:00.00              `- sed          
 8175 root      20   0   18640   3536   3056 S   0.0  0.0   0:00.03 bash                         
 8196 root      20   0   36596   3164   2732 R   0.0  0.0   0:00.09  `- top

Which is exactly within expectations, given how the pipe syntax works in bash.

Stripping this problem down until it doesn't crash, this still crashes bash -c 'esearch -db assembly -query "GCA_000240185"'. And, removing bash from the equation, this also crashes:

name: Generated job spec

description: >
  A job spec generated by jobson. The admin should've changed this.

expectedInputs: []

execution:

  application: esearch

  arguments:  # Optional
  - '-db'
  - assembly 
  - '-query'
  - GCA_000240185

So we can (for now) ignore bash, piping syntax, or anything like that. There's something specifically screwey about how esearch interacts with Jobson.

Quick read through esearch indicates that it's actually a bash script that execs into a perl script after it handles args, so first thing to ensure is that exec works as intended with Jobson:

execution:

  application: bash
  arguments: [-c, 'exec ls /']

All fine. Maybe perl is specifically broken?:

execution:

  application: perl
  arguments: [-e, 'print "hello, world"']

Also fine. So this is very specifically about how esearch uses perl when under jobson can cause a deadlock. Wonderful.

Unfortunately, I'm out of time for tonight's hacking session to figure out the details. It's looking like I'll have to modify esearch slightly to try some before/after analysis, or I'll have to attach strace or gdb to the running processes to figure out what thread is deadlocked on what. It seems like something I didn't anticipate is happening in either jobson or perl here, though - maybe some signal flags or something.

adamkewley commented 3 years ago

As always, sorry for the delay, understandable if you've moved onto better supported systems, etc.

Investigating this more, the deadlock in esearch, from the perl side, is that it's trying to read from stdin:

#0  0x00007f527a04d50e in __libc_read (fd=0, buf=0x5621a0ccd9d0, nbytes=8192) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x000056219ee63c0d in PerlIOUnix_read ()
#2  0x000056219ee671b8 in PerlIOBuf_fill ()
#3  0x000056219ee65748 in Perl_PerlIO_fill ()
#4  0x000056219ee670b0 in PerlIOBase_read ()
#5  0x000056219ee690a8 in PerlIO_getc ()
#6  0x000056219edfeaf5 in Perl_sv_gets ()
#7  0x000056219eddc018 in Perl_do_readline ()
#8  0x000056219edd6f26 in Perl_runops_standard ()
#9  0x000056219ed57d67 in perl_run ()
#10 0x000056219ed2f442 in main ()

And jobson is waiting for perl to write to its stdout/stderr streams (by convention, these should close automatically if the child process exits):

─── Threads ────────────────────────────────────────────────────────────────────
[27] id 17598 name Thread-3 from 0x00007f4837c2b00c in futex_wait_cancelable+351 at ../sysdeps/unix/sysv/linux/futex-internal.h:88
[26] id 17597 name Thread-2 from 0x00007f4837c2e544 in __libc_read+20 at ../sysdeps/unix/sysv/linux/read.c:26
[25] id 17596 name Thread-1 from 0x00007f4837c2e544 in __libc_read+20 at ../sysdeps/unix/sysv/linux/read.c:26
[24] id 17593 name AsyncAppender-W from 0x00007f4837c2b00c in futex_wait_cancelable+351 at ../sysdeps/unix/sysv/linux/futex-internal.h:88
[23] id 17591 name process reaper from 0x00007f4837c2f0ca in __waitpid+74 at ../sysdeps/unix/sysv/linux/waitpid.c:30

# other stuff

# thread 26:
#0  __libc_read (nbytes=8192, buf=0x7f47bb4f85a0, fd=134) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=134, buf=0x7f47bb4f85a0, nbytes=8192) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f48364c2938 in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
#3  0x00007f48364c2144 in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
#4  0x00007f48188a8370 in ?? ()
#5  0x00007f4830ea7000 in ?? ()
#6  0x00007f48372d3f4d in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so
#7  0x00007f48188a2600 in ?? ()
#8  0x0000000000000000 in ?? ()

This is all within expectations, given the current jobson implementation: it's launched an esearch process and is waiting for it to write something (if it writes anything) and will wait on its exit code. It's Java, but it's effectively a thin layer over a wait and two reads from some pipes. Investigating which fd it's reading from confirms this:

# the fd jobson is waiting on: read-side pipe
ls -la /proc/17566/fd | grep 134
lr-x------ 1 adam adam 64 Dec 30 11:36 134 -> pipe:[166202]

# esearch: 166202 is fd == 1, which is its stdout fd
ls -la /proc/17594/fd | grep 166202
l-wx------ 1 adam adam 64 Dec 30 11:00 1 -> pipe:[166202]

Why is it blocking on stdin specifically when running via jobson though? jobson doesn't generally use stdin at all (although it doesn't preemptively close it, either). The esearch invocation, when ran manually, doesn't block for user input, so there's something different about how it's running under jobson.

However, the arguments--as pulled directly from linux, not Jobson--are within expectations:

cat /proc/17594/cmdline <(echo "") | tr '\0' ' '
perl /home/adam/anaconda3/bin/edirect.pl -search -db assembly -query GCA_000240185

# run it:
(cat /proc/17594/cmdline <(echo "") | tr '\0' ' ')
<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>MCID_5fec6a1cc826f409ad559ad8</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>2</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

And diffing the environment variables of my "raw" terminal with those set by Jobson shows:

< GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/ae9f030d_6062_4b19_a573_9502386f8400
---
> GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/d2edf238_08ba_4f48_a81f_1a22591c0f26
27a28
> OLDPWD=/home/adam/Desktop
30c31
< PWD=/home/adam
---
> PWD=/home/adam/Desktop/iss67
36c37
< SHLVL=0
---
> SHLVL=2
42c43
< _=/usr/bin/env
---
> _=/usr/bin/java

Also nothing remarkable (SHLVL essentially just shows how "nested" the invocation is and is ignored by the majority of software). Even manually overriding those vars to match Jobson still lets the application run:

SHLVL=2 _=/usr/bin/java PWD=/home/adam/Desktop/iss67 perl /home/adam/anaconda3/bin/edirect.pl -search -db assembly -query GCA_000240185
<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>MCID_5fec6ac9da93674c76610492</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>2</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
adamkewley commented 3 years ago

Found Problem: esearch changes its behavior if it detects standard input from a non-tty

Going back to my initial suspicion, which is that esearch will read from stdin if it detects that stdin is not a TTY, this minimal example does block, just as with jobson:

mkfifo a
perl /home/adam/anaconda3/bin/edirect.pl -search -db assembly -query GCA_000240185 < a

Logically, it shouldn't (why doesn't it block when called manually), but it seems like esearch does have special behavior when it detects that it's standard input is connected to something that isn't a tty and it will block trying to read it (above, it will block indefinitely until I append to a).

It looks like it's an undocumented "feature" of esearch (this isn't mentioned in --help) that it can read query strings from its standard input. This enables behavior like:

printf "GCA_000240185\nGCA_000240184" > query
perl /home/adam/anaconda3/bin/edirect.pl -search -query ignored -db assembly < query
<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>MCID_5fec6e4b9ba75c2aa603cc82</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>0</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

Although this "feature" isn't so obvious. Does it read multiple queries, one per line? Does it still execute the (required) -query provided at the CLI? No idea. A quick skim of the (separate) online documentation (https://www.ncbi.nlm.nih.gov/books/NBK25499/) doesn't seem to mention it.

Either way @Russel88 , the quick "hacky" fix is to pump blank input into esearch and see if it still obeys the -query argument:

perl /home/adam/anaconda3/bin/edirect.pl -search -query GCA_000240185 -db assembly < <(echo "")

(or try the reverse, if it does read the input but ignore the arg).

One hotfix I can make to Jobson is to have it close stdin on bootup, because stdin shouldn't be used by any jobson application anyway. I'll see how easy this is and get back.

adamkewley commented 3 years ago

Created a patch (above) that should fix this behavior (hopefully). Staging it for 1.0.11 here:

https://travis-ci.org/github/adamkewley/jobson/builds/752118316

adamkewley commented 3 years ago

The proposed (patched) jobson will should work with esearch:

(base) adam@akmain:~/Desktop/iss67$ jobson --help
usage: java -jar jobson-1.0.11.jar
       [-h] [-v] {server,check,new,generate,users,validate,run} ...

positional arguments:
  {server,check,new,generate,users,validate,run}
                         available commands

named arguments:
  -h, --help             show this help message and exit
  -v, --version          show the application version and exit
(base) adam@akmain:~/Desktop/iss67$ jobson run req.json 
<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>MCID_5fec78239ccb3879584ad42b</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>2</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

And this has also fixed a nasty deadlocking bug that could've bit other users who didn't report it, so it's a good catch - thanks for reporting this @Russel88

CI looks like it's having issues that are unrelated to the patch (usual CI thing, something's changed, build scripts need kicking etc.). I'll try and fix CI and get the release rolled out for you. Once it's rolled out, it should be a straight upgrade (there are no breaking changes).

adamkewley commented 3 years ago

Fixed CI, so the 1.0.11 release is now on the rails and should hopefully automatically release once CI completes (the deployment step can also fail due to config changes, but is usually stable). Build here:

https://travis-ci.org/github/adamkewley/jobson/builds/752120558

Once 1.0.11 is automatically published, it should be on the release page and on dockerhub.

adamkewley commented 3 years ago

@Russel88 this should be fixed in this release:

https://github.com/adamkewley/jobson/releases/tag/1.0.11

Please get back to me if there's still a problem!

Russel88 commented 3 years ago

Thanks a lot for digging into this! Very useful I have included the hacky fix, and it works like a charm. When I have the time, I will try it out with the new jobson version.

By the way, I think you are providing excellent support. And jobson is working much smoother than the average bioinformatic webserver. And importantly, my users seem satisfied :)

Cheers and happy new years, Jakob