Closed Russel88 closed 3 years ago
Hm, bizzare.
Ultimately, Jobson's just running a subprocess with the same arguments as your command-line. It doesn't really do much more than that (as in, there isn't any bizarre process interception going on in Jobson). No tracing, no custom signal handlers (apart from what's typical for a child process), etc.
jobson
's execution environment does differ from an interactive shell (e.g. bash
or zsh
) in a few ways though:
tty
, and any checks inside the script can see this (under the hood, they could call isatty
- this would return 1
in your terminal and 0
in Jobson: even if the script is inside another script (unless the inner script redirects output into another application or file)
echo "" > in && ./app <in 2>&1 1>out
). That way, all input/output pipes will return false
for a isatty
check.jobson
does not use a shell (i.e. bash
) to execute subprocesses, so if your script is dependent on (e.g.) anything in .bashrc
then the script will fail when running under Jobson.
bash
from Jobson, rather than directly invoking perl
stdout
and stderr
outputs, which is how it shows output in the UI. In principle, this shouldn't be an issue because (e.g.) terminal emulators effectively need to do the same (as far as I'm aware, a terminal emulator is just reading those streams, like jobson
would, and rendering the text to the UI).
jobson
because the order of outputs were mixed with exit codes (or something bizzare like that)For the file being empty until you abort the process, that's not necessarily Jobson-related. When you click abort
in the UI (or API) then, server-side, the jobson
server is sending a SIGTERM
(or SIGINT
- it's been a while) to the process, followed by waiting for some maximum amount of time, followed by sending a SIGKILL
if the process still hasn't died.
Some applications buffer their output internally before writing to a file and will only flush once the buffer is full (e.g. 4096
bytes of data were written to it) or when the application is terminating from a signal (e.g. platforms use something like atexit
to install a function that flushes all buffers just before the application dies). So I imagine your application is:
open
ing the file (as an empty file)SIGTERM
/SIGINT
ed from jobson
because a user clicked abort
atexit
, which populates the fileThis seems to indicate that your process is at least terminating gracefully, because a SIGKILL
(which jobson
would send eventually, if the process was truly deadlocked) would nuke the process from orbit and give it no chance to flush buffers.
@Russel88 do you have a copy of this script, or a basic example that I can try locally?
Hi Adam
Thank you for that detailed answer! I will test it some more.
It is a one-liner, which uses an external command.
MYURL=$(esearch -db assembly -query "${1}" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | sed 's/\(GCA_.*\)$/\1\/\1_genomic.fna.gz/')
${1} would then be the argument for the script, and the output should be an URL. The input could for example be "GCA_000240185"
So if I run it like bash script.sh GCA_000240185
with an echo $MYURL
after the above one-liner the output is:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_genomic.fna.gz
The external programs can be installed via conda: https://anaconda.org/bioconda/entrez-direct or it can be downloaded here: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.tar.gz
Note to self: entrez-direct
direct install from NCBI (or whoever) without conda seems to explode, going to need to use the conda version:
esearch -db assembly -query "GCA_000240185"
<ENTREZ_DIRECT>
<Db>assembly</Db>
<WebEnv>MCID_5fd7b087142cfa3f8e55a34e</WebEnv>
<QueryKey>1</QueryKey>
<Count>2</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
# then esummary
esearch -db assembly -query "GCA_000240185" | esummary
ERROR: No -pattern in command-line arguments
ERROR: No -pattern in command-line arguments
Reproduced Russel's command fine on a clean docker
container /w conda.
Nothing about the command seems unusual and, as far as Jobson's concerned, the assignment syntax (x=$(cmd)
) is going to manifest as bash
spawning subprocesses (which jobson
doesn't see because all it sees is a black box top-level process). My only guess at this point is that bash
does something unusual w.r.t. signals as subprocesses are spawned/die, or that this particular script outputs data in a weird way (shouldn't affect jobson), or that the piping causes issues (e.g. a SIGPIPE becomes important).
Installed jobson
into the container with:
wget https://github.com/adamkewley/jobson/releases/download/1.0.10/jobson_1.0.10_all.deb
apt-get install -yf ./jobson_1.0.10_all.deb
Used jobson generate spec russel88
to generate spec:
name: Generated job spec
description: >
A job spec generated by jobson. The admin should've changed this.
expectedInputs: []
execution:
application: bash
arguments: # Optional
- "-c"
- 'esearch -db assembly -query "GCA_000240185" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | sed "s/\(GCA_.*\)$/\1\/\1_genomic.fna.gz/"'
Used jobson generate request russel88 > req.json
to generate blank request:
{
"spec" : "russel88",
"name" : "Dolorem veritatis id neque occaecati molestiae hic voluptatem vero.",
"inputs" : { }
}
Then ran it using jobson run req.json
, which deadlocks, as @Russel88 reports. The deadlocking process looks like this:
1 root 20 0 18640 3596 3092 S 0.0 0.0 0:00.11 bash
8197 root 20 0 18376 3088 2836 S 0.0 0.0 0:00.00 `- jobson
8198 root 20 0 7723184 170772 28076 S 0.0 1.0 0:04.17 `- java
8226 root 20 0 18376 3124 2872 S 0.0 0.0 0:00.00 `- bash
8229 root 20 0 78664 30368 7988 S 0.0 0.2 0:00.13 `- perl
8231 root 20 0 78696 30456 8076 S 0.0 0.2 0:00.13 `- perl
8232 root 20 0 705948 10536 2152 S 0.0 0.1 0:00.00 `- xtract
8233 root 20 0 13560 1116 1012 S 0.0 0.0 0:00.00 `- sed
8175 root 20 0 18640 3536 3056 S 0.0 0.0 0:00.03 bash
8196 root 20 0 36596 3164 2732 R 0.0 0.0 0:00.09 `- top
Which is exactly within expectations, given how the pipe syntax works in bash
.
Stripping this problem down until it doesn't crash, this still crashes bash -c 'esearch -db assembly -query "GCA_000240185"'
. And, removing bash
from the equation, this also crashes:
name: Generated job spec
description: >
A job spec generated by jobson. The admin should've changed this.
expectedInputs: []
execution:
application: esearch
arguments: # Optional
- '-db'
- assembly
- '-query'
- GCA_000240185
So we can (for now) ignore bash, piping syntax, or anything like that. There's something specifically screwey about how esearch
interacts with Jobson.
Quick read through esearch
indicates that it's actually a bash script that exec
s into a perl script after it handles args, so first thing to ensure is that exec
works as intended with Jobson:
execution:
application: bash
arguments: [-c, 'exec ls /']
All fine. Maybe perl is specifically broken?:
execution:
application: perl
arguments: [-e, 'print "hello, world"']
Also fine. So this is very specifically about how esearch
uses perl
when under jobson
can cause a deadlock. Wonderful.
Unfortunately, I'm out of time for tonight's hacking session to figure out the details. It's looking like I'll have to modify esearch
slightly to try some before/after analysis, or I'll have to attach strace
or gdb
to the running processes to figure out what thread is deadlocked on what. It seems like something I didn't anticipate is happening in either jobson
or perl
here, though - maybe some signal flags or something.
As always, sorry for the delay, understandable if you've moved onto better supported systems, etc.
Investigating this more, the deadlock in esearch
, from the perl
side, is that it's trying to read from stdin
:
#0 0x00007f527a04d50e in __libc_read (fd=0, buf=0x5621a0ccd9d0, nbytes=8192) at ../sysdeps/unix/sysv/linux/read.c:26
#1 0x000056219ee63c0d in PerlIOUnix_read ()
#2 0x000056219ee671b8 in PerlIOBuf_fill ()
#3 0x000056219ee65748 in Perl_PerlIO_fill ()
#4 0x000056219ee670b0 in PerlIOBase_read ()
#5 0x000056219ee690a8 in PerlIO_getc ()
#6 0x000056219edfeaf5 in Perl_sv_gets ()
#7 0x000056219eddc018 in Perl_do_readline ()
#8 0x000056219edd6f26 in Perl_runops_standard ()
#9 0x000056219ed57d67 in perl_run ()
#10 0x000056219ed2f442 in main ()
And jobson
is waiting for perl
to write to its stdout
/stderr
streams (by convention, these should close automatically if the child process exit
s):
─── Threads ────────────────────────────────────────────────────────────────────
[27] id 17598 name Thread-3 from 0x00007f4837c2b00c in futex_wait_cancelable+351 at ../sysdeps/unix/sysv/linux/futex-internal.h:88
[26] id 17597 name Thread-2 from 0x00007f4837c2e544 in __libc_read+20 at ../sysdeps/unix/sysv/linux/read.c:26
[25] id 17596 name Thread-1 from 0x00007f4837c2e544 in __libc_read+20 at ../sysdeps/unix/sysv/linux/read.c:26
[24] id 17593 name AsyncAppender-W from 0x00007f4837c2b00c in futex_wait_cancelable+351 at ../sysdeps/unix/sysv/linux/futex-internal.h:88
[23] id 17591 name process reaper from 0x00007f4837c2f0ca in __waitpid+74 at ../sysdeps/unix/sysv/linux/waitpid.c:30
# other stuff
# thread 26:
#0 __libc_read (nbytes=8192, buf=0x7f47bb4f85a0, fd=134) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __libc_read (fd=134, buf=0x7f47bb4f85a0, nbytes=8192) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f48364c2938 in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
#3 0x00007f48364c2144 in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
#4 0x00007f48188a8370 in ?? ()
#5 0x00007f4830ea7000 in ?? ()
#6 0x00007f48372d3f4d in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so
#7 0x00007f48188a2600 in ?? ()
#8 0x0000000000000000 in ?? ()
This is all within expectations, given the current jobson
implementation: it's launched an esearch
process and is waiting for it to write something (if it writes anything) and will wait
on its exit code. It's Java, but it's effectively a thin layer over a wait
and two read
s from some pipe
s. Investigating which fd
it's read
ing from confirms this:
# the fd jobson is waiting on: read-side pipe
ls -la /proc/17566/fd | grep 134
lr-x------ 1 adam adam 64 Dec 30 11:36 134 -> pipe:[166202]
# esearch: 166202 is fd == 1, which is its stdout fd
ls -la /proc/17594/fd | grep 166202
l-wx------ 1 adam adam 64 Dec 30 11:00 1 -> pipe:[166202]
Why is it blocking on stdin
specifically when running via jobson
though? jobson
doesn't generally use stdin
at all (although it doesn't preemptively close
it, either). The esearch
invocation, when ran manually, doesn't block for user input, so there's something different about how it's running under jobson.
However, the arguments--as pulled directly from linux, not Jobson--are within expectations:
cat /proc/17594/cmdline <(echo "") | tr '\0' ' '
perl /home/adam/anaconda3/bin/edirect.pl -search -db assembly -query GCA_000240185
# run it:
(cat /proc/17594/cmdline <(echo "") | tr '\0' ' ')
<ENTREZ_DIRECT>
<Db>assembly</Db>
<WebEnv>MCID_5fec6a1cc826f409ad559ad8</WebEnv>
<QueryKey>1</QueryKey>
<Count>2</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
And diffing the environment variables of my "raw" terminal with those set by Jobson shows:
< GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/ae9f030d_6062_4b19_a573_9502386f8400
---
> GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/d2edf238_08ba_4f48_a81f_1a22591c0f26
27a28
> OLDPWD=/home/adam/Desktop
30c31
< PWD=/home/adam
---
> PWD=/home/adam/Desktop/iss67
36c37
< SHLVL=0
---
> SHLVL=2
42c43
< _=/usr/bin/env
---
> _=/usr/bin/java
Also nothing remarkable (SHLVL
essentially just shows how "nested" the invocation is and is ignored by the majority of software). Even manually overriding those vars to match Jobson still lets the application run:
SHLVL=2 _=/usr/bin/java PWD=/home/adam/Desktop/iss67 perl /home/adam/anaconda3/bin/edirect.pl -search -db assembly -query GCA_000240185
<ENTREZ_DIRECT>
<Db>assembly</Db>
<WebEnv>MCID_5fec6ac9da93674c76610492</WebEnv>
<QueryKey>1</QueryKey>
<Count>2</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
esearch
changes its behavior if it detects standard input from a non-ttyGoing back to my initial suspicion, which is that esearch
will read
from stdin
if it detects that stdin
is not a TTY, this minimal example does block, just as with jobson:
mkfifo a
perl /home/adam/anaconda3/bin/edirect.pl -search -db assembly -query GCA_000240185 < a
Logically, it shouldn't (why doesn't it block when called manually), but it seems like esearch
does have special behavior when it detects that it's standard input is connected to something that isn't a tty and it will block trying to read
it (above, it will block indefinitely until I append to a
).
It looks like it's an undocumented "feature" of esearch
(this isn't mentioned in --help
) that it can read query strings from its standard input. This enables behavior like:
printf "GCA_000240185\nGCA_000240184" > query
perl /home/adam/anaconda3/bin/edirect.pl -search -query ignored -db assembly < query
<ENTREZ_DIRECT>
<Db>assembly</Db>
<WebEnv>MCID_5fec6e4b9ba75c2aa603cc82</WebEnv>
<QueryKey>1</QueryKey>
<Count>0</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
Although this "feature" isn't so obvious. Does it read multiple queries, one per line? Does it still execute the (required) -query
provided at the CLI? No idea. A quick skim of the (separate) online documentation (https://www.ncbi.nlm.nih.gov/books/NBK25499/) doesn't seem to mention it.
Either way @Russel88 , the quick "hacky" fix is to pump blank input into esearch
and see if it still obeys the -query
argument:
perl /home/adam/anaconda3/bin/edirect.pl -search -query GCA_000240185 -db assembly < <(echo "")
(or try the reverse, if it does read the input but ignore the arg).
One hotfix I can make to Jobson is to have it close stdin
on bootup, because stdin
shouldn't be used by any jobson
application anyway. I'll see how easy this is and get back.
Created a patch (above) that should fix this behavior (hopefully). Staging it for 1.0.11 here:
https://travis-ci.org/github/adamkewley/jobson/builds/752118316
The proposed (patched) jobson will should work with esearch
:
(base) adam@akmain:~/Desktop/iss67$ jobson --help
usage: java -jar jobson-1.0.11.jar
[-h] [-v] {server,check,new,generate,users,validate,run} ...
positional arguments:
{server,check,new,generate,users,validate,run}
available commands
named arguments:
-h, --help show this help message and exit
-v, --version show the application version and exit
(base) adam@akmain:~/Desktop/iss67$ jobson run req.json
<ENTREZ_DIRECT>
<Db>assembly</Db>
<WebEnv>MCID_5fec78239ccb3879584ad42b</WebEnv>
<QueryKey>1</QueryKey>
<Count>2</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
And this has also fixed a nasty deadlocking bug that could've bit other users who didn't report it, so it's a good catch - thanks for reporting this @Russel88
CI looks like it's having issues that are unrelated to the patch (usual CI thing, something's changed, build scripts need kicking etc.). I'll try and fix CI and get the release rolled out for you. Once it's rolled out, it should be a straight upgrade (there are no breaking changes).
Fixed CI, so the 1.0.11 release is now on the rails and should hopefully automatically release once CI completes (the deployment step can also fail due to config changes, but is usually stable). Build here:
https://travis-ci.org/github/adamkewley/jobson/builds/752120558
Once 1.0.11 is automatically published, it should be on the release page and on dockerhub.
@Russel88 this should be fixed in this release:
https://github.com/adamkewley/jobson/releases/tag/1.0.11
Please get back to me if there's still a problem!
Thanks a lot for digging into this! Very useful I have included the hacky fix, and it works like a charm. When I have the time, I will try it out with the new jobson version.
By the way, I think you are providing excellent support. And jobson is working much smoother than the average bioinformatic webserver. And importantly, my users seem satisfied :)
Cheers and happy new years, Jakob
Hi Adam
I have a weird problem that I hope you can help me solve. I have a script that runs fine when I invoke it through the command-line, but when the exact same script with same arguments is invoked by jobson it stalls, that is, it just keeps saying "Running" forever.
The part of the script that stalls is some perl commands that retrieves information from a website.
Now to the weird part. The output from the problematic code block is directed to a file. That file is empty while jobson says "Running", but as soon as I abort the job, the expected output is written to the file, but the rest of the script is of course not run.
Do you have any idea what could be causing this, and is it even jobson related? I have tried putting the problematic code in a script by it self and invoking that in a sub shell to no avail.
Any help is much appreciated!
Regards, Jakob