Hello @EtiennePer and thanks for your report. Can you reproduce your problem with `cwltool`, and if so, which version?

Does your system have `nodejs` installed? If so, what version? Or do you use the docker/singularity container fallback? Then please let us know your docker engine or Singularity version.
Hello @mr-c ,
Thanks for your quick answer.
I didn't take the time to reproduce the case directly with cwltool, but here are the versions we are using:

We are not using any container fallback.
@EtiennePer are you running this on a Toil cluster / linux environment by any chance? I was able to reproduce this on a Toil cluster, where `cwltool` doesn't seem to terminate the nodejs processes that it spawns (at least on a Toil cluster).

For example, running this test workflow with `cwltool` works, but 2 defunct nodejs processes are left behind:
```yaml
# test.cwl
class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
cwlVersion: v1.2
inputs: []
outputs:
  output: int
expression: "$({'output': 1})"
```

```
# test.json
{}
```
```
root@ip-172-31-18-73:~/tests# ps -Al
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 0 80 0 - 211726 - ? 00:00:00 mesos-master
4 S 0 17 0 0 80 0 - 4574 - pts/0 00:00:00 bash
0 Z 0 45 1 9 80 0 - 0 - pts/0 00:00:00 nodejs <defunct>
0 Z 0 59 1 2 80 0 - 0 - pts/0 00:00:00 nodejs <defunct>
0 R 0 67 17 0 80 0 - 6495 - pts/0 00:00:00 ps
```
With `toil-cwl-runner`, though, Toil wants to clean up all its children, so it sits in an infinite loop waiting for the child processes to terminate, but it never gets to kill the ones spawned by cwltool.

@adamnovak and I suspected that this nodejs process is what stays behind, since `popen.poll()` is non-blocking. So we inspected the processes during this loop, which showed that the stdin was never closed:
```
root@ip-172-31-18-73:~/tests# ps -Al
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
...
0 T 0 160 17 14 80 0 - 121956 - pts/0 00:00:04 toil-cwl-runner
0 T 0 168 160 1 80 0 - 222947 - pts/0 00:00:00 nodejs
root@ip-172-31-18-73:~/tests# ps x | less
...
168 pts/0 Tl 0:00 nodejs --eval "use strict"; process.stdin.setEncoding("utf8"); var incoming = ""; var firstInput = true; var context = {}; process.stdin.on("data", function(chunk) { incoming += chunk; var i = incoming.indexOf("\n"); while (i > -1) { try{ var input = incoming.substr(0, i); incoming = incoming.substr(i+1); var fn = JSON.parse(input); if(firstInput){ context = require("vm").runInNewContext(fn, {}); } else{ process.stdout.write(JSON.stringify(require("vm").runInNewContext(fn, context)) + "\n"); } } catch(e){ console.error(e); } if(firstInput){ firstInput = false; } else{ /*strings to indicate the process has finished*/ console.log("r1cepzbhUTxtykz5XTC4"); console.error("r1cepzbhUTxtykz5XTC4"); } i = incoming.indexOf("\n"); } }); process.stdin.on("end", process.exit);
```
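To make the mechanics concrete, here is a minimal stand-alone sketch (not cwltool's code) that uses `cat` as a stand-in for the long-lived evaluator, since `cat`, like the `--eval` program above via `process.stdin.on("end", process.exit)`, only exits once its stdin reaches end-of-file:

```python
import subprocess
import time

# Stand-in for the long-lived evaluator: like the nodejs program above,
# `cat` only exits when its stdin reaches end-of-file.
child = subprocess.Popen(["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

print(child.poll())   # None -- poll() is non-blocking; the child is still alive
time.sleep(1)
print(child.poll())   # still None, and it stays None forever if stdin is never closed

child.stdin.close()   # EOF on stdin lets the child exit...
print(child.wait())   # ...and wait() reaps it, so no <defunct> entry lingers in `ps`
```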
After installing a local copy of `cwltool` and making the following changes here, the nodejs processes seem to exit properly for both `cwltool` and `toil-cwl-runner`:
```python
stdin_buf.close()
nodejs.stdin.close()
...
nodejs.wait()
```
I suspect the stdin can be automatically closed on some systems but not all. Please let me know if this fixes your issue as well.
On a separate note, we also realized that we might not want Toil to do this check, so we might remove this loop and use a better init process so it can properly reap the zombie children.
BTW, you might need to uninstall the `cwltool` installed by Toil before re-installing a local copy. Something like:
```sh
mkdir tests && cd tests

# set up a fresh virtual environment
virtualenv -p 3.8 venv && . venv/bin/activate
pip install -e git+https://github.com/DataBiosphere/toil.git@10c591b509d24d45cf58830e748e6a4e7f5e4f60#egg=toil[cwl]
pip uninstall -y cwltool

git clone https://github.com/common-workflow-language/cwltool && cd cwltool
git checkout 0209b0b7ce66f03c8498b5a686f8d31690a2acb3  # latest cwltool==3.1.20211107152837

# temp fix for cwltool
sed -i '294s/nodejs.poll()/nodejs.wait()/' cwltool/sandboxjs.py
sed -i '291i \ \ \ \ nodejs.stdin.close()' cwltool/sandboxjs.py
python setup.py install
cd ..

# run your cwl file with `toil-cwl-runner`
```
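If you want to double-check the result, a small Linux-only helper like the one below (hypothetical, not part of cwltool or Toil) can list the zombie children of a given process by scanning `/proc`; with the patch applied, it should come back empty once the workflow finishes:

```python
import os
from typing import List

def zombie_children(parent_pid: int = os.getpid()) -> List[int]:
    """Return PIDs of direct children of parent_pid that are in state 'Z' (Linux only)."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as stat_file:
                stat = stat_file.read()
        except OSError:
            continue  # the process disappeared while we were scanning
        # The comm field can contain spaces, so split after its closing ')'.
        state, ppid = stat.rsplit(")", 1)[1].split()[:2]
        if state == "Z" and int(ppid) == parent_pid:
            zombies.append(int(entry))
    return zombies
```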
> are you running this on a Toil cluster / linux environment by any chance?

We are running Toil across multiple servers with Mesos.
> I suspect the stdin can be automatically closed on some systems but not all. Please let me know if this fixes your issue as well.

Great! It works like a charm. We also noticed zombie children before, but it was not a real issue at that time.

In any case, thank you both very much for your help :+1:
Yeah, the problem is a weird interaction:

- `cwltool` lets its `nodejs` children become zombies because it doesn't `wait()` on them
- the container doesn't run a proper `init` replacement and never `wait()`s to reap zombies

Glad to hear it works @EtiennePer!
@adamnovak I have to test this, but I think passing `--init` to the `docker run` command would enable `tini`, which should be able to reap zombies?
@mr-c Should I also create an issue and a PR to `wait()` on the `nodejs` process here in cwltool? I'm pretty sure the intended behavior is to close the stdin and wait for the process to exit.
> @mr-c Should I also create an issue and a PR to `wait()` on the `nodejs` process here in cwltool? I'm pretty sure the intended behavior is to close the stdin and wait for the process to exit.
No. It is kept open on purpose; we re-use the `nodejs` process to save on start-up costs. However, they should have been killed when cwltool ends; that is the real bug. Maybe `toil_worker` exits don't trigger the `cwltool` nodejs process cleanup code?
@mr-c Ah, we did realize that the `_terminate_processes()` function is never called by Toil, which I think is the nodejs process cleanup code? But then I don't see how that function calls `process.wait()` and `process.kill()` to clean up the processes that are not docker runs.
> @mr-c Ah, we did realize that the `_terminate_processes()` function is never called by Toil, which I think is the nodejs process cleanup code?
We use `processes_to_kill`, where each nodejs subprocess is appended to it, both non-containerized and containerized.
So `toil-cwl-runner` and `_toil_worker` both need to call https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L103 from their shutdown routines, and also from an interrupt handler like https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L137, via https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L1460.
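For what it's worth, here is a rough sketch (illustrative only, not Toil's actual code) of how a runner could wire that cleanup into both normal shutdown and a SIGTERM handler, assuming the function is importable as `cwltool.main._terminate_processes` as referenced above:

```python
import atexit
import signal
import sys

try:
    from cwltool.main import _terminate_processes  # cwltool's cleanup for processes_to_kill
except ImportError:
    _terminate_processes = None  # cwltool may not be installed

def _cleanup() -> None:
    if _terminate_processes is not None:
        _terminate_processes()

# Normal interpreter exit (e.g. the worker finishing its jobs).
atexit.register(_cleanup)

# Interrupt path: clean up, then exit so context managers still unwind.
def _on_sigterm(signum, frame):
    _cleanup()
    sys.exit(1)

signal.signal(signal.SIGTERM, _on_sigterm)
```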
> But then I don't see how that function calls `process.wait()` and `process.kill()` to clean up the processes that are not docker runs.

Ah-ha, good point. We need to fix that in `cwltool`, yes.
Okay, in https://github.com/common-workflow-language/cwltool/pull/1577 I make `cwltool` better behaved about the child process, but only if `cwltool.main._terminate_processes()` is called; which `toil-cwl-runner` does not do.
@adamnovak or @w-gao I don't see how to hook in an additional handler, as `Toil.__enter__` sets up its own handler and discards the previous one. Can you adjust that behaviour, or add a place where I can pass in another function to call as well?
Yeah, it's probably not the best idea to discard the previous handler here. Would something like this work better?
```python
# Toil.__enter__:
user_handler = signal.getsignal(signal.SIGTERM)

def signal_handler(signum: int, frame: Any):
    try:
        if callable(user_handler):
            user_handler(signum, frame)
    finally:
        # This will make sure __exit__ gets called when we get a SIGTERM signal.
        sys.exit(1)

signal.signal(signal.SIGTERM, signal_handler)
```
This would only work if the user defines a handler before the `with Toil(options)` block. Or, perhaps we could make the Toil context manager accept a callback function to call when SIGTERM is received?
> This would only work if the user defines a handler before the `with Toil(options)` block. Or, perhaps we could make the Toil context manager accept a callback function to call when SIGTERM is received?

Why not both?
What I'm curious about is whether the user-provided handler will be called from `_toil_worker` processes as they exit.
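For the "why not both" option, a handler could both chain whatever was previously installed and accept an explicit callback. A rough sketch of that shape (hypothetical API and names, not Toil's current code):

```python
import signal
import sys
from typing import Any, Callable, Optional

def install_sigterm_handler(on_sigterm: Optional[Callable[[], None]] = None) -> None:
    previous = signal.getsignal(signal.SIGTERM)

    def handler(signum: int, frame: Any) -> None:
        try:
            if on_sigterm is not None:
                on_sigterm()                 # explicit callback, e.g. cwltool cleanup
            if callable(previous):
                previous(signum, frame)      # whatever handler was installed before
        finally:
            sys.exit(1)                      # still exit so __exit__ / finally blocks run

    signal.signal(signal.SIGTERM, handler)
```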
I did some testing and I don't think the signal handler in `Toil.__enter__` would override the handlers in the `_toil_worker` process. (I also started https://github.com/DataBiosphere/toil/compare/issues/3965-user-provided-exit-handler but I don't think it'll help much here.)

However, the problem is that a SIGTERM signal on the leader process might not propagate to its workers, so I don't know the best way to handle interrupts like this on the worker. I can look more into this after winter break.
I think `Toil.__enter__` only runs in the leader process, while cwltool JavaScript is generally going to run in the worker, right?
I don't think we need to make a special effort in the leader to reap child processes when we are killed. And trying to reap children when the worker is killed is really just a progressive enhancement.
I think what we really need is just a `finally` on this try that will call `cwltool`'s cleanup function after the worker runs all its jobs, if `cwltool` is installed.
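A rough sketch of that idea (illustrative only; the real change would live in Toil's worker code, and the function name here is made up):

```python
def run_worker_jobs() -> None:
    try:
        ...  # run all the jobs assigned to this worker
    finally:
        try:
            from cwltool.main import _terminate_processes
        except ImportError:
            pass  # cwltool is not installed; nothing to clean up
        else:
            _terminate_processes()  # reap any leftover nodejs evaluators
```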
If we want the worker to make an attempt to reap child processes when it itself is killed, we could try using the `daemon` module, which has a system to run a cleanup function at exit, on a signal, and at the end of a block. And if we want the leader to do it, we can add the `cwltool` cleanup call to `atexit` somewhere in the leader.
Making `Toil.__enter__` add its signal handler to a chain instead of just replacing what's there makes sense, since people might not expect it to do that. Making it take a user signal handler itself might be extra machinery we don't really need.
Hello there, we are facing an issue with a workflow that is using this type of CWL file:
The problem appears only when I'm using `InlineJavascriptRequirement` with this kind of JavaScript expression:
I also tried another notation, but the exact same issue appears:
The actual task finishes correctly (with a `return_code = 0`), the input File generated by the JavaScript expression is properly set, and the file is retrieved. BUT the job never finishes properly and loops over this message infinitely:
```
[2021-12-16T14:48:28+0000] [Thread-2 ] [W] [toil.batchSystems.singleMachine] Sent redundant job completion kill to surviving process group 615 known to batch system 140180515079744
```
Important note:

```yaml
valueFrom: $(self[0])
```
I have not found any similar issue so far, and I don't know what to try in order to fix it. Any help would be really appreciated.
Thanks a lot in advance :) , Etienne