Hello @EtiennePer and thanks for your report. Can you reproduce your problem with `cwltool`, and if so, which version?

Does your system have `nodejs` installed? If so, what version? Or do you use the docker/singularity container fallback? Then please let us know your docker engine or Singularity version.
Hello @mr-c ,
Thanks for your quick answer.
I didn't take the time to reproduce the case directly with cwltool, but here are the versions we are using:

We are not using any container fallback.
@EtiennePer are you running this on a Toil cluster / linux environment by any chance? I was able to reproduce this on a Toil cluster, where `cwltool` doesn't seem to terminate the nodejs processes that it spawns (at least on a Toil cluster).

For example, running this test workflow with `cwltool` works, but 2 defunct nodejs processes are left behind:
```yaml
# test.cwl
class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
cwlVersion: v1.2
inputs: []
outputs:
  output: int
expression: "$({'output': 1})"
```

```
# test.json
{}
```
```
root@ip-172-31-18-73:~/tests# ps -Al
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 0 80 0 - 211726 - ? 00:00:00 mesos-master
4 S 0 17 0 0 80 0 - 4574 - pts/0 00:00:00 bash
0 Z 0 45 1 9 80 0 - 0 - pts/0 00:00:00 nodejs <defunct>
0 Z 0 59 1 2 80 0 - 0 - pts/0 00:00:00 nodejs <defunct>
0 R 0 67 17 0 80 0 - 6495 - pts/0 00:00:00 ps
```
With `toil-cwl-runner`, though, Toil wants to clean up all its children, so it sits in an infinite loop waiting for the child processes to terminate, but it never gets to kill the ones spawned by cwltool.

@adamnovak and I suspected that this nodejs process is what stays behind, since `popen.poll()` is non-blocking. So we inspected the processes during this loop, which showed that the stdin was never closed:
```
root@ip-172-31-18-73:~/tests# ps -Al
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
...
0 T 0 160 17 14 80 0 - 121956 - pts/0 00:00:04 toil-cwl-runner
0 T 0 168 160 1 80 0 - 222947 - pts/0 00:00:00 nodejs
root@ip-172-31-18-73:~/tests# ps x | less
...
168 pts/0 Tl 0:00 nodejs --eval "use strict"; process.stdin.setEncoding("utf8"); var incoming = ""; var firstInput = true; var context = {}; process.stdin.on("data", function(chunk) { incoming += chunk; var i = incoming.indexOf("\n"); while (i > -1) { try{ var input = incoming.substr(0, i); incoming = incoming.substr(i+1); var fn = JSON.parse(input); if(firstInput){ context = require("vm").runInNewContext(fn, {}); } else{ process.stdout.write(JSON.stringify(require("vm").runInNewContext(fn, context)) + "\n"); } } catch(e){ console.error(e); } if(firstInput){ firstInput = false; } else{ /*strings to indicate the process has finished*/ console.log("r1cepzbhUTxtykz5XTC4"); console.error("r1cepzbhUTxtykz5XTC4"); } i = incoming.indexOf("\n"); } }); process.stdin.on("end", process.exit);
```
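To make the mechanics concrete, here is a minimal stand-alone sketch (not cwltool's code) that uses `cat` as a stand-in for the long-lived evaluator, since `cat`, like the `--eval` program above via `process.stdin.on("end", process.exit)`, only exits once its stdin reaches end-of-file:

```python
import subprocess
import time

# Stand-in for the long-lived evaluator: like the nodejs program above,
# `cat` only exits when its stdin reaches end-of-file.
child = subprocess.Popen(["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

print(child.poll())   # None -- poll() is non-blocking; the child is still alive
time.sleep(1)
print(child.poll())   # still None, and it stays None forever if stdin is never closed

child.stdin.close()   # EOF on stdin lets the child exit...
print(child.wait())   # ...and wait() reaps it, so no <defunct> entry lingers in `ps`
```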
After installing a local copy of `cwltool` and making the following changes here, the nodejs processes seem to exit properly for both `cwltool` and `toil-cwl-runner`:
```python
stdin_buf.close()
nodejs.stdin.close()
...
nodejs.wait()
```
I suspect the stdin can be automatically closed on some systems but not all. Please let me know if this fixes your issue as well.
On a separate note, we also realized that we might not want Toil to do this check, so we might remove this loop and use a better init process so it can properly reap the zombie children.
BTW, you might need to uninstall the `cwltool` installed by Toil before re-installing a local copy. Something like:
```sh
mkdir tests && cd tests

# set up a fresh virtual environment
virtualenv -p 3.8 venv && . venv/bin/activate
pip install -e git+https://github.com/DataBiosphere/toil.git@10c591b509d24d45cf58830e748e6a4e7f5e4f60#egg=toil[cwl]
pip uninstall -y cwltool

git clone https://github.com/common-workflow-language/cwltool && cd cwltool
git checkout 0209b0b7ce66f03c8498b5a686f8d31690a2acb3  # latest cwltool==3.1.20211107152837

# temp fix for cwltool
sed -i '294s/nodejs.poll()/nodejs.wait()/' cwltool/sandboxjs.py
sed -i '291i \ \ \ \ nodejs.stdin.close()' cwltool/sandboxjs.py
python setup.py install
cd ..

# run your cwl file with `toil-cwl-runner`
```
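If you want to double-check the result, a small Linux-only helper like the one below (hypothetical, not part of cwltool or Toil) can list the zombie children of a given process by scanning `/proc`; with the patch applied, it should come back empty once the workflow finishes:

```python
import os
from typing import List

def zombie_children(parent_pid: int = os.getpid()) -> List[int]:
    """Return PIDs of direct children of parent_pid that are in state 'Z' (Linux only)."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as stat_file:
                stat = stat_file.read()
        except OSError:
            continue  # the process disappeared while we were scanning
        # The comm field can contain spaces, so split after its closing ')'.
        state, ppid = stat.rsplit(")", 1)[1].split()[:2]
        if state == "Z" and int(ppid) == parent_pid:
            zombies.append(int(entry))
    return zombies
```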
> are you running this on a Toil cluster / linux environment by any chance?

We are running Toil across multiple servers with Mesos.
> I suspect the stdin can be automatically closed on some systems but not all. Please let me know if this fixes your issue as well.

Great! It works like a charm. We also noticed zombie children before, but it was not a real issue at that time.

In any case, thank you both very much for your help :+1:
Yeah, the problem is a weird interaction:

- `cwltool` lets its `nodejs` children become zombies because it doesn't `wait()` on them
- the container doesn't run a proper `init` replacement and never `wait()`s to reap zombies

Glad to hear it works @EtiennePer!
@adamnovak I have to test this, but I think passing `--init` to the `docker run` command would enable `tini`, which should be able to reap zombies?
@mr-c Should I also create an issue and a PR to `wait()` on the `nodejs` process here in cwltool? I'm pretty sure the intended behavior is to close the stdin and wait for the process to exit.
> @mr-c Should I also create an issue and a PR to `wait()` on the `nodejs` process here in cwltool? I'm pretty sure the intended behavior is to close the stdin and wait for the process to exit.
No. It is kept open on purpose; we re-use the `nodejs` process to save on start-up costs. However, they should have been killed when cwltool ends; that is the real bug. Maybe `toil_worker` exits don't trigger the `cwltool` nodejs process cleanup code?
@mr-c Ah, we did realize that the `_terminate_processes()` function is never called by Toil, which I think is the nodejs process cleanup code? But then I don't see how that function calls `process.wait()` and `process.kill()` to clean up the processes that are not docker runs.
> @mr-c Ah, we did realize that the `_terminate_processes()` function is never called by Toil, which I think is the nodejs process cleanup code?
We use `processes_to_kill`, where each nodejs subprocess is appended to it, both non-containerized and containerized.
So `toil-cwl-runner` and `_toil_worker` both need to call https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L103 from their shutdown routines, and also from an interrupt handler like https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L137, via https://github.com/common-workflow-language/cwltool/blob/041cc0eb8f0272846e5b3e685fe51367f3ef93a6/cwltool/main.py#L1460.
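For what it's worth, here is a rough sketch (illustrative only, not Toil's actual code) of how a runner could wire that cleanup into both normal shutdown and a SIGTERM handler, assuming the function is importable as `cwltool.main._terminate_processes` as referenced above:

```python
import atexit
import signal
import sys

try:
    from cwltool.main import _terminate_processes  # cwltool's cleanup for processes_to_kill
except ImportError:
    _terminate_processes = None  # cwltool may not be installed

def _cleanup() -> None:
    if _terminate_processes is not None:
        _terminate_processes()

# Normal interpreter exit (e.g. the worker finishing its jobs).
atexit.register(_cleanup)

# Interrupt path: clean up, then exit so context managers still unwind.
def _on_sigterm(signum, frame):
    _cleanup()
    sys.exit(1)

signal.signal(signal.SIGTERM, _on_sigterm)
```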
> But then I don't see how that function calls `process.wait()` and `process.kill()` to clean up the processes that are not docker runs.

Ah-ha, good point. We need to fix that in `cwltool`, yes.
Okay, in https://github.com/common-workflow-language/cwltool/pull/1577 I make `cwltool` better behaved about the child process, but only if `cwltool.main._terminate_processes()` is called; which `toil-cwl-runner` does not do.
@adamnovak or @w-gao I don't see how to hook in an additional handler, as `Toil.__enter__` sets up its own handler and discards the previous one. Can you adjust that behaviour, or add a place where I can pass in another function to call as well?
Yeah, it's probably not the best idea to discard the previous handler here. Would something like this work better?
```python
# Toil.__enter__:
user_handler = signal.getsignal(signal.SIGTERM)

def signal_handler(signum: int, frame: Any):
    try:
        if callable(user_handler):
            user_handler(signum, frame)
    finally:
        # This will make sure __exit__ gets called when we get a SIGTERM signal.
        sys.exit(1)

signal.signal(signal.SIGTERM, signal_handler)
```
This would only work if the user defines a handler before the `with Toil(options)` block. Or, perhaps we could make the Toil context manager accept a callback function to call when SIGTERM is received?
> This would only work if the user defines a handler before the `with Toil(options)` block. Or, perhaps we could make the Toil context manager accept a callback function to call when SIGTERM is received?

Why not both?
What I'm curious about is whether the user-provided handler will be called from `_toil_worker` processes as they exit.
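For the "why not both" option, a handler could both chain whatever was previously installed and accept an explicit callback. A rough sketch of that shape (hypothetical API and names, not Toil's current code):

```python
import signal
import sys
from typing import Any, Callable, Optional

def install_sigterm_handler(on_sigterm: Optional[Callable[[], None]] = None) -> None:
    previous = signal.getsignal(signal.SIGTERM)

    def handler(signum: int, frame: Any) -> None:
        try:
            if on_sigterm is not None:
                on_sigterm()                 # explicit callback, e.g. cwltool cleanup
            if callable(previous):
                previous(signum, frame)      # whatever handler was installed before
        finally:
            sys.exit(1)                      # still exit so __exit__ / finally blocks run

    signal.signal(signal.SIGTERM, handler)
```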
I did some testing and I don't think the signal handler in `Toil.__enter__` would override the handlers in the `_toil_worker` process. (I also started https://github.com/DataBiosphere/toil/compare/issues/3965-user-provided-exit-handler but I don't think it'll help much here.)

However, the problem is that a SIGTERM signal on the leader process might not propagate to its workers, so I don't know the best way to handle interrupts like this on the worker. I can look more into this after winter break.
I think `Toil.__enter__` only runs in the leader process, while cwltool JavaScript is generally going to run in the worker, right?
I don't think we need to make a special effort in the leader to reap child processes when we are killed. And trying to reap children when the worker is killed is really just a progressive enhancement.
I think what we really need is just a `finally` on this try that will call `cwltool`'s cleanup function after the worker runs all its jobs, if `cwltool` is installed.
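A rough sketch of that idea (illustrative only; the real change would live in Toil's worker code, and the function name here is made up):

```python
def run_worker_jobs() -> None:
    try:
        ...  # run all the jobs assigned to this worker
    finally:
        try:
            from cwltool.main import _terminate_processes
        except ImportError:
            pass  # cwltool is not installed; nothing to clean up
        else:
            _terminate_processes()  # reap any leftover nodejs evaluators
```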
If we want the worker to make an attempt to reap child processes when it itself is killed, we could try using the `daemon` module, which has a system to run a cleanup function at exit, on a signal, and at the end of a block. And if we want the leader to do it, we can add the `cwltool` cleanup call to `atexit` somewhere in the leader.
Making `Toil.__enter__` add its signal handler to a chain instead of just replacing what's there makes sense, since people might not expect it to do that. Making it take a user signal handler itself might be extra machinery we don't really need.
Hello there, we are facing an issue with a workflow that is using this type of CWL file:
The problem appears only when I'm using `InlineJavascriptRequirement` with this kind of JavaScript expression:
I also tried another notation, but the exact same issue appears:
The actual task finishes correctly (with a `return_code = 0`), the input File generated by the JavaScript expression is properly set, and the file is retrieved. BUT the job never finishes properly and loops over this message infinitely:
```
[2021-12-16T14:48:28+0000] [Thread-2 ] [W] [toil.batchSystems.singleMachine] Sent redundant job completion kill to surviving process group 615 known to batch system 140180515079744
```
Important note:

```yaml
valueFrom: $(self[0])
```
I have not found any similar issue so far, and I don't know what to try in order to fix it. Any help would be really appreciated.
Thanks a lot in advance :) , Etienne