common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0
324 stars 225 forks source link

Evaluation expressions that timeout can raise JSONDecodeError #1999

Open brettcs opened 1 month ago

brettcs commented 1 month ago

Expected Behavior

When an evaluation expression exceeds the timeout, the resulting error should be clear that is the cause.

Actual Behavior

At least some of the time, this situation raises a JSONDecodeError exception. If the user doesn't realize the timeout might be the issue, this can cause them to lose a lot of time trying to debug an issue in the expression itself where none exists.

This is sort of a sequel issue to #1680.

Workflow Code

eval-timeout.zip

Full Traceback

This issue was noticed with a "real" expression that similarly built a large array of an array of files, and hit the default evaluation timeout. However, you can reproduce the issue using the attached minimal reproduction and a small evaluation timeout:

% venv/bin/cwltool --debug --eval-timeout=.1 wf.cwl array-in.yml
INFO venv/bin/cwltool 3.1.20240404144621
INFO Resolved 'wf.cwl' to 'file:///home/brett/Scratch/wf.cwl'
DEBUG [workflow ] initialized from file:///home/brett/Scratch/wf.cwl
INFO [workflow ] start
DEBUG [workflow ] inputs {}
INFO [workflow ] starting step build_array
DEBUG [step build_array] job input {}
DEBUG [step build_array] evaluated job input to {}
INFO [step build_array] start
Expecting value: line 1 column 1 (char 0)
script was:
01 "use strict";
02 var inputs = {};
03 var self = null;
04 var runtime = {
05     "cores": 1,
06     "ram": 256,
07     "tmpdirSize": 1024,
08     "outdirSize": 1024,
09     "tmpdir": null,
10     "outdir": null
11 };
12 (function(){
13  var result = [];
14  for (var i=0; i<100000; i++) {
15    var arr = [];
16    result.push(arr);
17    for (var j=0; j<100000; j++) {
18      var num = i * 100000 + j;
19      arr.push({
20        class: "File",
21        location: "file:///tmp/array.cwl/file" + num.toString() + ".fq",
22        format: "http://edamontology.org/format_1930"
23      });
24    }
25  }
26  return {"out_arr": result};
27 })()
stdout was: ''
stderr was: ''
Traceback (most recent call last):
  File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/sandboxjs.py", line 512, in eval
    return cast(CWLOutputType, json.loads(stdout))
                               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

[Note: Everything from here on is basically propagation of the same root cause exception, but here it is for completeness:]

Traceback (most recent call last):
  File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/expression.py", line 315, in do_eval
    return interpolate(
           ^^^^^^^^^^^^
  File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/expression.py", line 215, in interpolate
    e = evaluator(
        ^^^^^^^^^^
  File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/expression.py", line 161, in evaluator
    return cast(CWLOutputType, js_engine.eval(ex, jslib, **kwargs))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brett/Scratch/venv/lib/python3.12/site-packages/cwl_utils/sandboxjs.py", line 514, in eval
    raise JavascriptException(
cwl_utils.errors.JavascriptException: Expecting value: line 1 column 1 (char 0)
script was:
01 "use strict";
02 var inputs = {};
03 var self = null;
04 var runtime = {
05     "cores": 1,
06     "ram": 256,
07     "tmpdirSize": 1024,
08     "outdirSize": 1024,
09     "tmpdir": null,
10     "outdir": null
11 };
12 (function(){
13  var result = [];
14  for (var i=0; i<100000; i++) {
15    var arr = [];
16    result.push(arr);
17    for (var j=0; j<100000; j++) {
18      var num = i * 100000 + j;
19      arr.push({
20        class: "File",
21        location: "file:///tmp/array.cwl/file" + num.toString() + ".fq",
22        format: "http://edamontology.org/format_1930"
23      });
24    }
25  }
26  return {"out_arr": result};
27 })()
stdout was: ''
stderr was: ''

ERROR Got workflow error
[error messages and tracebacks repeat from here]

Your Environment

Diagnosis

This occurs because of a few related issues in the way that the expression evaluation timeout is implemented:

In general, there's a lot of code in exec_js_process that seems like it could be handled by the subprocess module directly, following the example code documented with Popen.communicate. A rough sketch would be something like:

def exec_js_process(
    self,
    js_text: str,
    timeout: float = default_timeout,
    ...,
) -> Tuple[int, str, str]:
    ...
    with subprocess.Popen(
            ["nodejs", "--eval", js_text],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            encoding='utf-8',
    ) as nodejs:
        try:
            stdoutdata, stderrdata = nodejs.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            nodejs.kill()
            stdoutdata, stderrdata = nodejs.communicate()
            returncode = -1
        else:
            returncode = nodejs.returncode
    return returncode, stdoutdata, stderrdata

This would still need code at the top to find the name of the Node command, handle other arguments, etc. But I think this could replace all the process and timeout management code that currently lives in the method, while addressing the bugs above.