Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

`RecursionError: maximum recursion depth exceeded` on long chains of fast-completing tasks #3472

Closed benclifford closed 3 months ago

benclifford commented 3 months ago

Describe the bug

When there are long chains of fast-completing tasks, errors like this can happen, followed by a hang:

1717587994.268657 2024-06-05 11:46:34 MainProcess-318 MainThread-123941245904704 parsl.dataflow.dflow:690 launch_if_ready ERROR: add_done_callback got an exception which will be ignored
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 655, in launch_if_ready
    exec_fu = self.launch_task(task_record)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 756, in launch_task
    self._log_std_streams(task_record)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 1447, in _log_std_streams
    log_std_stream("Standard out", task_record['app_fu'].stdout)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 1433, in log_std_stream
    logger.info(f"{name} for task {tid} will not be redirected.")
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1539, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1684, in _log
    self.handle(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1700, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1762, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1028, in handle
    self.emit(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1280, in emit
    StreamHandler.emit(self, record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1160, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 999, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 706, in format
    s = self.formatMessage(record)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 675, in formatMessage
    return self._style.format(record)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 464, in format
    return self._format(record)
           ^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

As far as I can see, this happens because dependency callbacks mediated by concurrent.futures.Future can be invoked in two ways, and the traceback above arises only in the rarer of the two cases:

  1. when Future.set_result is called on a Future that has callbacks registered, those callbacks are invoked inside set_result;
  2. when a callback is added with add_done_callback to a Future that already has a result, that callback runs immediately inside add_done_callback.
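
Both invocation paths can be demonstrated with plain concurrent.futures.Future, independent of Parsl (a minimal sketch):

```python
from concurrent.futures import Future

calls = []

# Path 1: callback registered first; set_result invokes it inline,
# on the stack of whoever called set_result.
f1 = Future()
f1.add_done_callback(lambda fut: calls.append("inside set_result"))
f1.set_result(1)

# Path 2: future already resolved; add_done_callback invokes the
# callback immediately, on the stack of whoever added it.
f2 = Future()
f2.set_result(2)
f2.add_done_callback(lambda fut: calls.append("inside add_done_callback"))

print(calls)  # both callbacks ran synchronously on the caller's stack
```

Either way, the callback borrows the caller's stack rather than getting a fresh one.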

In Parsl, those callbacks can cause tasks to begin executing. It is possible for tasks to complete very quickly: one example is failed tasks, where a task completes as soon as a dependency fails, but I have also seen it happen with ThreadPoolExecutor and very short tasks (for example, extracting a single data structure element using []).

What I think is happening here is that when those tasks complete fast, the callback chains all happen recursively inside each other (with Parsl task launch functions in between), so the Python call stack is consumed in proportion to the dependency depth.
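
That stack growth can be measured with plain Futures, no Parsl involved. In this sketch each callback synchronously completes the next future in the chain, and the recorded stack depth grows with every link:

```python
import sys
from concurrent.futures import Future

depths = []

def stack_depth():
    # count the frames currently on the Python call stack
    d, frame = 0, sys._getframe()
    while frame is not None:
        d += 1
        frame = frame.f_back
    return d

n = 50  # stay well under the recursion limit for this demonstration
head = Future()
tail = head
for _ in range(n):
    nxt = Future()
    def complete_next(f, nxt=nxt):
        depths.append(stack_depth())
        nxt.set_result(f.result())  # fires the next callback recursively
    tail.add_done_callback(complete_next)
    tail = nxt

head.set_result(None)          # runs the whole chain on this one stack
print(depths[-1] - depths[0])  # stack depth grew with every chain link
```

Make the chain deeper than the recursion limit and the innermost callback hits RecursionError, which Future._invoke_callbacks swallows and logs, leaving the downstream futures forever incomplete: an error followed by a hang, matching the report above.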

If so, I think making these callbacks happen near the top of a fresh stack (for example, as new concurrent.futures.ThreadPoolExecutor tasks, rather than directly inside a callback) would avoid this problem.
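
A hypothetical sketch of that idea (the add_trampolined_callback helper is invented for illustration, not Parsl API): instead of invoking the downstream work directly inside the callback, submit it to a worker pool so each link of the chain starts near the top of a fresh stack:

```python
import sys
from concurrent.futures import Future, ThreadPoolExecutor

# Trampoline: run each done-callback as a new pool task, so the
# call stack does not grow with the dependency depth.
_callback_pool = ThreadPoolExecutor(max_workers=1)

def add_trampolined_callback(fut, fn):
    fut.add_done_callback(lambda f: _callback_pool.submit(fn, f))

# Build a callback chain far deeper than the recursion limit.
n = sys.getrecursionlimit() * 2
head = Future()
tail = head
for _ in range(n):
    nxt = Future()
    add_trampolined_callback(tail, lambda f, nxt=nxt: nxt.set_result(f.result()))
    tail = nxt

head.set_result(42)
print(tail.result())  # completes without RecursionError
```

Each callback now only submits a pool task and returns, so its stack frame unwinds before the next link runs.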

To Reproduce

Here is a test program. This will generally pass for n < 140 and fail for larger values. Using sys.setrecursionlimit to increase the Python maximum recursion depth allows n to be made larger (and that is a workaround in some situations for this bug).

from concurrent.futures import Future

import parsl

from parsl.executors.base import ParslExecutor

from typing import Any, Callable, Dict

class ImmediateExecutor(ParslExecutor):
    def start(self):
        pass

    def shutdown(self):
        pass

    def submit(self, func: Callable, resource_specification: Dict[str, Any], *args: Any, **kwargs: Any) -> Future:
        fut = Future()
        res = func(*args, **kwargs)
        fut.set_result(res)
        return fut

fut = Future()

n = 284

here = fut

@parsl.python_app
def chain(upstream):
    pass

with parsl.load(parsl.Config(executors=[ImmediateExecutor()])):
    for _ in range(n):
        here = chain(here)

    print("setting result")
    # fut.set_exception(RuntimeError())
    fut.set_result(None)

    print("waiting for final")

    print(here.result())

print("exited context manager")
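
For reference, the sys.setrecursionlimit workaround mentioned above is just the following, applied before running the workflow (the limit value here is arbitrary):

```python
import sys

# Workaround, not a fix: give the interpreter more stack headroom so
# deeper dependency chains fit (the default limit is usually 1000).
sys.setrecursionlimit(10_000)
```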

Expected behavior

Much larger possible depth of dependency chains.

Environment parsl master bf98e5049c34eaf1400663404e29672b85d3ac5e

benclifford commented 3 months ago

added https://github.com/Parsl/parsl/labels/safe-exit because this results in a hang rather than a failure and exit