Describe the bug

When there are long chains of fast-completing tasks, errors like this can happen, followed by a hang:
1717587994.268657 2024-06-05 11:46:34 MainProcess-318 MainThread-123941245904704 parsl.dataflow.dflow:690 launch_if_ready ERROR: add_done_callback got an exception which will be ignored
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 655, in launch_if_ready
    exec_fu = self.launch_task(task_record)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 756, in launch_task
    self._log_std_streams(task_record)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 1447, in _log_std_streams
    log_std_stream("Standard out", task_record['app_fu'].stdout)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 1433, in log_std_stream
    logger.info(f"{name} for task {tid} will not be redirected.")
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1539, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1684, in _log
    self.handle(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1700, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1762, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1028, in handle
    self.emit(record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1280, in emit
    StreamHandler.emit(self, record)
  File "/usr/local/lib/python3.12/logging/__init__.py", line 1160, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 999, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 706, in format
    s = self.formatMessage(record)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 675, in formatMessage
    return self._style.format(record)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/logging/__init__.py", line 464, in format
    return self._format(record)
           ^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded
As far as I can see, this happens because dependency callbacks, mediated by concurrent.futures.Future, can be invoked in two ways, and the above happens only in the rarer of the two cases:

- when Future.set_result is called on a Future that already has callbacks attached, those callbacks are invoked inside set_result;
- when a callback is added using add_done_callback to a Future which already has a result, that callback runs immediately inside add_done_callback.
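Both invocation paths can be seen with plain concurrent.futures, without Parsl involved (a standalone illustration, not Parsl code):

```python
from concurrent.futures import Future

order = []

# Case 1: callback attached before the result exists;
# it is invoked synchronously inside set_result.
f1 = Future()
f1.add_done_callback(lambda fut: order.append("inside set_result"))
f1.set_result(1)

# Case 2: the Future already has a result; the callback runs
# immediately, inside add_done_callback itself.
f2 = Future()
f2.set_result(2)
f2.add_done_callback(lambda fut: order.append("inside add_done_callback"))

# Both callbacks have already run synchronously by this point.
print(order)
```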
In Parsl, those callbacks can cause tasks to begin executing. It is possible for tasks to complete very fast: one example is failed tasks, where a task completes immediately with a dependency exception, but I have also seen it happen with ThreadPoolExecutor running very short tasks (for example, extracting a single data structure element using []).
What I think is happening here is that when those tasks complete fast, the callback chains all happen recursively inside each other (with Parsl task launch functions in between each one), so that Python call stack usage grows in proportion to the dependency depth.
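This nesting can be demonstrated with a chain of plain Futures, where each callback completes the next Future; measuring the call-stack depth inside each callback shows the stack growing with chain position (illustrative only; stack_depth is a helper written for this example):

```python
import sys
from concurrent.futures import Future


def stack_depth():
    # Count frames currently on the Python call stack.
    depth = 0
    frame = sys._getframe()
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth


depths = []

head = Future()
prev = head
for _ in range(50):
    nxt = Future()

    def complete_next(fut, nxt=nxt):
        depths.append(stack_depth())
        nxt.set_result(None)  # triggers the next callback, nested deeper

    prev.add_done_callback(complete_next)
    prev = nxt

# The whole chain completes recursively inside this single call:
# set_result -> callback -> set_result -> callback -> ...
head.set_result(None)
print(depths[0], depths[-1])  # the last callback runs far deeper than the first
```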
If so, I think making these callbacks happen near the top of a fresh stack (for example, as new concurrent.futures.ThreadPoolExecutor tasks, rather than directly inside a callback) would avoid this problem.
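A minimal sketch of that idea, using a hypothetical add_deferred_callback helper (not an existing Parsl API): each completion is handed off to a ThreadPoolExecutor, so the frame that enqueues the callback unwinds before the callback runs, and chain depth no longer accumulates on one stack:

```python
from concurrent.futures import Future, ThreadPoolExecutor


def add_deferred_callback(future, callback, pool):
    # Hypothetical helper: instead of running `callback` inside
    # set_result/add_done_callback, only enqueue it on `pool`,
    # so it runs near the top of a fresh stack.
    future.add_done_callback(lambda fut: pool.submit(callback, fut))


pool = ThreadPoolExecutor(max_workers=1)

head = Future()
prev = head
for _ in range(2000):  # far deeper than the default recursion limit
    nxt = Future()
    add_deferred_callback(prev, lambda fut, nxt=nxt: nxt.set_result(None), pool)
    prev = nxt

head.set_result(None)          # only enqueues the first callback, then returns
prev.result(timeout=30)        # chain completes without a RecursionError
pool.shutdown()
```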
To Reproduce
Here is a test program. This will generally pass for n < 140 and fail for larger values. Using sys.setrecursionlimit to increase the Python maximum recursion depth allows n to be made larger (and that is a workaround in some situations for this bug).
from concurrent.futures import Future
from typing import Any, Callable, Dict

import parsl
from parsl.executors.base import ParslExecutor


class ImmediateExecutor(ParslExecutor):
    def start(self):
        pass

    def shutdown(self):
        pass

    def submit(self, func: Callable, resource_specification: Dict[str, Any], *args: Any, **kwargs: Any) -> Future:
        fut = Future()
        res = func(*args, **kwargs)
        fut.set_result(res)
        return fut


fut = Future()
n = 284
here = fut


@parsl.python_app
def chain(upstream):
    pass


with parsl.load(parsl.Config(executors=[ImmediateExecutor()])):
    for _ in range(n):
        here = chain(here)

    print("setting result")
    # fut.set_exception(RuntimeError())
    fut.set_result(None)
    print("waiting for final")
    print(here.result())

print("exited context manager")
Expected behavior
A much bigger potential depth of dependencies should work, without hitting the recursion limit.
Environment

parsl master, commit bf98e5049c34eaf1400663404e29672b85d3ac5e