Memray's test suite fails due to a deadlock with Textual 0.67.1

godlygeek commented 2 weeks ago

With Textual 0.67.1, Memray's test suite fails attempting a subprocess.run() of the memray tree subcommand, which runs a Textual TUI. For some reason this is only failing on macOS, and I don't have a mac, so I'm not able to debug this anywhere near as well as I could debug #4634, but @pablogsal has told me it reproduces locally on his. If you've got a mac to try it out on, these steps should reproduce it:

git clone https://github.com/bloomberg/memray.git
python3 -m venv memray-venv
source memray-venv/bin/activate
cd memray
pip install -r requirements-test.txt
pip install memray
pytest tests -v tests/integration/test_main.py -v -k tree -x

When the test suite runs memray tree in a subprocess, it winds up failing with an exception like:

  File "/Users/pgalindo3/github/memray/src/memray/__main__.py", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/pgalindo3/github/memray/src/memray/commands/__init__.py", line 138, in main
    arg_values.entrypoint(arg_values, parser)
  File "/Users/pgalindo3/github/memray/src/memray/commands/tree.py", line 85, in run
    reporter.render()
  File "/Users/pgalindo3/github/memray/src/memray/reporters/tree.py", line 505, in render
    self.get_app().run()
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/site-packages/textual/app.py", line 1620, in run
    asyncio.run(run_app())
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/site-packages/textual/app.py", line 1606, in run_app
    await self.run_async(
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/site-packages/textual/app.py", line 1572, in run_async
    await app._shutdown()
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/site-packages/textual/app.py", line 2788, in _shutdown
    await self._close_all()
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/site-packages/textual/app.py", line 2768, in _close_all
    await self._prune_node(stack_screen)
  File "/Users/pgalindo3/.pyenv/versions/3.11.6/lib/python3.11/site-packages/textual/app.py", line 3418, in _prune_node
    raise asyncio.TimeoutError(
TimeoutError: Timeout waiting for [ToastRack(id='textual-toastrack'), Tooltip(id='textual-tooltip'), Horizontal(), Footer()] to close; possible deadlock

You should be able to reproduce this outside of the test suite by generating a Memray capture file:

memray run -fo output.bin -c "import mmap; mmap.mmap(-1, 1024 * 1024)"

And then running the tree reporter on that capture file in a subprocess:

python -c 'import sys, subprocess; subprocess.check_output([sys.executable, "-m", "memray", "tree", "output.bin"], stderr=subprocess.DEVNULL, text=True, input="q")'

pablogsal commented 2 weeks ago

And then running the tree reporter on that capture file in a subprocess:

You may need to execute this command a couple of times until it reproduces (it looks like some kind of race).

godlygeek commented 2 weeks ago

Ah! It's now reproducing for me on Linux. I swear it wasn't before, but thank heavens for small favors. Previously I tried reproducing it with an existing virtualenv, but it reproduced after creating a new one. pip list shows:

Package                 Version
----------------------- -------
asttokens               2.4.1
coverage                7.5.3
Cython                  3.0.10
decorator               5.1.1
executing               2.0.1
greenlet                3.0.3
iniconfig               2.0.0
ipython                 8.25.0
jedi                    0.19.1
Jinja2                  3.1.4
linkify-it-py           2.0.3
markdown-it-py          3.0.0
MarkupSafe              2.1.5
matplotlib-inline       0.1.7
mdit-py-plugins         0.4.1
mdurl                   0.1.2
memray                  1.12.0
packaging               24.0
parso                   0.8.4
pexpect                 4.9.0
pip                     24.0
pkgconfig               1.5.5
pluggy                  1.5.0
prompt_toolkit          3.0.46
ptyprocess              0.7.0
pure-eval               0.2.2
Pygments                2.18.0
pytest                  8.2.2
pytest-cov              5.0.0
pytest-textual-snapshot 0.4.0
rich                    13.7.1
setuptools              65.5.0
six                     1.16.0
stack-data              0.6.3
syrupy                  4.6.1
textual                 0.67.1
traitlets               5.14.3
typing_extensions       4.12.1
uc-micro-py             1.0.3
wcwidth                 0.2.13

godlygeek commented 2 weeks ago

Adding some debugging to the rlock acquires:

diff --git a/src/textual/rlock.py b/src/textual/rlock.py
index d7a6af2d5..e1acce7fc 100644
--- a/src/textual/rlock.py
+++ b/src/textual/rlock.py
@@ -1,6 +1,10 @@
 from __future__ import annotations

 from asyncio import Lock, Task, current_task
+import datetime
+import traceback
+
+textual_log = open("/tmp/textual.log", "w")

 class RLock:
@@ -16,9 +20,12 @@ class RLock:
         task = current_task()
         assert task is not None
         if self._owner is None or self._owner is not task:
+            print(f"\n{datetime.datetime.now().isoformat()} Acquiring {self} at:", file=textual_log)
+            traceback.print_stack(limit=5, file=textual_log)
             await self._lock.acquire()
             self._owner = task
         self._count += 1
+        print(f"\n{datetime.datetime.now().isoformat()} Lock {self} acquired with count={self._count}", file=textual_log)

     def release(self) -> None:
         """Release a previously acquired lock."""
@@ -28,6 +35,7 @@ class RLock:
         if self._count < 0:
             # Should not occur if every acquire as a release
             raise RuntimeError("RLock.release called too many times")
+        print(f"\n{datetime.datetime.now().isoformat()} Lock {self} released, new count={self._count}", file=textual_log)
         if self._owner is task:
             if not self._count:
                 self._owner = None

and filtering out the locks that were acquired and then successfully released shows these 3 interesting events, the last one happening immediately before a hang of App.CLOSE_TIMEOUT seconds:

2024-06-12T14:27:53.508720 Acquiring <textual.rlock.RLock object at 0x7f6469f4b290> at:
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/app.py", line 1572, in run_async
    await app._shutdown()
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/app.py", line 2788, in _shutdown
    await self._close_all()
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/app.py", line 2763, in _close_all
    async with self._dom_lock:
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/rlock.py", line 51, in __aenter__
    await self.acquire()
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/rlock.py", line 24, in acquire
    traceback.print_stack(limit=5, file=textual_log)

2024-06-12T14:27:53.509372 Lock <textual.rlock.RLock object at 0x7f6469f4b290> acquired with count=1

2024-06-12T14:27:53.509662 Acquiring <textual.rlock.RLock object at 0x7f6468d92050> at:
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/widget.py", line 1130, in recompose
    async with self.batch():
  File "/opt/bb/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/widget.py", line 3487, in batch
    async with self.lock:
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/rlock.py", line 51, in __aenter__
    await self.acquire()
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/rlock.py", line 24, in acquire
    traceback.print_stack(limit=5, file=textual_log)

2024-06-12T14:27:53.510188 Lock <textual.rlock.RLock object at 0x7f6468d92050> acquired with count=1

2024-06-12T14:27:53.514009 Acquiring <textual.rlock.RLock object at 0x7f6469f4b290> at:
  File "/opt/bb/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/app.py", line 3356, in prune_widgets_task
    await self._prune_nodes(widgets)
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/app.py", line 3383, in _prune_nodes
    async with self._dom_lock:
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/rlock.py", line 51, in __aenter__
    await self.acquire()
  File "/tmp/memray-issue/lib/python3.11/site-packages/textual/rlock.py", line 24, in acquire
    traceback.print_stack(limit=5, file=textual_log)

So:

App._close_all acquires self._dom_lock
Widget.recompose (for the Footer widget) acquires self.lock (a different lock), but never releases it
App._prune_nodes tries to acquire self._dom_lock and blocks, because the task calling App._close_all never released it

godlygeek commented 2 weeks ago

OK, here's a minimal reproducer for the issue:

Create a crash_reproducer.py with these contents:

from textual.app import App
from textual.binding import Binding
from textual.widgets import Footer

class MyApp(App[None]):
    BINDINGS = [
        Binding(key="q", action="quit", description="Quit the app"),
    ]

    def compose(self):
        yield Footer()

app = MyApp()
app.run()

And then run it with a "q" passed to stdin:

echo q | python crash_reproducer.py

willmcgugan commented 2 weeks ago

Thanks for doing the legwork on this. Looking at it now.

willmcgugan commented 2 weeks ago

In the meantime, an alternative to piping stdin is to use the TEXTUAL_PRESS env var to simulate keys.

TEXTUAL_PRESS=q python crash_reproducer.py

This is probably better in a test environment, because when stdin is not a tty it can subtly change how the app behaves.

willmcgugan commented 2 weeks ago

Please try Textual 0.68.0

willmcgugan commented 1 week ago

Closing, assumed fix.

github-actions[bot] commented 1 week ago

Don't forget to star the repository!

Follow @textualizeio for Textual updates.

godlygeek commented 1 week ago

Sorry, yep, I can confirm it's fixed. And we've switched to TEXTUAL_PRESS as well, so hopefully our tests are more robust in the future.

Textualize / textual

Memray's test suite fails due to a deadlock with Textual 0.67.1 #4643