Test suite sometimes segfaults due to XQueryExtension() getting NULL Display

mgorny commented 1 year ago

General information:

OS name: Gentoo Linux
OS version: n/a
OS architecture: amd64
Resolutions: n/a
Python version: 3.11.3 (but also reproduced with 3.9.16)
MSS version: 8.0.3 and git ef97a27cbeae46cbed8f348679cd9729b4755fc0

For GNU/Linux users:

Display server protocol and version, if known: Xvfb from xorg-server 21.1.8
Desktop Environment: n/a
Composite Window Manager name and version: n/a

Description of the warning/error

When running the test suite under Xvfb, various tests sometimes cause it to segfault seemingly randomly, e.g.:

$ python -m pytest 
========================================================= test session starts =========================================================
platform linux -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /tmp/python-mss/.venv/bin/python
cachedir: .pytest_cache
rootdir: /tmp/python-mss
configfile: setup.cfg
plugins: cov-4.0.0
collected 73 items / 1 skipped                                                                                                        

src/tests/test_bgra_to_rgb.py::test_bad_length PASSED                                                                           [  1%]
src/tests/test_bgra_to_rgb.py::test_good_types PASSED                                                                           [  2%]
src/tests/test_cls_image.py::test_custom_cls_image PASSED                                                                       [  4%]
src/tests/test_find_monitors.py::test_get_monitors PASSED                                                                       [  5%]
src/tests/test_find_monitors.py::test_keys_aio PASSED                                                                           [  6%]
src/tests/test_find_monitors.py::test_keys_monitor_1 PASSED                                                                     [  8%]
src/tests/test_find_monitors.py::test_dimensions PASSED                                                                         [  9%]
src/tests/test_get_pixels.py::test_grab_monitor PASSED                                                                          [ 10%]
src/tests/test_get_pixels.py::test_grab_part_of_screen Fatal Python error: Segmentation fault

Current thread 0x00007f4df6e3c740 (most recent call first):
  File "/tmp/python-mss/src/mss/linux.py", line 359 in _is_extension_enabled
  File "/tmp/python-mss/src/mss/linux.py", line 321 in __init__
  File "/tmp/python-mss/src/mss/factory.py", line 34 in mss
  File "/tmp/python-mss/src/tests/test_get_pixels.py", line 25 in test_grab_part_of_screen
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/python.py", line 1799 in runtest
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/main.py", line 323 in _main
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/config/__init__.py", line 166 in main
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/tmp/python-mss/.venv/lib/python3.11/site-packages/pytest/__main__.py", line 5 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, PIL._imaging (total: 14)
Segmentation fault (core dumped)

gdb suggests that XQueryExtension() is receiving a NULL pointer as display:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f4df66a38df in __pthread_kill_internal (signo=11, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f4df6653af2 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  <signal handler called>
#4  XQueryExtension (dpy=0x0, name=0x7f4dd24e2870 "RANDR", major_opcode=0x7f4dd247e920, first_event=0x7f4dd247ea40, 
    first_error=0x7f4dd247e800) at /usr/src/debug/x11-libs/libX11-1.8.4-r1/libX11-1.8.4/src/QuExt.c:48
#5  0x00007f4df521228a in ?? () from /usr/lib64/libffi.so.8
#6  0x00007f4df52116a4 in ?? () from /usr/lib64/libffi.so.8
#7  0x00007f4df5211dfd in ffi_call () from /usr/lib64/libffi.so.8
#8  0x00007f4df5255060 in ?? () from /usr/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so
#9  0x00007f4df524e388 in ?? () from /usr/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so
#10 0x00007f4df694ec0b in _PyObject_MakeTpCall () from /usr/lib64/libpython3.11.so.1.0
#11 0x00007f4df68fdeb0 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
[…]

This seems to be some ugly race condition; I'm trying to investigate further.

Other details

I'm testing via:

python -m venv .venv
. .venv/bin/activate
pip install -r tests-requirements.txt
export DISPLAY=:1
Xvfb :1 &
python -m pytest

BoboTiG commented 1 year ago

Yes, I see it on a regular basis on the GitHub CI, but not on my computer. If you can find something to help improve the test suite's robustness, I would be happy to merge a patch, or give a hand :)

mgorny commented 1 year ago

I'm rebuilding Python with debug symbols. What really doesn't seem to make sense is that display is <mss.linux.LP_Display object at 0x7fa3d81daf90>, so it's not None.

BoboTiG commented 1 year ago

I guess it is an underlying object of display that may cause an issue 🤔

BoboTiG commented 1 year ago

Do you reproduce if you add a sleep(1) after Xvfb start?

mgorny commented 1 year ago

> Do you reproduce if you add a sleep(1) after Xvfb start?

I have the same Xvfb started two hours ago ;-).

Hmm, I suspect POINTER() may hide an actual null pointer inside. Trying to figure out how to get its address/value.

mgorny commented 1 year ago

Ah, it has a boolean value. So if I add assert self._handles.display just below XOpenDisplay(), I sometimes get a failed assertion. That is sufficient to prevent segfaults, but if I understand your comment from the other bug correctly, shouldn't a failing self.xlib.XOpenDisplay() trigger an exception then?
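
To illustrate the point about the boolean value: a ctypes pointer instance that wraps NULL is still a Python object, so it is never None, but it evaluates as false. The sketch below uses only ctypes; the Display class here is an opaque stand-in written for this example, not the one from mss.

```python
import ctypes


class Display(ctypes.Structure):
    """Opaque stand-in for Xlib's Display struct (illustrative only)."""


# A pointer type instantiated without arguments is a NULL pointer.
null_display = ctypes.POINTER(Display)()

# A pointer to a real object, for comparison.
value = ctypes.c_int(42)
valid = ctypes.pointer(value)

print(null_display is None)  # False: it is a regular Python object
print(bool(null_display))    # False: the wrapped address is NULL
print(bool(valid))           # True: the wrapped address is non-NULL
```

This is why a check such as `display is None` passes even though the C function later receives dpy=0x0.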

BoboTiG commented 1 year ago

I think the null pointer actually breaks Python before the error handler has a chance to do something. It may be an incorrect guess though.

BoboTiG commented 1 year ago

Because the error handler will be called by ctypes after the X function returns something. Here, the null pointer in POINTER() crashes the X function call.

mgorny commented 1 year ago

Actually, I think error handlers may not be used by XOpenDisplay at all — the docs are so bad :-(.

In any case, what do you think about adding:

        if not self._handles.display:                                            
            raise ScreenShotError(f"Unable to open display: {display!r}.")

While it doesn't solve the underlying issue, an explicit exception is still better than segv. I can submit a PR if you wish.
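
A minimal, self-contained sketch of how such a check behaves. Everything here is an assumption made for illustration: ScreenShotError and Display are local stand-ins rather than the mss classes, and open_display() with its injectable xopen argument is a hypothetical helper so the example can run without an X server.

```python
import ctypes


class ScreenShotError(Exception):
    """Local stand-in for the exception named in the proposal."""


class Display(ctypes.Structure):
    """Opaque stand-in for Xlib's Display struct (illustrative only)."""


def open_display(xopen, display):
    """Call xopen(display) and raise instead of returning a NULL handle.

    xopen is a hypothetical injectable callable standing in for XOpenDisplay.
    """
    handle = xopen(display)
    if not handle:  # a NULL ctypes pointer is falsy
        raise ScreenShotError(f"Unable to open display: {display!r}.")
    return handle


def failing_xopen(_display):
    """Simulate a failed XOpenDisplay call by returning a NULL pointer."""
    return ctypes.POINTER(Display)()


try:
    open_display(failing_xopen, b":99")
except ScreenShotError as exc:
    message = str(exc)
    print(message)
```

With the check in place, the failure surfaces as a catchable Python exception at the point where the display is opened, rather than as a crash in a later Xlib call.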

BoboTiG commented 1 year ago

As stated in https://www.x.org/releases/X11R7.7/doc/man/man3/XOpenDisplay.3.xhtml:

If XOpenDisplay does not succeed, it returns NULL.

So yes, let's go with your fix, that seems like the way to go 👍🏻

BoboTiG commented 1 year ago

Having a way to debug why it failed would be cool, but I don't see how we could get more details.

mgorny commented 1 year ago

Oh, I now see that the test suite is using xvfbwrapper — at least partially, that is. I was confused because a lot of tests fail without DISPLAY being set. I'm guessing the problem is indeed that xvfbwrapper.start() returns too soon, and so adding a little delay should reduce the risk of failures.

That said, I'm a bit confused by this. Apparently we need a display to run the test suite at all, so why do some of the tests use an additional layer of Xvfb? Perhaps the simplest solution would be to just use the outer $DISPLAY there instead of starting Xvfb. That should avoid the problem since the server is started earlier, and would avoid adding delays to all tests.

That said, the bug probably lies in xvfbwrapper itself. FWICS it's using a pretty "obsolete" method of starting Xvfb, compared e.g. to xvfb-run. Unfortunately, the package seems to be dead — there is an open PR actually fixing race conditions that received no reply.

So, what do you think about removing xvfbwrapper use and using the Xvfb instance started as part of GHA? I could also try to switch GHA to use xvfb-run to be even more reliable.
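
As an aside on the "start() returns too soon" guess above: instead of a fixed delay, a test helper could poll until the server looks ready. The sketch below is an assumption-laden illustration, not something the project or xvfbwrapper does. wait_until(), x_socket_path() and wait_for_display() are hypothetical names, and checking for the conventional Linux socket path under /tmp/.X11-unix is only an approximation of readiness (the socket may exist slightly before the server accepts connections).

```python
import os
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def x_socket_path(display_number):
    """Conventional Unix-socket path of a local X display on Linux."""
    return f"/tmp/.X11-unix/X{display_number}"


def wait_for_display(display_number, timeout=5.0):
    """Return True once the display's socket file shows up, else False."""
    path = x_socket_path(display_number)
    return wait_until(lambda: os.path.exists(path), timeout=timeout)
```

A caller would run something like wait_for_display(1) right after launching Xvfb :1 and fail the test setup early if it returns False. Polling keeps the common case fast while still bounding the wait.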

mgorny commented 1 year ago

Hmm, now I see that xvfbwrapper is used to start the display at a specific screen size.

In that case, perhaps PyVirtualDisplay could be a better alternative? In any case, the code there seems more robust, the package has had recent commits and it's used by pytest-xvfb, so I suppose it has some popularity.

BoboTiG commented 1 year ago

> Hmm, now I see that xvfbwrapper is used to start the display at specific screen size.
>
> In that case, perhaps PyVirtualDisplay could be a better alternative? In any case, the code there seems more robust, the package has had recent commits and it's used by pytest-xvfb, so I suppose it has some popularity.

Good idea 👍🏻

mgorny commented 1 year ago

Should I make a PR to switch or do you want to do it?

BoboTiG commented 1 year ago

> Should I make a PR to switch or do you want to do it?

If you have time to do it, let's go :)

mgorny commented 1 year ago

Damn, it seems that pyvirtualdisplay also has race conditions :-(. I suppose I'll try to fix it first then.

BoboTiG commented 1 year ago

If the check you added in the init is enough, then let's keep it as-is. It works pretty well :)

mgorny commented 1 year ago

> If the check you added in the init is enough, then lets keep it as-is. It works pretty well :)

Unfortunately, the check only replaces segv with some test failures. Besides, given that xvfbwrapper is unmaintained, I'd like to be able to remove it from Gentoo sooner rather than later (i.e. before it causes even more issues). I've created #249 to use PyVirtualDisplay; it also makes the code a bit shorter ;-).

BoboTiG commented 1 year ago

Good argument about the removal from Gentoo, you should have started with that ;)

BoboTiG / python-mss

Test suite sometimes segfaults due to XQueryExtension() getting NULL Display #246