HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works

Account for time spent in garbage collection #3979

Closed · jobh closed this 3 weeks ago

jobh commented 1 month ago

Avoid flaky DeadlineExceeded due to garbage collection.

See https://github.com/HypothesisWorks/hypothesis/issues/3975#issuecomment-2102995716.
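
For context, here is a minimal sketch of the general technique (illustrative only, not the PR's actual code), using CPython's gc.callbacks API to accumulate the wall-clock time spent in collections so it can be subtracted before a deadline check:

import gc
import time

# Illustrative sketch, not Hypothesis's implementation. CPython calls each
# function in gc.callbacks with phase "start"/"stop" around every collection,
# so we can accumulate the total time spent collecting.
_gc_start = 0.0
gc_time_total = 0.0

def _track_gc_time(phase, info):
    global _gc_start, gc_time_total
    if phase == "start":
        _gc_start = time.perf_counter()
    else:  # phase == "stop"
        gc_time_total += time.perf_counter() - _gc_start

gc.callbacks.append(_track_gc_time)

A deadline check would then compare something like elapsed - gc_time_total against the limit. PyPy exposes GC hooks through a different interface, which is part of what makes a portable version fiddly.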

jobh commented 1 month ago

Current test failures seem unrelated; example:

[edit] No, they weren't unrelated; there is just randomness creeping into the tracing as well. @Zac-HD, is there a way to selectively turn off tracing in the callback? See the commit below. (One possible approach is sketched after the output.)

(Pdb++) print(expected.strip())
Falsifying example: test_inquisitor_comments_basic_fail_if_not_all(
    # The test sometimes passed when commented parts were varied together.
    a='',  # or any other generated value
    b='',  # or any other generated value
    c='',  # or any other generated value
)
(Pdb++) print(got)
Falsifying example: test_inquisitor_comments_basic_fail_if_not_all(
    # The test sometimes passed when commented parts were varied together.
    a='',  # or any other generated value
    b='',  # or any other generated value
    c='',  # or any other generated value
)
Explanation:
    These lines were always and only run by failing examples:
        /home/jobh/mambaforge/envs/hypothesis/lib/python3.11/reprlib.py:56
        /home/jobh/mambaforge/envs/hypothesis/lib/python3.11/reprlib.py:61
        /home/jobh/src/pytest/src/_pytest/assertion/util.py:66
        /home/jobh/src/pytest/src/_pytest/assertion/util.py:72
        /home/jobh/src/pytest/src/_pytest/assertion/util.py:84
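
Regarding the tracing question above, one conceivable approach (an assumption, not necessarily what this PR ends up doing) is to suspend the current thread's trace function for the duration of the callback body:

import sys

# Hypothetical sketch: drop the active trace function while the GC callback
# runs, so the callback's own lines don't pollute coverage/scrutineer traces.
# Note that frames already being traced can still report some events.
def _untraced_gc_callback(phase, info):
    tracer = sys.gettrace()
    sys.settrace(None)
    try:
        pass  # ...timing bookkeeping would go here...
    finally:
        sys.settrace(tracer)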
jobh commented 1 month ago

All tests passed, finally. There were more complications than I expected; the two major ones are commented above.

The scrutineer part is probably just due to more variation in traces, which can be seen as a good thing, since the variation covers more real-world usage (although some globbing would be nice to have in that list).

The recursion-depth thing may or may not be a problem in practice; I just don't know. I set out to kill one source of flakiness, only to create another. Not happy with that. [edit: that one is ok actually; see below.]

Anyway: Ready for re-review :-)

Zac-HD commented 1 month ago

(jumping on a flight to PyCon now, but the Scrutineer thing is not a problem, and I'll aim to review the unraisable problem later)

jobh commented 1 month ago

> (jumping on a flight to PyCon now, but the Scrutineer thing is not a problem, and I'll aim to review the unraisable problem later)

Thanks @Zac-HD! Have fun at PyCon!

Zac-HD commented 1 month ago

(just want to note that I haven't forgotten about this; it's just tricky enough that I need a dedicated chunk of time to think about it and make sure I've tested all the edge cases. thanks for your patience 🙏)

Zac-HD commented 1 month ago

Apparently GitHub didn't send my review correctly last week??? Sorry about that, I should have spotted it.

jobh commented 1 month ago

Aha, I guess the comments are batched up until the review is completed. No worries from my side, just a bit of extra work for you to fix the things I never did :grin:

Zac-HD commented 1 month ago

I'm also somewhat concerned by the number of failures I'm seeing on e5a23905210c04d13e9f98c63a5e38bd065d00e1, which looked like a very small change: crashed workers on Windows and Linux, this weird error on several builds, prints-on-healthcheck test failures on several builds...

It looks like several of these would be fixable, but I'm wondering whether the benefits of measuring GC time are actually worth the fragility vs e.g. running gc.collect() before each 100th test execution.
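
The alternative is much simpler; a sketch (illustrative names, not actual Hypothesis code):

import gc

# Instead of measuring GC time, force a collection periodically so that no
# single test execution absorbs the cost of one large accumulated collection
# inside its deadline window.
_executions = 0

def before_test_execution():
    global _executions
    _executions += 1
    if _executions % 100 == 0:
        gc.collect()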

jobh commented 1 month ago

> It looks like several of these would be fixable, but I'm wondering whether the benefits of measuring GC time are actually worth the fragility vs e.g. running gc.collect() before each 100th test execution.

Yep. Let's put this on hold for a bit; when I have more time, I'll try to understand what happens and why. The actual execution-time change seems so small, but sometimes it just doesn't work.

jobh commented 1 month ago

I think it's ok actually.

The NaN infects every timing result afterwards, which is why deadlines fail. That's fixable.
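
To see why a single bad sample is enough (plain Python behaviour, not PR code): NaN compares false against everything and propagates through arithmetic, so every later check that touches it is misleading.

import math

elapsed = float("nan")          # one poisoned timing sample
deadline = 0.2

print(elapsed > deadline)       # False -- "not over the deadline"...
print(elapsed <= deadline)      # False -- ...but not under it either
print(math.isnan(elapsed + 1))  # True -- and it spreads through any sum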

While I don't fully understand the crashes, I believe they are likely caused by memory exhaustion due to the new test's deep recursion inside gc. We need to do the recursion outside and then probe gc at each stack level; otherwise we're not testing the shallowest-level callback properly (a sketch of that shape follows below).

[edit: the test still crashes, even with no callback attached (i.e., no gc accounting)]
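
A sketch of the restructured test shape described above (illustrative only): recurse in the test itself and trigger a collection at every depth, so that any registered callback runs at each stack level, including the shallowest ones.

import gc

def recurse_and_probe(depth):
    # Trigger a collection (and hence any gc.callbacks) at this stack level
    # before going deeper, so shallow levels are exercised too.
    gc.collect()
    if depth > 0:
        recurse_and_probe(depth - 1)

recurse_and_probe(50)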

jobh commented 1 month ago

I spent some time down the rabbit hole of stack exhaustion.

My takeaway is that determinism at the edge of the stack is by chance only. On CPython, things like trivial comparisons require an extra stack frame on the first few executions but not thereafter. On PyPy, the first pass consumes far more stack frames, probably for JITting. Et cetera. I'll paste a basic test below that is flaky on both implementations, without involving hypothesis at all.

I also found that if a hypothesis test fails, the final inner-test execution has a much lower recursion limit, so there's probably a call site we've forgotten to wrap in ensure_free_stackframes. I'll paste a test for that as well. [added note: this is when the test is executed directly by core.execute_once instead of via ConjectureRunner.__stoppable_test_function]


Test 1 (no hypothesis)

import gc

import pytest

@pytest.fixture(autouse=True)
def no_gc_collect():
    # Ensure that there are no collections during the test
    # (it makes no difference, so this is just for clarity)
    gc.disable()
    yield
    gc.enable()

def probe_depth():
    try:
        return probe_depth() + 1
    except RecursionError:
        return 0

def test_consistent_stack_depth():  # Fails on PyPy
    assert probe_depth() == probe_depth()

@pytest.mark.parametrize("_", range(5))
def test_consistent_recursionerror(_):  # Fails on CPython, from the third execution
    def probe_depth():
        try:
            return probe_depth() + 1
        except RecursionError:
            return 0

    def at_depth(depth, fn):
        if depth <= 1:  # save 1 frame for fn()
            return fn()
        else:
            return at_depth(depth - 1, fn)

    def test():
        # On CPython, this conditional requires an extra stack frame
        # the first few executions of the test, thereafter not.
        if 0 < 0:
            pass

    max_depth = probe_depth()

    at_depth(max_depth - 1, test)

    with pytest.raises(RecursionError):
        at_depth(max_depth, test)

===================================================== short test summary info =====================================================
FAILED test_stack.py::test_consistent_recursionerror[3] - Failed: DID NOT RAISE <class 'RecursionError'>
FAILED test_stack.py::test_consistent_recursionerror[4] - Failed: DID NOT RAISE <class 'RecursionError'>
=================================================== 2 failed, 4 passed in 0.09s ===================================================

Test 2 (recursion limit changes)

import sys

import pytest

from hypothesis import given, strategies as st

def test_consistent_stack_depth():
    limits = []

    @given(st.booleans())
    def inner(_):
        limits.append(sys.getrecursionlimit())
        assert False

    try:
        inner()
    except Exception:
        pass

    assert all(l == limits[0] for l in limits), limits

===================================================== short test summary info =====================================================
FAILED test_hypothesis_stack.py::test_consistent_stack_depth - AssertionError: [2042, 2041, 2045, 1000]
======================================================== 1 failed in 0.10s ========================================================
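
For reference, a simplified sketch of what a helper like ensure_free_stackframes does (not Hypothesis's actual code): grant recursion headroom relative to the current stack depth, then restore the previous limit.

import sys
from contextlib import contextmanager

def _current_depth():
    frame, depth = sys._getframe(), 0
    while frame is not None:
        frame, depth = frame.f_back, depth + 1
    return depth

@contextmanager
def ensure_headroom(frames=1000):  # illustrative stand-in name
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(_current_depth() + frames)
    try:
        yield
    finally:
        sys.setrecursionlimit(old_limit)

If a failing path skips such a wrapper, the inner test runs with whatever limit happens to be in effect, which would explain the [2042, 2041, 2045, 1000] sequence above.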
jobh commented 1 month ago

So what does this all mean for this PR? I think it's safe, with the following remarks:

That said, I don't want to be pushy on this subject. If we suspect it will be a maintenance headache, we should drop it. What's your gut feeling, cleanup-and-finish or drop?

Zac-HD commented 1 month ago

If we can get to a point where the code + tests look clean and the tests pass, I'll be happy to merge it 🙂

We're not making any public-API changes, so we can always revert later if this turns out to be a bad idea.

jobh commented 1 month ago

@Zac-HD I think it's ready now.

Review guidance:

Additionally,

jobh commented 1 month ago

Btw, I think the remaining failure on win-3.9 is unrelated and will go away by itself, but it looks slightly interesting, so I'll record it for posterity:

  File "D:\a\hypothesis\hypothesis\hypothesis-python\tests\nocover\test_duplication.py", line 61, in test_mostly_does_not_duplicate_blocks_even_when_failing
    assert set(counts.values()) == {1, 2, 3}
AssertionError: assert {1, 4} == {1, 2, 3}
tybug commented 1 month ago

> New env variable (for internal use) HYPOTHESIS_NO_TRACEBACK_TRIM

Yes please! I've hacked this locally many times while tracking down internal errors.
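
A hedged usage sketch; the exact semantics (presence vs. value of the variable) are an assumption here:

import os

# Assumed usage: set before Hypothesis formats a traceback so that internal
# frames are kept rather than trimmed. Exact semantics are an assumption.
os.environ["HYPOTHESIS_NO_TRACEBACK_TRIM"] = "1"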

> test_mostly_does_not_duplicate_blocks_even_when_failing

Undoubtedly from #3962. We can just add {1, 4} as a possibility there. I'll do so in #4007 if you don't do it here.

jobh commented 1 month ago

> Undoubtedly from #3962. We can just add {1, 4} as a possibility there. I'll do so in #4007 if you don't do it here.

Thanks @tybug! I think #4007 is the better home for it.