Closed jobh closed 3 weeks ago
Current test failures seem unrelated; example:
[edit] No, it wasn't unrelated; there is just randomness creeping into the tracing as well. @Zac-HD, is there a way to selectively turn off tracing in the callback? See commit below.
```
(Pdb++) print(expected.strip())
Falsifying example: test_inquisitor_comments_basic_fail_if_not_all(
    # The test sometimes passed when commented parts were varied together.
    a='',  # or any other generated value
    b='',  # or any other generated value
    c='',  # or any other generated value
)
(Pdb++) print(got)
Falsifying example: test_inquisitor_comments_basic_fail_if_not_all(
    # The test sometimes passed when commented parts were varied together.
    a='',  # or any other generated value
    b='',  # or any other generated value
    c='',  # or any other generated value
)
Explanation:
These lines were always and only run by failing examples:
    /home/jobh/mambaforge/envs/hypothesis/lib/python3.11/reprlib.py:56
    /home/jobh/mambaforge/envs/hypothesis/lib/python3.11/reprlib.py:61
    /home/jobh/src/pytest/src/_pytest/assertion/util.py:66
    /home/jobh/src/pytest/src/_pytest/assertion/util.py:72
    /home/jobh/src/pytest/src/_pytest/assertion/util.py:84
```
All tests passed, finally. There were more complications than I expected; the two major ones are commented above.
The scrutineer part is probably just due to more variation in traces, which can be seen as a good thing, since the variation covers more of real-world usage (although some globbing would be nice to have in that list).
The recursion-depth thing may or may not be a problem in practice; I just don't know. I set out to kill one source of flakiness, only to create another. Not happy with that.
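As an aside, the globbing wished for above could be as simple as filtering the reported paths with the stdlib `fnmatch` module. A hypothetical sketch (the entry list and ignore patterns are made up for illustration, and this is not Scrutineer's actual API):

```python
import fnmatch

# Hypothetical file:line entries, shaped like the
# "always and only run by failing examples" report above.
lines = [
    "/usr/lib/python3.11/reprlib.py:56",
    "/usr/lib/python3.11/reprlib.py:61",
    "/home/jobh/src/pytest/src/_pytest/assertion/util.py:66",
    "/home/jobh/src/myproject/app.py:12",
]

# Glob patterns for locations we consider noise (stdlib, pytest internals).
ignore = ["*/python3.*/*", "*/_pytest/*"]

filtered = [l for l in lines if not any(fnmatch.fnmatch(l, p) for p in ignore)]
print(filtered)  # → ['/home/jobh/src/myproject/app.py:12']
```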
Anyway: Ready for re-review :-)
(jumping on a flight to PyCon now, but the Scrutineer thing is not a problem, and I'll aim to review the unraisable problem later)
Thanks @Zac-HD! Have fun at PyCon!
(just want to note that I haven't forgotten about this; it's just tricky enough that I need a dedicated chunk of time to think about it and make sure I've tested all the edge cases. thanks for your patience 🙏)
Apparently GitHub didn't send my review correctly last week??? Sorry about that, I should have spotted it.
Aha, I guess the comments are batched up until the review is completed. No worries from my side, just a bit of extra work for you to fix the things I never did :grin:
I'm also somewhat concerned by the number of failures I'm seeing on e5a23905210c04d13e9f98c63a5e38bd065d00e1, which looked like a very small change - crashed worker on Windows and linux, this weird error on several builds, prints-on-healthcheck tests on several builds...
It looks like several of these would be fixable, but I'm wondering whether the benefits of measuring GC time are actually worth the fragility vs e.g. running `gc.collect()` before each 100th test execution.
Yep. Let's wait a bit with this, and then when I have more time I'll try to understand what happens and why. The actual execution-time change seems so small, but sometimes it just doesn't work.
I think it's ok actually. The NaN infects every timing result afterwards, which is why deadlines fail. That's fixable.
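The infection is ordinary IEEE 754 behaviour: once a NaN enters the arithmetic, every derived timing value is NaN, and every comparison involving NaN is False, so deadline logic built on those values stops behaving sensibly. A minimal illustration (the variable names are illustrative only):

```python
import math

gc_time = float("nan")       # a single bad GC-time sample
elapsed = 0.005              # 5 ms of wall-clock time
runtime = elapsed - gc_time  # NaN infects the subtraction

assert math.isnan(runtime)
# Both directions of a deadline comparison come back False:
print(runtime > 0.2, runtime <= 0.2)  # → False False
```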
While I don't fully understand the crashes, I believe they are likely caused by memory exhaustion due to the new test's deep recursion inside gc. We need to do the recursion outside and then probe gc at each stack level, otherwise we're not testing the shallowest-level callback properly.
[edit: the test still crashes, even with no callback attached (i.e., no gc accounting)]
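The shape of the fix described above, recursing in plain Python and merely probing gc at every level rather than recursing inside gc, might be sketched like this (an illustration of the idea, not the actual test):

```python
import gc

def probe_gc_at_depth(depth):
    # The recursion happens in plain Python; gc is only probed (never
    # recursed into) at each level, including the shallowest frame.
    gc.collect()
    if depth > 0:
        return probe_gc_at_depth(depth - 1) + 1
    return 0

print(probe_gc_at_depth(50))  # → 50
```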
I spent some time down the rabbit hole of stack exhaustion.
My takeaway is that determinism at the edge of the stack is by chance only. On CPython, things like trivial comparisons require a stack frame on the first few executions and thereafter not. On PyPy, the first pass consumes a lot more stack frames, probably for JITting. Et cetera. I'll paste a basic test below that is flaky on both implementations, without involving hypothesis at all.
I also found that if a hypothesis test fails, the final inner-test execution has a much lower recursion limit, so there's probably a call site we've forgotten to `ensure_free_stackframes`. I'll paste a test for that as well. [added note: this is when the test is executed directly by `core.execute_once` instead of via `ConjectureRunner.__stoppable_test_function`]
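For readers unfamiliar with it, `ensure_free_stackframes` is Hypothesis-internal; the general pattern is a context manager that raises the recursion limit relative to the current stack depth and restores it on exit. A simplified sketch of that pattern (not the real implementation):

```python
import sys

class free_stackframes:
    """Guarantee roughly `n` free frames above the current stack depth."""

    def __init__(self, n=1000):
        self.n = n

    def __enter__(self):
        self.old_limit = sys.getrecursionlimit()
        # Count the frames currently on the stack...
        depth, frame = 0, sys._getframe()
        while frame is not None:
            depth += 1
            frame = frame.f_back
        # ...and leave `n` frames of headroom above them.
        sys.setrecursionlimit(depth + self.n)

    def __exit__(self, *exc):
        sys.setrecursionlimit(self.old_limit)

before = sys.getrecursionlimit()
with free_stackframes(2000):
    pass  # run the recursion-hungry code here
assert sys.getrecursionlimit() == before  # limit restored on exit
```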
Test 1 (no hypothesis)
```python
import gc

import pytest


@pytest.fixture(autouse=True)
def no_gc_collect():
    # Ensure that there are no collections during the test
    # (it makes no difference, so this is just for clarity)
    gc.disable()
    yield
    gc.enable()


def probe_depth():
    try:
        return probe_depth() + 1
    except RecursionError:
        return 0


def test_consistent_stack_depth():  # Fails on PyPy
    assert probe_depth() == probe_depth()


@pytest.mark.parametrize("_", range(5))
def test_consistent_recursionerror(_):  # Fails on CPython, from the third execution
    def probe_depth():
        try:
            return probe_depth() + 1
        except RecursionError:
            return 0

    def at_depth(depth, fn):
        if depth <= 1:  # save 1 frame for fn()
            return fn()
        else:
            return at_depth(depth - 1, fn)

    def test():
        # On CPython, this conditional requires an extra stack frame
        # the first few executions of the test, thereafter not.
        if 0 < 0:
            pass

    max_depth = probe_depth()
    at_depth(max_depth - 1, test)
    with pytest.raises(RecursionError):
        at_depth(max_depth, test)
```

```
===================================================== short test summary info =====================================================
FAILED test_stack.py::test_consistent_recursionerror[3] - Failed: DID NOT RAISE <class 'RecursionError'>
FAILED test_stack.py::test_consistent_recursionerror[4] - Failed: DID NOT RAISE <class 'RecursionError'>
=================================================== 2 failed, 4 passed in 0.09s ===================================================
```
Test 2 (recursion limit changes)
```python
import sys

import pytest

from hypothesis import given, strategies as st


def test_consistent_stack_depth():
    limits = []

    @given(st.booleans())
    def inner(_):
        limits.append(sys.getrecursionlimit())
        assert False

    try:
        inner()
    except Exception:
        pass

    assert all(l == limits[0] for l in limits), limits
```

```
===================================================== short test summary info =====================================================
FAILED test_hypothesis_stack.py::test_consistent_stack_depth - AssertionError: [2042, 2041, 2045, 1000]
======================================================== 1 failed in 0.10s ========================================================
```
So what does this all mean for this PR? I think it's safe, assuming the `UnraisableException` problem is solved. That said, I don't want to be pushy on this subject. If we suspect it will be a maintenance headache, we should drop it. What's your gut feeling, cleanup-and-finish or drop?
If we can get to a point where the code + tests look clean and the tests pass, I'll be happy to merge it 🙂
We're not making any public-API changes, so we can always revert later if this turns out to be a bad idea.
@Zac-HD I think it's ready now.
Review guidance:
- The essential change is `gc_cumulative_time()`, i.e. the actual accounting.
- Additionally, the inner test execution (`execute_once`) is wrapped with `ensure_free_stackframes`, for maximal consistency. Now, it turned out this wasn't enough to remove it from `conjecture/engine.py`, since the one in engine.py also protects data generation. Rather than trying to narrow down exactly where the stack is needed, I just left it in place.
- New env variable (for internal use): `HYPOTHESIS_NO_TRACEBACK_TRIM`. Running in verbose mode is not a realistic option for deeply recursive strategies, but I still sometimes want to see the full stack trace. (I think I just patched it out when working on recursion earlier.)

Btw, I think the remaining failure on win-3.9 is unrelated and will go away by itself, but it looks slightly interesting so I'll record it for posterity:
```
  File "D:\a\hypothesis\hypothesis\hypothesis-python\tests\nocover\test_duplication.py", line 61, in test_mostly_does_not_duplicate_blocks_even_when_failing
    assert set(counts.values()) == {1, 2, 3}
AssertionError: assert {1, 4} == {1, 2, 3}
```
> New env variable (for internal use) `HYPOTHESIS_NO_TRACEBACK_TRIM`

Yes please! I've hacked this locally many times while tracking down internal errors.
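The env-var gate itself is the usual pattern: check `os.environ` and branch in the trimming code. A hypothetical sketch of how such a switch could work (the helper name and the "keep only the last frame" stand-in are made up; this is not Hypothesis's actual code):

```python
import os

def trimmed_traceback(tb_lines):
    # Hypothetical helper: with the escape hatch set, return the full
    # traceback untouched instead of dropping internal frames.
    if os.environ.get("HYPOTHESIS_NO_TRACEBACK_TRIM"):
        return tb_lines
    # Stand-in for the real trimming: keep only the last frame.
    return tb_lines[-1:]

os.environ["HYPOTHESIS_NO_TRACEBACK_TRIM"] = "1"
print(trimmed_traceback(["internal frame", "user frame"]))
# → ['internal frame', 'user frame']
```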
> `test_mostly_does_not_duplicate_blocks_even_when_failing`

This is undoubtedly from #3962. We can just add `{1, 4}` as a possibility there. I'll do so in #4007 if you don't here.
Thanks @tybug ! I think #4007 is the better home for it.
Avoid flaky DeadlineExceeded due to garbage collection.
See https://github.com/HypothesisWorks/hypothesis/issues/3975#issuecomment-2102995716.