HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works

Busy loop randomly runs 6x slower causing flaky DeadlineExceeded errors #4009

Closed mscheifer closed 2 weeks ago

mscheifer commented 2 weeks ago

Running a busy loop takes significantly more time on some runs, leading to flaky DeadlineExceeded errors. Here is an example that is significantly reduced from the code in my actual project.

from hypothesis import given, settings
from hypothesis.strategies import sampled_from

@settings(max_examples=20, deadline=100)
@given(sampled_from(range(10)))
def test_reduced(val: int) -> None:
    for _ in range(5000):
        [str(s) for s in range(50)]

    assert val > 5

test_reduced()

Output:

  + Exception Group Traceback (most recent call last):
  |   File "/home/matthew/code/polymyth/tests/test_reduced.py", line 14, in <module>
  |     test_reduced()
  |   File "/home/matthew/code/polymyth/tests/test_reduced.py", line 6, in test_reduced
  |     @given(sampled_from(range(10)))
  |   File "/home/matthew/.cache/pypoetry/virtualenvs/non-package-mode-r4C615j_-py3.10/lib/python3.10/site-packages/hypothesis/core.py", line 1656, in wrapped_test
  |     raise the_error_hypothesis_found
  | exceptiongroup.ExceptionGroup: Hypothesis found 2 distinct failures. (2 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/matthew/.cache/pypoetry/virtualenvs/non-package-mode-r4C615j_-py3.10/lib/python3.10/site-packages/hypothesis/core.py", line 1041, in _execute_once_for_engine
    |     result = self.execute_once(data)
    |   File "/home/matthew/.cache/pypoetry/virtualenvs/non-package-mode-r4C615j_-py3.10/lib/python3.10/site-packages/hypothesis/core.py", line 964, in execute_once
    |     result = self.test_runner(data, run)
    |   File "/home/matthew/.cache/pypoetry/virtualenvs/non-package-mode-r4C615j_-py3.10/lib/python3.10/site-packages/hypothesis/core.py", line 737, in default_executor
    |     return function(data)
    |   File "/home/matthew/.cache/pypoetry/virtualenvs/non-package-mode-r4C615j_-py3.10/lib/python3.10/site-packages/hypothesis/core.py", line 939, in run
    |     return test(*args, **kwargs)
    |   File "/home/matthew/code/polymyth/tests/test_reduced.py", line 6, in test_reduced
    |     @given(sampled_from(range(10)))
    |   File "/home/matthew/.cache/pypoetry/virtualenvs/non-package-mode-r4C615j_-py3.10/lib/python3.10/site-packages/hypothesis/core.py", line 862, in test
    |     raise DeadlineExceeded(
    | hypothesis.errors.DeadlineExceeded: Test took 239.58ms, which exceeds the deadline of 100.00ms
    | 
    | The above exception was the direct cause of the following exception:
    | 
    | Traceback (most recent call last):
    |   File "/home/matthew/.cache/pypoetry/virtualenvs/non-package-mode-r4C615j_-py3.10/lib/python3.10/site-packages/hypothesis/core.py", line 992, in execute_once
    |     raise Flaky(
    | hypothesis.errors.Flaky: Hypothesis test_reduced(val=7) produces unreliable results: Falsified on the first call but did not on a subsequent one
    | Falsifying example: test_reduced(
    |     val=7,
    | )
    | Unreliable test timings! On an initial run, this test took 239.58ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 32.53 ms, which did not. If you expect this sort of variability in your test timings, consider turning deadlines off for this test by setting deadline=None.
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/matthew/code/polymyth/tests/test_reduced.py", line 11, in test_reduced
    |     assert val > 5
    | AssertionError
    | Falsifying example: test_reduced(
    |     val=0,
    | )
    +------------------------------------

Running this multiple times gives fairly consistent results:

239.56ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 31.85 ms
241.02ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 33.00 ms
238.56ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 32.03 ms
241.47ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 33.48 ms

Instrumenting with cProfile like this:

from cProfile import Profile
from pstats import Stats

from hypothesis import given, settings
from hypothesis.strategies import sampled_from

@settings(max_examples=20, deadline=100)
@given(sampled_from(range(10)))
def test_reduced(val: int) -> None:
    with Profile() as pr:
        for _ in range(5000):
            [str(s) for s in range(50)]

    Stats(pr).print_stats(0.1)

    assert val > 5

test_reduced()

We get (line breaks removed):

         5002 function calls in 0.035 seconds
         5002 function calls in 0.236 seconds
         5002 function calls in 0.231 seconds
         5002 function calls in 0.232 seconds
         5002 function calls in 0.230 seconds
         5002 function calls in 0.231 seconds
         5002 function calls in 0.035 seconds
         5002 function calls in 0.035 seconds
         5002 function calls in 0.036 seconds
         5002 function calls in 0.036 seconds
         5002 function calls in 0.037 seconds
         5002 function calls in 0.036 seconds
         5002 function calls in 0.047 seconds
         5002 function calls in 0.035 seconds
         5002 function calls in 0.035 seconds
         5002 function calls in 0.035 seconds

The "lag spike" can vary in the number of runs. Sometimes it's just a single run of the test. Not 100% sure but I think the "lag spike" starts on the example immediately after the first error (whether or not the next example also fails).

I tried to reproduce this without hypothesis but sticking fairly close to what hypothesis is doing and came up with this:

from cProfile import Profile
from pstats import Stats

for a in range(10):
    try:
        with Profile() as pr:
            for _ in range(5000):
                [str(s) for s in range(50)]

        Stats(pr).sort_stats("tottime").print_stats(0.1)

        if a % 2 == 0:
            raise AssertionError()
    except AssertionError as ex:
        print(ex)

But the "lag spike" doesn't occur. All the timings are 35-36ms.

Details

Python 3.10
Hypothesis 6.103.1 (also tried a couple of older versions, 6.103.0 and 6.100.0)
I'm on a fairly old laptop (Core i5), so you might need to drop the deadline or increase the number of iterations in the busy loop to reproduce this.

In my actual project I'm trying to test a predicate logic solver and seeing an even starker difference: 295.24ms, which exceeded the deadline of 200.00ms, but on a subsequent run it took 21.73 ms, about 14x slower.

Possible Causes

Garbage Collection

Doesn't seem like it. Adding gc.disable() at the start of the test and gc.enable() at the end has no effect; the "lag spike" is still visible in the profile stats even though the profiled region sits entirely between the gc.disable() and gc.enable() calls.
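Roughly what that check looked like (a sketch reconstructing it; same reduced test and busy loop as above):

import gc
from cProfile import Profile
from pstats import Stats

from hypothesis import given, settings
from hypothesis.strategies import sampled_from

@settings(max_examples=20, deadline=100)
@given(sampled_from(range(10)))
def test_reduced(val: int) -> None:
    gc.disable()  # rule out garbage-collection pauses during the busy loop
    with Profile() as pr:
        for _ in range(5000):
            [str(s) for s in range(50)]
    gc.enable()

    Stats(pr).print_stats(0.1)

    assert val > 5

test_reduced()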

CPU Instruction cache

Maybe, but then you would expect only the first few iterations of the list comprehension in the busy loop to be slow, with later iterations speeding back up, rather than entire runs staying slow.

Hypothesis test harness code being included in the measured time

Doesn't really make sense, because the "lag spike" shows up in the profiler timings, which cover only the busy loop.
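The same narrowing can also be done without cProfile, by timing the loop directly (a sketch; these timings aren't shown above):

import time

from hypothesis import given, settings
from hypothesis.strategies import sampled_from

@settings(max_examples=20, deadline=100)
@given(sampled_from(range(10)))
def test_reduced(val: int) -> None:
    start = time.perf_counter()
    for _ in range(5000):
        [str(s) for s in range(50)]
    # Measures only the busy loop, excluding any Hypothesis harness code.
    print(f"busy loop took {time.perf_counter() - start:.3f}s")

    assert val > 5

test_reduced()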

jobh commented 2 weeks ago

After the first failure, the test is re-run under tracing to try to give more information about the failure. The busy loop is probably close to worst case for tracing performance impact.

To verify, you can add

import sys
sys.settrace(lambda *args: None)

at the top of the test. This will effectively disable hypothesis' tracing, since only one tracer can be attached.

Note that tracing is much faster in python >= 3.12.
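Another way to confirm it (a sketch; assumes Python < 3.12, where settrace-based tracing is used, and that nothing else such as coverage or a debugger has installed a tracer): print sys.gettrace() inside the test. It should be non-None exactly on the slow, re-run examples.

import sys

from hypothesis import given, settings
from hypothesis.strategies import sampled_from

@settings(max_examples=20, deadline=100)
@given(sampled_from(range(10)))
def test_reduced(val: int) -> None:
    # Hypothesis installs its explain-phase tracer via sys.settrace on
    # Python < 3.12, so this prints a trace function (rather than None)
    # on the re-runs that happen after the first failure.
    print(sys.gettrace())

    for _ in range(5000):
        [str(s) for s in range(50)]

    assert val > 5

test_reduced()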

jobh commented 2 weeks ago

And thanks for the detailed investigation! It supplied all the clues :grin:

mscheifer commented 2 weeks ago

Ahh yeah, this is it. Adding the sys.settrace(lambda *args: None) line does indeed make the issue go away. As does doing:

@settings(phases=[p for p in Phase if p != Phase.explain])
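For reference, applied to the reduced test above (with Phase imported from hypothesis) that looks like:

from hypothesis import Phase, given, settings
from hypothesis.strategies import sampled_from

@settings(
    max_examples=20,
    deadline=100,
    # Skip the explain phase, which re-runs failing examples under tracing.
    phases=[p for p in Phase if p != Phase.explain],
)
@given(sampled_from(range(10)))
def test_reduced(val: int) -> None:
    for _ in range(5000):
        [str(s) for s in range(50)]

    assert val > 5
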
Zac-HD commented 2 weeks ago

This is an unfortunate experience, but I don't think we can really improve on it - any solution would also make the logic more complicated, and therefore make Hypothesis harder to understand. Only a few years until all supported Python versions are fast, though...

mscheifer commented 2 weeks ago

Is it worth adding something to the error message? For example, the message "Unreliable test timings! On an initial run, this test took 239.58ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 32.53 ms, which did not. If you expect this sort of variability in your test timings, consider turning deadlines off for this test by setting deadline=None." could, on Python < 3.12, be extended with "If you're testing CPU-bound pure-Python code, consider disabling the explain phase, as it can slow down runs (or upgrade to Python 3.12)."

Or add a note in the docs? Maybe here https://hypothesis.readthedocs.io/en/latest/settings.html#controlling-what-runs