After the first failure, the test is re-run under tracing to try to give more information about the failure. The busy loop is probably close to worst case for tracing performance impact.
To verify, you can add

```python
import sys
sys.settrace(lambda *args: None)
```

at the top of the test. This will effectively disable Hypothesis's tracing, since only one tracer can be attached.
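The "only one tracer" mechanics can be checked directly with the stdlib: CPython keeps a single global trace function slot, so installing a no-op tracer displaces whatever would be installed later. A quick sketch:

```python
import sys

# Install a do-nothing tracer, like the workaround above.
noop = lambda *args: None
sys.settrace(noop)
assert sys.gettrace() is noop  # the no-op tracer now occupies the single slot

# Installing another tracer replaces it; there is no stacking.
other = lambda *args: None
sys.settrace(other)
assert sys.gettrace() is other

sys.settrace(None)  # clean up
```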
Note that tracing is much faster in Python >= 3.12.
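The overhead is easy to measure with the stdlib alone. This is not Hypothesis's actual code, just a sketch of the effect: a line-level tracer (one that returns itself, so it keeps receiving `line` events) slows a CPU-bound busy loop dramatically on Python < 3.12:

```python
import sys
import time

def busy():
    # CPU-bound pure-Python work, similar in spirit to the busy loop in the issue.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

def tracer(frame, event, arg):
    # Returning itself keeps line-level tracing active in every frame.
    return tracer

def timed():
    start = time.perf_counter()
    busy()
    return time.perf_counter() - start

plain = timed()            # no tracer installed
sys.settrace(tracer)
traced = timed()           # same work, under line tracing
sys.settrace(None)

print(f"plain:  {plain * 1000:.1f} ms")
print(f"traced: {traced * 1000:.1f} ms")
```

The exact numbers vary by machine and Python version, but the traced run is reliably much slower.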
And thanks for the detailed investigation! It supplied all the clues :grin:
Ahh yeah, this is it. That does indeed make the issue go away. As does doing:

```python
@settings(phases=[p for p in Phase if p != Phase.explain])
```
This is an unfortunate experience, but I don't think we can really improve on it - any solution would also make the logic more complicated, and therefore make Hypothesis harder to understand. Only a few years until all supported Python versions are fast, though...
Is it worth adding something to the error message? This message:

> Unreliable test timings! On an initial run, this test took 239.58ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 32.53 ms, which did not. If you expect this sort of variability in your test timings, consider turning deadlines off for this test by setting deadline=None.

could be extended, for Python < 3.12, to add:

> If you're testing CPU-bound pure-Python code, consider disabling the explain phase as it can slow down runs (or upgrade to Python 3.12).
Or add a note in the docs? Maybe here https://hypothesis.readthedocs.io/en/latest/settings.html#controlling-what-runs
Running a busy loop takes significantly more time for some runs, leading to flaky `DeadlineExceeded` errors. Here is an example that is significantly reduced from the code in my actual project.

Output: running this multiple times gives fairly consistent results.

```
239.56ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 31.85 ms
241.02ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 33.00 ms
238.56ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 32.03 ms
241.47ms, which exceeded the deadline of 100.00ms, but on a subsequent run it took 33.48 ms
```
Instrumenting with `cProfile` like this, we get (line breaks removed):
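The original profiling snippet was not captured here; a stdlib-only sketch of this kind of instrumentation, with a hypothetical `busy_loop` standing in for the test body, might look like:

```python
import cProfile
import io
import pstats

def busy_loop():
    # Hypothetical stand-in for the CPU-bound test body.
    return [i * i for i in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
busy_loop()
profiler.disable()

# Render the top entries sorted by cumulative time, as pstats does.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```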
The "lag spike" can vary in the number of runs. Sometimes it's just a single run of the test. Not 100% sure but I think the "lag spike" starts on the example immediately after the first error (whether or not the next example also fails).
I tried to reproduce this without `hypothesis`, sticking fairly close to what `hypothesis` is doing, and came up with this. But the "lag spike" doesn't occur; all the timings are 35-36ms.
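Given the explain-phase tracing identified in the comment above, a stdlib-only sketch that *does* reproduce a similar lag spike is to turn line tracing on partway through a series of runs, mimicking the re-run under tracing after the first failure (the `busy` loop and run counts here are illustrative, not the original code):

```python
import sys
import time

def busy():
    total = 0
    for i in range(50_000):
        total += i * i
    return total

def tracer(frame, event, arg):
    # Line-level tracer: returning itself keeps receiving `line` events.
    return tracer

timings = []
for run in range(10):
    if run == 5:
        # Simulate the re-run-under-tracing that follows the first failure.
        sys.settrace(tracer)
    start = time.perf_counter()
    busy()
    timings.append(time.perf_counter() - start)
sys.settrace(None)

for run, t in enumerate(timings):
    print(f"run {run}: {t * 1000:.2f} ms")
```

On Python < 3.12 the runs from index 5 onward are markedly slower than the untraced ones, matching the observed "lag spike" starting right after the first error.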
Details

Python 3.10
Hypothesis 6.103.1 (tried a couple older versions too: 6.103.0 and 6.100.0)

I'm on a fairly old laptop (Core i5), so you might need to drop the deadline or increase the number of runs in the busy loop to reproduce this.
In my actual project I'm trying to test a predicate logic solver and seeing an even starker difference:

```
295.24ms, which exceeded the deadline of 200.00ms, but on a subsequent run it took 21.73 ms
```

about 14x slower.

Possible Causes
Garbage Collection

Doesn't seem like it. Adding `gc.disable()` at the start of the test and `gc.enable()` at the end has no effect. The "lag spike" is still visible in the profile stats when the profiler is in between the `gc.disable()` and `gc.enable()`.

CPU Instruction cache
Maybe, but you would expect just the first couple of runs of the list comprehension in the busy loop to be slow, with later runs getting faster.
Hypothesis test harness code being included in the measured time
Doesn't really make sense because we see the "lag spike" in the profiler timings which are narrowed to just the busy loop.