HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works
Other
7.39k stars 578 forks source link

Improve caching for misaligned `ConjectureData`s #3977

Closed tybug closed 1 month ago

tybug commented 1 month ago

I was not expecting to go this deep into the caching logic, but it became more and more clear to me while working on #3962 that our internal mark_invalid for misaligned datas actually requires some more thought, and we were only getting away with it before by luck.

The issue goes something like this. Say a test function draws an integer then a boolean. That means a data with an ir prefix of [0, True] is valid and a data of [0, 2] is misaligned. The latter hits an internal mark_invalid and has Status.INVALID. Now, how should [0, 2] be cached? Its resulting ir tree is just [0], since it was invalid "between" 0 and 2 and we don't know what value the test function would have drawn in this scenario. But mapping [0]: invalid in our cache would be incorrect! In this case [0] is actually an overrun because we draw twice. Moreoever, some extensions of [0] like [0, True] are valid while others are not.

A separate problem is how do we get nice caching behavior out of DataTree.simulate_test_function? We would like extensions of [0, 2] to be simulated to [0, 2] as invalid. But DataTree has a fundamental invariant that splitting at differing kwargs/ir types is flaky, and you can only split at differing values. So we cannot store [0, 2] in the standard way, as it would be immediately rejected as flaky alongside [0, True].

This pull special-cases the "invalid because misaligned" caching logic to include the node in the prefix we were invalid at, such that we cache [0, 2]: invalid. It also adds similar special logic for DataTree such that simulate_test_function will conclude_test(Status.INVALID) in all the right places. This already happened in practice for misaligned datas when drawing, except in the case of a null transition, where we need to probe one draw further to see if we would have been invalid.

This pull does not take the extra step of also caching/simulating [0, 1] to invalid, given [0, 2] is invalid. In practice this is possible: if [0, 2] is invalid because 2 has the wrong ir type, then any value at i=1 which differs in ir type can also be simulated to invalid without calling through to the test function. I think we could revisit this later if need be.

I'm not enormously happy with this solution, fwiw. Would love to go with a more principled solution if one exists.

tybug commented 1 month ago

TBH I disagree with RUF005 as a rule in general, but I've complied with it here.