HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works

Input domain space coverage checking #2268

Closed · maxrothman closed this issue 4 years ago

maxrothman commented 4 years ago

Is there any equivalent in Hypothesis to QuickCheck's classify and cover features? They're helpful in ensuring that functions with special behavior for small regions of the input domain are well-tested. I found @example in the docs, but that only allows me to check a specific point in the input domain, not a region.
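
For context, @example usage looks roughly like this; it pins a single input rather than a whole region (function_under_test here is a placeholder for the real function):

from hypothesis import example, given, strategies as st

@given(st.text())
@example("")  # pins one exact input, not a region of the input domain
def test_function_under_test(s):
    function_under_test(s)  # placeholder for the code under test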

For the uninitiated, this article (from the linked heading) contains a good example of why these features are useful.

rsokl commented 4 years ago

Based on a quick glance at those two QuickCheck features, I believe that hypothesis.event and hypothesis.assume are roughly the analogs that you are looking for.
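
Roughly, they are used like this (the test and function names below are just placeholders):

from hypothesis import assume, event, given, strategies as st

@given(st.text())
def test_function_under_test(s):
    # event() labels this test case; the label counts show up in the
    # statistics output (e.g. pytest --hypothesis-show-statistics),
    # much like classify.
    event("long input" if len(s) > 100 else "short input")
    # assume() discards inputs that fail a precondition, much like
    # QuickCheck's ==> operator.
    assume(s == s.strip())
    function_under_test(s)  # placeholder for the code under test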

maxrothman commented 4 years ago

Thanks for the quick response! Indeed hypothesis.event is sufficient to implement QuickCheck's classify, though the statistics output will be mixed with other event output, which is less than ideal. I don't think cover and assume are the same, though.

Suppose a particular property used hypothesis.strategies.text and I wanted to ensure both very short and very long strings were tested, in addition to the normal random testing. With QuickCheck I could do something like the following:

import Test.QuickCheck

-- cover expectedPercentage condition label (argument order in QuickCheck >= 2.12)
some_test :: Property
some_test = property $ \s ->
  cover 10 (length s > 1000) "very long" .
  cover 10 (length s < 10) "very short" $
  functionUnderTest s

Now, QuickCheck will check that at least 10% of the test cases have a length of more than 1000 and at least 10% have a length of less than 10, with the remaining cases having some length in between.

Unless I'm missing something, there's not a good way to do this using assume.

rsokl commented 4 years ago

Ah yes, you are correct that assume would not afford you that capability. To my knowledge, Hypothesis does not expose any functionality that allows you to emulate the behavior of cover.

I think this capability would most easily be added by modifying Hypothesis's event mechanism and adding a new health check. This health check would fail if an event occurs less frequently than the user specifies.
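
In the meantime, a rough user-level approximation is to count labels yourself and assert on the frequencies in a follow-up test (a sketch that assumes pytest's default file-order execution; function_under_test is a placeholder):

from collections import Counter

from hypothesis import given, strategies as st

label_counts = Counter()

@given(st.text())
def test_function_under_test(s):
    label_counts["short" if len(s) < 10 else "other"] += 1
    function_under_test(s)  # placeholder for the code under test

def test_short_inputs_were_generated_often_enough():
    # Runs after the property test above, so the counter is populated.
    total = sum(label_counts.values())
    assert total and label_counts["short"] >= 0.10 * total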

Other Hypothesis devs would need to weigh in on whether or not this feature fits within the purview of the library's design. However, I suspect that such functionality may be at odds with the following guidance about how Hypothesis generates data:

It is better to think about the data Hypothesis generates as being arbitrary, rather than random. We deliberately generate any valid data that seems likely to cause errors, so you shouldn’t rely on any expected distribution of or relationships between generated data. You can read about “swarm testing” and “coverage guided fuzzing” if you’re interested, because you don’t need to know for Hypothesis!

Zac-HD commented 4 years ago

My advice would be that if you want to guarantee that some corner of the input space is covered, you should write a test just for that.

(a pattern where you use st.data() to draw from a strategy provided via pytest.mark.parametrize makes it easy to share code)
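
Something like this, for instance (the strategies and function_under_test below are just placeholders):

import pytest
from hypothesis import given, strategies as st

@pytest.mark.parametrize(
    "strategy",
    [
        st.text(),               # the general case
        st.text(max_size=9),     # very short strings
        st.text(min_size=1000),  # very long strings
    ],
)
@given(data=st.data())
def test_function_under_test(strategy, data):
    s = data.draw(strategy)
    function_under_test(s)  # placeholder for the code under test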

Other mechanisms are likely to be fragile as we refactor the internals, so we would prefer not to maintain them... and IMO QuickCheck's cover is a bit of a hack for generation anyway; I'd prefer something like Swarm Testing, which doesn't rely on the user tuning it correctly.

maxrothman commented 4 years ago

I'm not familiar with swarm testing or coverage-guided fuzzing, so please excuse my ignorance. Do you recommend any articles or talks on the subject?

So then the recommended approach is to run the same test with multiple differently-parametrized generators? Using pytest.mark.parametrize to avoid duplicating the test itself is a good tip.

FWIW, QuickCheck appears to be doubling down on the use of labels and cover; this talk describes the approach in detail with great examples, and shows how QuickCheck figures out how many tests it needs to run to reach statistical confidence about the specified coverage ratios.

Zac-HD commented 4 years ago

I'm not familiar with swarm testing or coverage-guided fuzzing, so please excuse my ignorance. Do you recommend any articles or talks on the subject?

Swarm Testing: https://agroce.github.io/issta12.pdf ; for coverage-guided fuzzing, read https://danluu.com/testing/ and then the linked notes on AFL. (and see https://xkcd.com/1053/ 😄)

So then the recommended approach is to run the same test with multiple differently-parametrized generators?

If you have specific small areas of the input space you must hit, yes. If you just want to increase the probability, you could one_of(...) a main strategy with one for the smaller area.
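
For example (the size cut-offs here are made up):

from hypothesis import given, strategies as st

# Unioning strategies over-weights the rare regions relative to plain st.text().
biased_text = st.one_of(
    st.text(),
    st.text(max_size=9),     # boost very short strings
    st.text(min_size=1000),  # boost very long strings
)

@given(biased_text)
def test_function_under_test(s):
    function_under_test(s)  # placeholder for the code under test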

FWIW, QuickCheck appears to be doubling down on the use of labels and cover; this talk describes the approach in detail with great examples, and shows how QuickCheck figures out how many tests it needs to run to reach statistical confidence about the specified coverage ratios.

Hmm... I think this is basically prompted by it being painful to write QuickCheck tests (you need a generator, and a newtype, and a validator, and...) - our solution is basically to make writing tests easier (you only need a strategy, and that's as easy as possible).

I could see a "run until you have x examples of this label, y examples of that one, ..." mode being useful. Adaptive sampling is awesome, but I don't see why you'd be after a proportion rather than an absolute number here. For Hypothesis I suspect this would be a net loss by complicating our API significantly but only occasionally being useful.

DRMacIver commented 4 years ago

Actually I've been thinking since almost the beginning that it would be good to have something like QuickCheck's cover; it's just always been fairly low priority. Now would probably be a good time to think about adding it because it plays well with the new swarm testing and targeted testing features. Additionally, it would complement any future plans we have to reintroduce actual coverage-based testing nicely, because we could use exactly the same logic for guiding the generation process.

I'm less keen on adding classify at this point - it's less useful, partially overlaps with event, and really is something we should only think about adding once we've sorted out our reporting UI situation. I'm keen not to clutter up the command line output and UI with more information.

The big difficulty with adding cover is that it's intrinsically the bad kind of randomness because it introduces false positives into your test failures - @rsokl's suggestion of integrating it into the health check mechanism is probably a good one to offset that, and I can think of a couple of design features that might help.

DRMacIver commented 4 years ago

I also agree with what @Zac-HD said that usually you're going to be better off writing multiple tests with more specific generators BTW (and that this is much easier in Hypothesis than in QuickCheck). Using classify is more or less intrinsically a type of rejection sampling, and while I've some ideas to make rejection sampling magically clever that may be appearing in Hypothesis at some point, it's always going to be better to generate the right thing by construction rather than by rejection - rejection sampling will tend to bias your test cases in weird ways.
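
As a concrete contrast (the length bound here is arbitrary):

from hypothesis import strategies as st

# Rejection: generate anything, then throw away what doesn't fit.
# Slow, and it can skew what survives the filter.
by_rejection = st.text().filter(lambda s: len(s) >= 1000)

# Construction: describe the shape you want and generate it directly.
by_construction = st.text(min_size=1000)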

maxrothman commented 4 years ago

Thanks for the resources @Zac-HD, I'll give those a look.

I'm so glad to have sparked what has turned out to be an interesting discussion. Hooray for open source!

The big difficulty with adding cover is that it's intrinsically the bad kind of randomness because it introduces false positives into your test failures

FWIW it seems that QuickCheck has worked around this issue by calculating how many tests it needs to run to gain statistical confidence that positives are not in fact false (see checkCoverage for details).

Using classify is more or less intrinsically a type of rejection sampling, and while I've some ideas to make rejection sampling magically clever that may be appearing in Hypothesis at some point, it's always going to be better to generate the right thing by construction rather than by rejection - rejection sampling will tend to bias your test cases in weird ways.

This makes sense to me. Perhaps it'd make more sense to give this information to the generator, and have generators that are capable of creating data that models a particular distribution.
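
For instance, something like a composite strategy could express that today (the regions and weights below are illustrative, and per the docs quoted above, Hypothesis doesn't promise to preserve any particular distribution):

from hypothesis import given, strategies as st

@st.composite
def distributed_text(draw):
    # Pick a region first, then draw from it; repeating entries in the list
    # roughly approximates a 10% / 80% / 10% split.
    region = draw(st.sampled_from(["short"] + ["medium"] * 8 + ["long"]))
    if region == "short":
        return draw(st.text(max_size=9))
    if region == "long":
        return draw(st.text(min_size=1000))
    return draw(st.text(min_size=10, max_size=999))

@given(distributed_text())
def test_function_under_test(s):
    function_under_test(s)  # placeholder for the code under test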

Zac-HD commented 4 years ago

QuickCheck has worked around this issue by calculating how many tests it needs to run to gain statistical confidence that positives are not in fact false

The problem with that is you can end up running literally hundreds of times more tests than expected, which is a qualitative performance problem 😐

Zac-HD commented 4 years ago

Closing this issue as I don't think there's any specific action we need to take.

DRMacIver commented 4 years ago

FWIW it seems that QuickCheck has worked around this issue by calculating how many tests it needs to run to gain statistical confidence that positives are not in fact false (see checkCoverage for details).

Yeah, but there are a couple of difficulties with implementing this in Hypothesis. Main ones:

  1. We have fewer entry points than QuickCheck, which seems happy to just add a proliferation of different ones for different ways of running tests. We've just got @given, which is helpful for test runner integration and not overwhelming users with a battery of choices, and one of the promises that @given makes is that it won't run more than max_examples test cases.
  2. QuickCheck just runs lots of random samples which are all independent of each other, while Hypothesis has a much more concrete representation of the test case and makes decisions about what to do based on what has happened so far. For example, if we did implement a feature like this, it would make sense for Hypothesis to actively try to generate examples that hit a coverage target. This means that the specific test that QuickCheck uses is invalid, because we don't have independent samples.

The second point in particular makes a direct port of the feature kinda meaningless - we could still have it be an error if a purely random generation of the test satisfied the categorisation, but it's not clear why that would be a sensible thing to do if the tests you're actually seeing aren't actually generated that way.