HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works

Should it be possible to reuse `given` data between tests? #114

Closed: radix closed this issue 6 years ago

radix commented 9 years ago

Say I have a very complicated object to generate (some game world) and dozens of small tests that work on a game world. The vast majority of the time is spent in generating worlds for the tests and very little time is spent in the tests. I can make them run a lot faster by doing something like

@given(game_worlds)
def test_all(world):
    _test_a(world)
    _test_b(world)
    ....

but this has downsides: my unittest runner gives me less useful output, and Hypothesis can't keep track of which falsifying example is associated with which individual test.

It'd be nice if I could share a world to be tested by many different properties, while still having the benefit of per-test falsifying examples stored in the database. Maybe this has problems that make it infeasible or otherwise a bad idea.

Perhaps something like:

shared_game_worlds = game_worlds.shared()

@given(shared_game_worlds)
def test_a(world): ...

@given(shared_game_worlds)
def test_b(world): ...

Thank you for Hypothesis :)

DRMacIver commented 9 years ago

Doing something along these lines definitely seems reasonable to support. I'm not totally sold on the specific suggestion of doing it at the strategy level (it seems harder than some of the alternatives), but it's definitely neither infeasible nor a bad idea. The easiest version is probably to give you an "always save examples which satisfy assumptions in the database" mode to run in.

That being said, I doubt I'll do it soon, so here are some specific tips for your situation:

Firstly, tests written using Hypothesis are typically a bit longer than their corresponding QuickCheck implementations. I'm not sure why, but I suspect it's some combination of Python testing culture, a tendency toward more non-uniform interfaces to test, and longer tests looking more natural in the language. So you might find it productive to combine some of your tests.

Secondly, strategies built up through the standard combinators generally shouldn't be too slow (though there are plenty of things you can do that end up asking Hypothesis for huge amounts of data). So it's quite possible that you've just hit a performance bug. If you show me the strategy you're using (I think I saw an earlier version of it) I can take a look. The most likely culprit is that if you have nested collections you may wish to turn down the size of the interior values (either with an average_size or max_size parameter).
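
A minimal sketch of that kind of tuning (the strategies here are hypothetical stand-ins, not radix's actual ones; average_size existed at the time of this thread but was removed in later Hypothesis releases, so only max_size is used):

from hypothesis import given, strategies as st

# Unbounded nested collections can end up asking Hypothesis for huge data;
# capping the interior collections keeps generation cheap and examples readable.
location_names = st.text(min_size=1, max_size=10)
locations = st.lists(location_names, min_size=1, max_size=5)
worlds = st.lists(locations, max_size=5)

@given(worlds)
def test_worlds_stay_small(world):
    assert len(world) <= 5
    assert all(len(locs) <= 5 for locs in world)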

DRMacIver commented 9 years ago

http://www.drmaciver.com/2015/07/notes-on-hypothesis-performance-tuning/ may also be useful for you, but it's mostly targeted at people for whom strategy performance isn't the problem.

radix commented 9 years ago

I'm not totally sold on the specific suggestion of doing it at the strategy level (it seems harder than some of the alternatives)

Yeah, it occurred to me that the only way I can think of implementing .shared() would be to generate all the examples ahead of time and return them encoded in the result of shared(), since Hypothesis can't control the order in which tests are run. This basically means keeping them in memory forever. Maybe that's okay for some use cases, but I guess a lot of the people who have problems with the time it takes to generate objects are probably generating big objects.

The easiest version is probably to give you an "always save examples which satisfy assumptions in the database" mode to run in.

I don't really know how the database works, but that sounds like it's probably a better idea.

Firstly, tests written using Hypothesis are typically a bit longer than their corresponding QuickCheck implementations. I'm not sure why, but I suspect it's some combination of Python testing culture, a tendency toward more non-uniform interfaces to test, and longer tests looking more natural in the language. So you might find it productive to combine some of your tests.

Yeah, I generally dislike the large size of typical unit tests in Python ;-)

Secondly, strategies built up through the standard combinators generally shouldn't be too slow (though there are plenty of things you can do that end up asking Hypothesis for huge amounts of data). So it's quite possible that you've just hit a performance bug.

Actually, I think it's most likely because the objects I'm building up are lots and lots of pyrsistent objects, which are fairly fast but of course not as fast as built-in data structures. That's probably a large part of why things are pretty slow.

If you show me the strategy you're using (I think I saw an earlier version of it) I can take a look. The most likely culprit is that if you have nested collections you may wish to turn down the size of the interior values (either with an average_size or max_size parameter).

Here's the strategy:

https://gist.github.com/radix/1124c4c60e4616b791da

  1. I realized while putting that gist together that the way I'm building locations is (still) dumb, since it can accidentally create a location with the same name twice. Since they're later put into a dict keyed by name, that means I'm basically doing some work that is going to be overwritten if some location names clash.
  2. I set average and max sizes for collections -- mostly just so the examples were tractable, since trying to debug my game states was becoming pretty difficult with the tons and tons of goop I was getting printed out :)
  3. If you have any other feedback about improving those strategies, I would massively appreciate hearing it.

Just as a little background: I'm not personally blocked by any performance issues with strategy generation; this kind of example-sharing just seemed like a good idea, since it seems like there's not a lot of benefit in generating different samples for different tests (at least in this kind of scenario). This code I'm working on is actually for a screencast I'm creating on Functional Programming in Python for O'Reilly, where the running example is a trivial text adventure game. So this isn't a serious app, just an educational example.

DRMacIver commented 9 years ago

OK. So my first specific suggestion is that you're using flatmap a lot. This is probably the source of a lot of your performance issues. Its performance isn't terrible but it will typically be significantly worse than trying to produce the same thing without flatmap.

One specific instance of this is that you're using s.flatmap(lambda x: just(f(x))) a lot. This is the same as s.map(f), which should work a lot better.
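
A hedged sketch of that rewrite (locations and make_world are made-up names, not taken from the gist):

from hypothesis import strategies as st

locations = st.lists(st.text(min_size=1), min_size=1)

def make_world(locs):
    return {"locations": locs}

# These two strategies produce the same values, but the flatmap version
# builds an extra just(...) strategy for every example drawn:
worlds_via_flatmap = locations.flatmap(lambda locs: st.just(make_world(locs)))
worlds_via_map = locations.map(make_world)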

As mentioned on IRC, you definitely shouldn't be doing just(choice(...)). This is basically guaranteed to cause bugs. That's what sampled_from is for.
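
To spell out the difference (directions is a hypothetical example list):

import random
from hypothesis import strategies as st

directions = ["north", "south", "east", "west"]

# Buggy: random.choice runs once, when the strategy is defined, so every
# example gets the same value and Hypothesis can neither vary nor shrink it.
direction_bad = st.just(random.choice(directions))

# Correct: Hypothesis draws (and can shrink) an element for each example.
direction_good = st.sampled_from(directions)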

radix commented 9 years ago

@DRMacIver thanks, I noticed flatmap in the docs before I saw map and so I never even realized map existed (my fault; they're in the right order in the docs, I was just jumping around a lot while I was trying to learn stuff). I've also switched to sampled_from. Thanks. :)

thedrow commented 9 years ago

If you use pytest, you could just share the strategy with a fixture, if we add support for it in the plugin.

DRMacIver commented 9 years ago

That won't do anything useful. Strategies aren't stateful in that way. You can also just share the strategy as a top-level global value.
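
For instance (a sketch, with a placeholder strategy standing in for the real game-world strategy), a module-level strategy can be reused by every test, although each test still draws its own examples:

from hypothesis import given, strategies as st

# Placeholder for the real game-world strategy.
game_worlds = st.builds(dict, name=st.text(min_size=1))

@given(game_worlds)
def test_a(world):
    assert "name" in world

@given(game_worlds)
def test_b(world):
    assert world["name"]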

thijsvandien commented 7 years ago

@DRMacIver, I've sent you an email on May 20 about potentially funding this issue.

Zac-HD commented 6 years ago

It'd be nice if I could share a world to be tested by many different properties, while still having the benefit of per-test falsifying examples stored in the database. Maybe this has problems that make it infeasible or otherwise a bad idea.

Since this issue was opened, Hypothesis' internal model has changed substantially.

Now, there are basically two approaches to this, with distinct problems:

  1. Store the values generated by the strategy, and reuse those. Unfortunately this is not possible in the general case, as the possibility of mutation would break all our core invariants.
  2. Store the example buffer, and reuse that for tests using the same strategy. This would work, but would save very little time: almost all of the time taken to generate data is spent converting the internal buffer into an actual value. In particular, this would not fix the issue above, and storing it in the database is often slower than generating afresh.

So while it's technically possible for at least some cases, it would not actually help performance :disappointed:
(upside: Hypothesis is also faster than it used to be, so there's less need for it :smile:)

thijsvandien commented 6 years ago

  1. Store the values generated by the strategy, and reuse those. Unfortunately this is not possible in the general case, as the possibility of mutation would break all our core invariants.

Would it be too expensive for Hypothesis to check if any mutation took place (with some hashing I suppose) and automatically regenerate the examples only if needed?

Zac-HD commented 6 years ago

Between unhashable types and custom implementations of __hash__, I wouldn't like to try!
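
A small made-up illustration of the problem: a type whose __hash__ ignores its mutable state hashes the same before and after mutation, so a hash check would miss it entirely (and plain lists or dicts aren't hashable at all):

class Inventory:
    """Hypothetical game object whose hash is based only on its id."""

    def __init__(self, ident, items):
        self.ident = ident
        self.items = items

    def __hash__(self):
        return hash(self.ident)

inv = Inventory(1, ["sword"])
before = hash(inv)
inv.items.append("shield")    # the test mutated the example...
assert hash(inv) == before    # ...but its hash is unchanged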

Personally, I'd do some combination of more hardware, fewer examples per run, and consolidating tests (as they should all pass normally, there's no disadvantage to this unless they fail).

For more specific advice, David and I both offer Hypothesis-related consulting for non-open-source projects, so feel free to email us with more detail about your problem.