Generating test data without using @given decorator

karlicoss commented 10 months ago

What I want to achieve: I'm trying to use hypothesis to generate large amounts of randomized test data. I'm not trying to use it for tests; I just want to use it in a script. I found that I can use the .example() method of a strategy to generate data. I have intentionally simplified my use case, so let's say we want to generate 1000 integers:

TOTAL = 1000
minint = 0
maxint = 2 ** 31

from hypothesis.strategies import lists, integers
gen = lists(integers(min_value=minint, max_value=maxint), min_size=TOTAL, max_size=TOTAL)
ints = gen.example()
assert len(ints) == TOTAL  # just to check

This works, but I have two issues:

1. It takes noticeable time to run (about 10 seconds). If I use custom code with random.Random.randint to generate 1000 integers, it completes instantly, as expected. If I use hypothesis via @given, defining a test, etc., it also runs instantly. I don't understand why there is such a performance difference.
2. I couldn't find a way to force it to use a fixed random seed (this makes sense in my case, as I am interested in data generation rather than fuzzing or finding a minimal failing example). I tried using register_random, but it had no effect.

So the questions are:

1. Why is this bit of code so slow? I looked at the source, and it seems there could be some overhead from extra filtering and similar machinery (even when no filters are defined, as in my case), but I wouldn't expect it to be that slow.
2. Is this a completely wrong way to use Hypothesis? It feels like it could be useful to benefit from the Hypothesis machinery for data generation without necessarily using decorators, etc.

Apologies if it's not the best forum to ask -- I did read the docs and searched through the source code but couldn't really figure this out. Thanks!

Zac-HD commented 10 months ago

@given() is the only way to draw data from strategies - the .example() method just wraps that up for you internally! Supporting meaningfully different interfaces just isn't technically feasible with our limited volunteer time 🙁

For determinism and number of examples, you'll want to use @settings(max_examples=..., derandomize=True).
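A minimal sketch of what that could look like in a script, assuming each generated example is collected into a list as a side effect of the decorated function. The names generate_with_hypothesis and collect are hypothetical, and database=None and deadline=None are extra settings added here for script use, not something suggested in this thread:

```python
from hypothesis import given, settings
from hypothesis.strategies import integers


def generate_with_hypothesis(count, min_value=0, max_value=2 ** 31):
    """Collect the integers Hypothesis draws while running up to `count` examples.

    Hypothetical helper for illustration; not part of the Hypothesis API.
    """
    collected = []

    # max_examples bounds how many examples are run; derandomize=True asks
    # Hypothesis to derive its seed from the test function instead of
    # using fresh randomness on every run.
    @settings(max_examples=count, derandomize=True, database=None, deadline=None)
    @given(integers(min_value=min_value, max_value=max_value))
    def collect(value):
        collected.append(value)

    collect()  # a @given-decorated function can be called like a normal function
    return collected


values = generate_with_hypothesis(100)
print(len(values), values[:5])
```

The number of collected values may not match count exactly, and the values are not uniformly distributed, so this is best treated as a starting point rather than a drop-in data generator.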

It's slower than plain random.randint() because we're doing much more under the hood which is useful in testing. If your data is simple that's probably a poor tradeoff; if it's complex then the convenient API probably wins out and the performance gap will be smaller.
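For comparison, the plain-random baseline from the original question can be sketched with only the standard library. The helper name generate_ints and the seed parameter are illustrative assumptions; a fixed seed makes the output reproducible:

```python
import random

TOTAL = 1000
MIN_INT = 0
MAX_INT = 2 ** 31


def generate_ints(total=TOTAL, seed=0):
    """Return `total` integers in [MIN_INT, MAX_INT], reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [rng.randint(MIN_INT, MAX_INT) for _ in range(total)]


ints = generate_ints()
assert len(ints) == TOTAL  # just to check
```

This draws uniformly over the range, which is a different distribution from what Hypothesis produces, so the two approaches are not interchangeable for every purpose.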

Finally, I'll note that Hypothesis' data is drawn from a really weird distribution, full of edge cases and weird correlations. That's great for finding bugs, but may or may not be what you want here. If not, I've heard good things about the mimesis library for non-testing use cases (but haven't used it myself). I hope that helps!

karlicoss commented 10 months ago

Thanks for such a quick response, this helps!

tybug commented 10 months ago

Just to answer a concrete question: .example() in your case is slow because it generates and caches 100 examples ahead of time, not just one: https://github.com/HypothesisWorks/hypothesis/blob/226268c9acccc68de89308741151116c9c899256/hypothesis-python/src/hypothesis/strategies/_internal/strategies.py#L327-L340

HypothesisWorks / hypothesis, issue #3790