HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works

datetimes() strategy feature request: Parameter(s) to control granularity? #1203

Closed · PiDelport closed this issue 6 years ago

PiDelport commented 6 years ago

Sometimes, you're only interested in datetimes up to a given granularity, such as seconds, minutes, or hours, rather than microseconds.

Currently, this can be done by truncating the generated values to the desired granularity, using something like:

```python
datetimes().map(lambda dt: dt.replace(microsecond=0))
```

However, it would be nicer and probably more efficient if the strategy had a parameter that makes it easy to specify what level of granularity is desired.

Zac-HD commented 6 years ago

What's the use-case for this? I can see that controlling the granularity could be useful, but I also really like the simplicity of the current API and the map-replace tactic will actually work really well :smile:

Currently I'm inclined to just document the replacement trick, and note that it's a nice and efficient way to do this.

PiDelport commented 6 years ago

@Zac-HD: The map-replace trick gets unwieldy. For example, if you want hour granularity, you're already up to:

```python
datetimes().map(lambda dt: dt.replace(minute=0, second=0, microsecond=0))
```

Something akin to datetimes(granularity=timedelta(hours=1)) would be quite a bit nicer and more obvious, I think. It would also let you easily specify granularities such as 30 or 15 minutes (for systems that operate with those kinds of expected inputs), which you cannot easily do with the replace() hack.
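For comparison, here's a rough sketch of what 15-minute granularity looks like without such a parameter (the rounding-down arithmetic and the name quarter_hour_datetimes are just illustrative, not anything Hypothesis provides):

```python
from hypothesis.strategies import datetimes

# Hypothetical workaround: round each generated datetime down to the
# nearest quarter hour. Workable, but much less obvious than a
# granularity parameter would be.
quarter_hour_datetimes = datetimes().map(
    lambda dt: dt.replace(minute=dt.minute - dt.minute % 15,
                          second=0, microsecond=0)
)
```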

Also, doesn't it interfere with generation / shrinking efficiency when increasingly bigger parts of the input space all get mapped to the same output value?

Zac-HD commented 6 years ago

Makes sense! I've labelled this as a good issue, because it's reasonably self-contained and doesn't require deep knowledge of internals. Contributions are also welcome, of course, if you'd like to give it a go :smile:

> Something akin to datetimes(granularity=timedelta(hours=1)) would be quite a bit nicer and more obvious, I think. It would also let you easily specify granularities such as 30 or 15 minutes (for systems that operate with those kinds of expected inputs), which you cannot easily do with the replace() hack.

I do like using a timedelta for the granularity, but... what should we do if passed timedelta(microseconds=VERY_LARGE_PRIME)?

My best idea so far would be to insist that granularity must have units of hours, minutes, seconds, or microseconds; and that the value must be an integer factor of the next unit up. This will be fiddly to implement, but not especially difficult.
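A rough sketch of how that restriction might be validated; check_granularity is a hypothetical helper, and the unit table is one reading of "next unit up", not anything in Hypothesis:

```python
from datetime import timedelta

# (microseconds per unit, how many of that unit make up the next unit up)
_UNITS = [
    (1, 1_000_000),          # microsecond -> 1_000_000 per second
    (1_000_000, 60),         # second      -> 60 per minute
    (60_000_000, 60),        # minute      -> 60 per hour
    (3_600_000_000, 24),     # hour        -> 24 per day
]

def check_granularity(granularity: timedelta) -> None:
    """Accept only a whole, positive number of one unit that evenly
    divides the next unit up; reject everything else."""
    total_us = granularity // timedelta(microseconds=1)
    for unit_us, per_parent in _UNITS:
        if total_us % unit_us == 0:
            count = total_us // unit_us
            if count > 0 and per_parent % count == 0:
                return  # e.g. timedelta(minutes=15): 60 % 15 == 0
    raise ValueError(f"granularity={granularity!r} is not a divisor of any unit")
```

Under this rule timedelta(minutes=15) would be accepted, while timedelta(minutes=17) and timedelta(microseconds=VERY_LARGE_PRIME) would be rejected.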

> Doesn't it interfere with generation / shrinking efficiency when increasingly bigger parts of the input space all get mapped to the same output value?

Strictly speaking, yes, but it doesn't matter much while generating data (estimated <1% performance impact), and consistently irrelevant parts of the input get optimised out very quickly when shrinking. Generally I would optimise for readability over the small performance gain in a case like this.

PiDelport commented 6 years ago

> what should we do if passed timedelta(microseconds=VERY_LARGE_PRIME)?

I don't think this should affect much, at least with the most naive strategy I can think of: something like base_datetime + granularity * random_integer, where random_integer is bounded such that the result falls within the min_value / max_value range?
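A minimal sketch of that base-plus-offset idea (granular_datetimes is a hypothetical name, and bounding the step count by floor division is an assumption):

```python
from datetime import datetime, timedelta
from hypothesis import strategies as st

def granular_datetimes(min_value: datetime, max_value: datetime,
                       granularity: timedelta):
    """Hypothetical base-plus-offset strategy: draw a whole number of
    granularity steps and add them to the lower bound."""
    steps = (max_value - min_value) // granularity  # timedelta floor division
    return st.integers(min_value=0, max_value=steps).map(
        lambda n: min_value + n * granularity
    )
```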

Zac-HD commented 6 years ago

We can certainly calculate an output under those conditions; I'm just concerned that it might surprise users - and producing the expected output is pretty important to us!

For example: should granularity=timedelta(minutes=17) be allowed to produce the time 1:08am? What about 3:52pm? or 10:14am? Confusing! (to demonstrate: pick the odd time out without a calculator...)
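(With a calculator, and assuming the offsets count from a midnight base:)

```python
# Which of these are a whole number of 17-minute steps after midnight?
for h, m in [(1, 8), (15, 52), (10, 14)]:
    minutes = h * 60 + m
    print(f"{h:02d}:{m:02d} -> {minutes} min, multiple of 17: {minutes % 17 == 0}")
# 01:08 ->  68 min, multiple of 17: True
# 15:52 -> 952 min, multiple of 17: True
# 10:14 -> 614 min, multiple of 17: False
```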

We also draw each part of a datetime separately, rather than treating them as offsets from a baseline, to get nicer shrinking behaviour, so the implementation is a lot cleaner with the more restrictive rule.

PiDelport commented 6 years ago

For granularities like that, letting the user customise the base time should be a simple and sensible thing to do, I guess?

Zac-HD commented 6 years ago

Not keen to add another argument, and it still risks confusion - and it's much easier to relax the rules later than to tighten them. Do you have a concrete use-case that my restricted proposal wouldn't support?

PiDelport commented 6 years ago

I'm not sure… I'm not confident I understand how that proposal would work, while the base + offset approach seems simpler to me in both implementation and behaviour (which matters to me from a user's perspective).

Maybe it's simpler to have this as an alternative datetime strategy, rather than trying to munge it into the current datetimes one?

bukzor commented 6 years ago

I would suggest a cron-style configuration: datetimes(hours=(0, 12, 23), microseconds=0). I think this is simpler and will provide less surprising results.

Perhaps this would be shorthand for (or should instead be designed as?) datetimes(hours=sampled_from((0, 12, 23)), microseconds=just(0)).

Zac-HD commented 6 years ago

The alternative datetimes strategy already exists - it's spelled builds(datetime.datetime, **kwargs)!
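For instance, a sketch of that spelling covering bukzor's example above (the year/month/day bounds here are arbitrary illustrative values):

```python
from datetime import datetime
from hypothesis import strategies as st

# Cron-style, field-by-field control via builds(); day is capped at 28
# so the result is valid in every month.
cron_style_datetimes = st.builds(
    datetime,
    year=st.integers(2000, 2030),
    month=st.integers(1, 12),
    day=st.integers(1, 28),
    hour=st.sampled_from([0, 12, 23]),
    minute=st.just(0),
    second=st.just(0),
)
```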

Ultimately, I just think that the reduction in API clarity (due to the explosion of options) outweighs the gains in expressive power from the implementation, when .map or builds() allow users to construct the specific strategy that they want.

bukzor commented 6 years ago

I didn't think of that... although in hindsight it's obviously true. Perhaps the best thing would be a Recipes section à la Itertools Recipes.