HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works

Improve structure of `timezones()` strategies #2414

Open Zac-HD opened 4 years ago

Zac-HD commented 4 years ago

This issue was prompted by @pganssle's fantastic review of #2392 (comment): between an expert commentary and my own knowledge that Ireland has a negative DST offset, I wrote a much more targeted test and exposed a problem with affected pytz timezones.

Ideally the obvious test written by a naive user would also discover such problems. #69 describes half the trick; the other half is to preferentially generate 'tricky' timezones. I think this could be as simple as changing from sampling from a single flat list to sampling from several subsets of the available timezones. For example:
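A minimal sketch of what such grouped sampling might look like (the group names and members here are purely illustrative, not an actual categorisation):

```python
from hypothesis import strategies as st

# Illustrative groups only -- the real groups would be computed at runtime
# from the timezone database available on the machine.
simple = ["UTC", "Africa/Abidjan"]               # fixed or near-fixed offsets
ordinary = ["America/New_York", "Asia/Tokyo"]
weird = ["Europe/Dublin", "Africa/Casablanca"]   # negative DST, shifting rules

# one_of() picks a branch before drawing, so the small `weird` group is
# sampled far more often than its share of one flat list would suggest.
tz_keys = st.one_of(
    st.sampled_from(simple),
    st.sampled_from(ordinary),
    st.sampled_from(weird),
)
```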

Of course the purpose of this structure is to put a disproportionate emphasis on entries from the weird-stuff category. To make this more fun, we can't hard-code a list because timezones are subject to change at short notice and can vary between machines - so we need to compute the groups each time Hypothesis runs.

This is low on my list of priorities for now, not least because it isn't that useful without #69, but it would be nice to introduce it along with a PEP-615 timezones strategy in September :slightly_smiling_face:

Zac-HD commented 4 years ago

BPO-40536 / https://github.com/pganssle/zoneinfo/pull/60 adds zoneinfo.available_timezones()

The zoneinfo property tests are also interesting reading.

pganssle commented 4 years ago

(Warning: Post contains wall of text)

  • @pganssle - I understand there are only about 600 keys times maybe three directories, right?

So it seems:

>>> import zoneinfo
>>> len(zoneinfo.available_timezones())
595

Most of those are not unique zones, mind you; there are a lot of aliases, and you could pretty easily detect that (on my system they're installed with hard links):

>>> import os
>>> def get_inode(key):
...     return os.stat(os.path.join(zoneinfo.TZPATH[0], key)).st_ino
... 
>>> len(set(map(get_inode, zoneinfo.available_timezones())))
388

That said, an advantage hypothesis has over exhaustive testing is its test-case minimizer. Luckily, the first alphabetical zone, Africa/Abidjan, happens to be super simple: one transition from LMT → GMT, then it's fixed GMT, so the minimizer has actually proven more useful than you'd think (though, admittedly, I set the order myself, so I could just loop over them in that order and fail on the first one...)

Having a reasonable (even if it's based on some heuristics) basis for minimization would be very useful.

  • the comparative testing thing means we'd need to do some serious contortions to have the logic described here and in #69 actually execute; probably not worth it.

I'm not sure what you mean by "the comparative testing thing", or why it would be disqualifying.

Anyway, I think it may not be so bad to use a mostly hard-coded ordering (possibly move the data for this ordering into an extras package that can be updated independently of the hypothesis code, but honestly these keys are pretty stable):

  1. UTC: pretty sure this key will never change
  2. Anything in Etc: This is a legacy folder that contains a bunch of fixed offsets.
  3. Super simple zones like Africa/Abidjan (it's not hard to write a script that finds these, though it's not something I'd want to ship in a library)
  4. Everything not covered by the other categories
  5. Especially weird zones like Africa/Casablanca, Europe/Dublin, Europe/Lisbon. Possibly Europe/London as well (since it gets tricky around the +0/+1 stuff).

You can hard-code a list of "super simple" and "especially weird" and then filter the lists by key in zoneinfo.available_timezones().
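A sketch of that filtering step, assuming hypothetical hard-coded seed lists (the real lists would be longer, and generated by a script):

```python
import zoneinfo

# Hypothetical seed lists -- stand-ins for the real hard-coded data.
SIMPLE = {"UTC", "Africa/Abidjan"}
WEIRD = {"Africa/Casablanca", "Europe/Dublin", "Europe/Lisbon", "Europe/London"}

available = zoneinfo.available_timezones()
etc = {key for key in available if key.startswith("Etc/")}
simple = SIMPLE & available   # drop any key not present on this machine
weird = WEIRD & available
# Everything else lands in the middle "ordinary" category.
ordinary = available - etc - simple - weird
```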

Another option for a generated list of tricky zones is to just parse the files yourself (it's easier than it seems — I've written versions of this parser several times now; plus, it doesn't really matter if you get it wrong, since the order was essentially arbitrary before anyway).

Although zoneinfo doesn't actually expose any way to map a given IANA key to a file or resource, PEP 615 lays out the algorithm used to find it: you execute a search of zoneinfo.TZPATH, then fall back to tzdata. In the next major version of dateutil, we'll switch over to using PEP 615's logic as well (including a dependency on the tzdata module and, where possible, the same search path). Lots of information in there that can be used to infer some metric of "complexity". And if you want to be "lazy" about it and not have to write a TZif parser, there's a little trick that the eponymous Olson showed me on Twitter:

$ tail -n 1 /usr/share/zoneinfo/Australia/Canberra
AEST-10AEDT,M10.1.0,M4.1.0/3

For reasons I get into a bit in this SO answer, this isn't a perfect representation of the zone (it won't pick out things like the Europe/Lisbon DST elimination, or any of the "missing" or "doubled" days, for example), but it can provide a not-unreasonable heuristic for things like "has DST" and "has negative DST".
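The same trick can be done in Python without writing a full TZif parser, by locating the file through a search of `zoneinfo.TZPATH` (as PEP 615 describes; the `tzdata`-package fallback is omitted here) and reading the trailing POSIX TZ string. A sketch, not a robust parser:

```python
import os
import zoneinfo

def tz_footer(key):
    """Return the trailing POSIX TZ string of a TZif file, or None."""
    for base in zoneinfo.TZPATH:
        path = os.path.join(base, key)
        if not os.path.exists(path):
            continue
        with open(path, "rb") as f:
            data = f.read()
        # TZif v1 files (version byte \x00) have no footer.
        if data[:4] != b"TZif" or data[4:5] == b"\x00":
            return None
        # v2+ files end with "\n<POSIX TZ string>\n".
        if not data.endswith(b"\n"):
            return None
        return data[:-1].rpartition(b"\n")[2].decode("ascii")
    return None

def has_dst(footer):
    # A footer like "AEST-10AEDT,M10.1.0,M4.1.0/3" carries transition
    # rules after the commas; a fixed-offset footer like "UTC0" has none.
    return bool(footer) and "," in footer
```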

At some point I will have time to actually work on some of this stuff, but in the very likely event that that point is far in the future, I hope these notes are useful.

Zac-HD commented 4 years ago

Thanks Paul! Walls of text are appropriate and useful here :smile:

  • Having a reasonable (even if it's based on some heuristics) basis for minimization would be very useful.

That's the other benefit of the proposal in this issue - with your five-part categorisation (:heart:) we'll always shrink to UTC or to a fixed offset if possible.

Concretely, I think my proposal is to write (some scripts which generate) a list of groups of keys such as Australia/Canberra - either exactly as specified above, or perhaps splitting (5) into further groups. At runtime we then iterate through zoneinfo.available_timezones(), sorting each key into the appropriate sublist, with one final list for anything that isn't in our hardcoded order - custom zones, new zones, etc. Then the strategy is simply one_of([sampled_from(ls) for ls in ...]), and we're done.
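Under those assumptions, the runtime assembly might be sketched like this (`GROUPS` here is a tiny stand-in for the generated lists):

```python
import zoneinfo
from hypothesis import strategies as st

# Stand-in for the generated group lists; the real ones would follow the
# five-part categorisation above and be produced by a script.
GROUPS = [
    ["UTC"],
    ["Africa/Abidjan"],                      # super simple
    ["Europe/Dublin", "Africa/Casablanca"],  # especially weird
]

def timezone_keys():
    available = zoneinfo.available_timezones()
    # Keep only the keys that actually exist on this machine.
    sublists = [[key for key in group if key in available] for group in GROUPS]
    # One final list for anything not in the hardcoded order:
    # custom zones, new zones, etc.
    known = {key for group in GROUPS for key in group}
    sublists.append(sorted(available - known))
    return st.one_of([st.sampled_from(ls) for ls in sublists if ls])
```

Shrinking then naturally moves toward the earlier, simpler groups, so failures minimize to UTC or a fixed offset whenever possible.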