Migrate our core representation to an IR layer

tybug commented 5 months ago

This epic-style issue tracks our work on refactoring Hypothesis to use an IR layer in our engine.

Motivation

So far, most things in Hypothesis have been built to work at the level of a bitstream.

Strategies draw bits from this bitstream to make choices or construct values while producing a return value (and in doing so "interpret the bitstream as a source of randomness", as the quote goes).
Inputs to a test function are represented internally as the bitstream that, when supplied to the test function, would generate that input.
Correspondingly, the database stores inputs as their bitstream representation.
DataTree, which tracks what inputs we have previously tried in order to avoid redundancy, works at the level of blocks — logically related continuous segments of bits, e.g. perhaps from the same strategy.
The shrinker tries to find the lexicographically ("" < "0" < "1" < "00" < "01" < "11") smallest bitstream which is still a counterexample.

However, in many cases, a bitstream is too low-level of a representation to make intelligent decisions.

For many strategies, the mapping of bitstream ↦ input is not injective, so the same input may have multiple bitstream representations. DataTree sees these as distinct inputs and can't deduplicate them. Ever wondered why we try 0 so many times for @given(st.integers())? It's not because we want to!
- This is the case for anything that requires rejection sampling. In particular, for drawing a (biased aka p ≠ 0.5) boolean, which is something we do extensively internally.
- See #1574 for a manifestation.
The shrinker has limited knowledge of the context of the bitstream it is shrinking. We do our best to give hints, for example by denoting subsets of the bitstream (called examples) as coming from a particular strategy, but it is easy for the shrinker to try inputs which are invalid and hard for the shrinker to make context sensitive shrinks.

In a completely unrelated train of thought, we would like Hypothesis to support backends: the ability to specify a custom distribution over strategies, overriding Hypothesis' pseudo-randomness. The original motivation here was supporting CrossHair (#3086), a concolic execution tool — but many other such backends are possible. (I personally have some ideas).

Happily, we can address both of these concerns with the same refactoring. That refactoring is migrating much of Hypothesis, which currently operates on bitstreams, to instead operate on an IR layer.

The Plan

The IR will be comprised of five nodes:

draw_integer
draw_float
draw_boolean
draw_string
draw_bytes

All strategies will draw from these five functions at the base level, rather than from a bitstream. From this, we get better DataTree deduplication (the mapping for arbitrary strategies is still not guarantee to be injective, but it's much closer!), more intelligent shrinking, and backend support.

To implement a backend, implement PrimitiveProvider and override each of these methods. That's it. Hypothesis will take care of the rest, including shrinking and database support.

original IR design described here https://github.com/HypothesisWorks/hypothesis/issues/3086#issuecomment-1774233444, though some small interface details have since changed.

Implementation

Completed:

initial refactorings
- 3788
- 3801
3818
3806
3899
3962
4007 (+ migrate generate_mutations_from)
4097

Ongoing work, roughly in order of expected completion:

[ ] finish migrating the shrinker
[ ] improve the representation of weights - see https://github.com/HypothesisWorks/hypothesis/pull/3929#discussion_r1529010146
[ ] migrate Optimiser (used by target())
[ ] migrate inquisitor (explain phase), see https://github.com/HypothesisWorks/hypothesis/issues/3864
[ ] migrate ParetoOptimiser
[ ] migrate database to serialized ir instead of buffers (https://github.com/HypothesisWorks/hypothesis/compare/master...Zac-HD:hypothesis:ir-serializer)

JonathanPlasse commented 5 months ago

This is super interesting! Thank you for writing this detailed issue. I would like to get involved with hypothesis. What would constitute a good first contribution here?

Zac-HD commented 5 months ago

Welcome, Jonathan! We'd love to have you continue contributing - I already really appreciate the type-annotation-improvements for our numpy and pandas extras, so this would be a third contribution 😻

@tybug might have some ideas here, but my impression is that the "refactor for an IR" project in this issue is more-or-less a serialized set of tasks and so adding a second person is unlikely to help much - even with just one we've had a few times where there were two or three PRs stacked up and accumulating merge conflicts between them.

As an alternative, https://github.com/HypothesisWorks/hypothesis/issues/3764 should be a fairly self-contained bugfix. On the more ambitious side, https://github.com/HypothesisWorks/hypothesis/issues/3914 would also benefit from ongoing work on that - testing, observability, reporting whatever bugs you surface, etc. Or of course you're welcome to work on any other open issue which appeals to you!

JonathanPlasse commented 5 months ago

Thanks, I will start with #3764 and then take on the different issue on #3914.

Zac-HD commented 5 months ago

We may still use the bitstream representation for some things (database?).

I was thinking that we'd still serialize to a bytestring - that's the ultimate interop format, and when we need to handle weird unicode and floats like subnormals or non-standard bitpatterns for nan I don't want to trust whatever database backend our users cook up to round-trip correctly. Existing formats like protobuf or msgpack all have constraints like "unicode strings must be valid utf-8" or "numbers limited to bits", so I wrote a custom serializer instead 🙂

tybug commented 5 months ago

yeah, this is a hard one to parallelize 😄. Some of the steps may subtly depend on others in ways that aren't obvious until one is knee deep in implementing it.

so I wrote a custom serializer instead 🙂

Nice! I agree with the reasoning here. Added a task for this. This probably needs to be the absolute last thing to switch to the ir.

Zac-HD commented 5 months ago

Definitely the last thing to switch, I just got nerdsniped 😅

Zac-HD commented 5 months ago

467ab23 (#3924) uses a nocover pragma to get the PR merged after I reduced our use of ParetoFront - I think it was tested mostly by accident before, but you'll have a better sense than I for where deliberate tests should go.

tybug commented 5 months ago

This should be covered by test_data_with_misaligned_ir_tree_is_invalid. I think the coverage there is just flaky because the condition is too permissive. Will address in #3923 by splitting the test

tybug commented 5 months ago

I'm working on migrating shrinker block programs. Our upweighting for large integer ranges is giving the shrinker trouble, because it means that a simpler tree can result in a longer buffer: the buffer runs through the weighted distribution and draws n bits from some small bucket, while the tree runs through the uniform distribution (as a result of forced=True) and draws m > n bits, where the difference in m and n is large enough that it offsets whatever simplification is made by the tree.

Real example of this:

b1 = b'\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x01\x00\x01\x00\x00\x00\x00'
b2 = b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
s = st.lists(st.integers(0, 2**40))

print("complex result, smaller buffer", ConjectureData.for_buffer(b1).draw(s))
# complex result, smaller buffer [0, 0, 0, 0, 0]
print("simpler result, larger buffer", ConjectureData.for_buffer(b2).draw(s))
# simpler result, larger buffer [0, 0, 0, 0]

As a result I'd like to look at moving that weighting logic into IntegerStrategy, which imo is where it logically belongs anyway, not at the ir layer. To accommodate this with weights, we'll need a structure that can express weights for entire ranges, not just "weight points and everything else is uniform". What do you think of weights=[(a, b, p), ...] where union((a, b), ...) == [min_value, max_value], sum(p) == 1, and len((a, b), ...) <= 255?

Zac-HD commented 5 months ago

What if we forced even more instead?

If we choose a smaller bits size, instead of drawing the main value from a narrower range we draw a value-to-force from the narrower range, and then force-draw it from the full range. The choice of fewer bits is then cleanly deletable without changing the interpretation of subsequent bits.

tybug commented 5 months ago

We could do that! I'm fairly confident exactly what you stated, or some small variation, would work.

I was thinking of killing two birds with one stone here, though. Do you think the upweighting belongs in the ir or in st.integers()? If we're going to move it out of the ir eventually anyway, I think now is the right time to do it, both while it's causing problems and we're changing the weights interface.

Zac-HD commented 5 months ago

I think doing it 'below' the IR, so we just represent a single integer value with a minimum of redundancy, is the principled approach here. "Literally just give me an integer" feels like it should be bijective 😅

tybug commented 5 months ago

The concern is that moving the weighting to st.integers() will result in drawing an integer correspond to more than one ir draw? I think we can avoid this via weights (and wouldn't want to move the weighting if we couldn't). I was thinking of something like this, where we combine the probability distributions upfront and pass it to weights. We wouldn't need to draw a boolean with p=7/8. Probability computations are pseudocode for whatever representation we use.

class IntegersStrategy(SearchStrategy):

    ...

    def do_draw(self, data):

        weights = None
        if self.end is not None and self.start is not None:
            bits = (self.end - self.start).bit_length()

            # For large ranges, we combine the uniform random distribution from draw_bits
            # with a weighting scheme with moderate chance.  Cutoff at 2 ** 24 so that our
            # choice of unicode characters is uniform but the 32bit distribution is not.
            if bits > 24:
                def weighted():
                    # INT_SIZES = (8, 16, 32, 64, 128)
                    # INT_SIZES_SAMPLER = Sampler((4.0, 8.0, 1.0, 1.0, 0.5), observe=False)
                    total = 4.0 + 8.0 + 1.0 + 1.0 + 0.5
                    return (
                        (4.0 / total) * (-2**8, 2**8),
                        # ...except split these into two ranges to avoid double counting bits=8
                        (8.0 / total) * (-2**16, 2**16),
                        (1.0 / total) * (-2**32, 2**32),
                        (1.0 / total) * (-2**64, 2**64),
                        (0.5 / total) * (-2**128, 2**128),
                    )
                weights = (
                    (7 / 8) * weighted()
                    + (1 / 8) * uniform()
                )

            # for bounded integers, make the near-bounds more likely
            weights = (
                weights
                + (2 / 128) * self.start
                + (1 / 64) * self.end
                + (1 / 128) * (self.start + 1)
                + (1 / 128) * (self.end - 1)
            )
            # ... also renormalize weights to p=1, or have the ir do that

        return data.draw_integer(
            min_value=self.start, max_value=self.end, weights=weights
        )

Now the ir draw_integer is truly uniform, but st.integers() keeps the same distribution as before.

Zac-HD commented 5 months ago

That would work! I'm also fine with the IR draw_integer remaining non-uniform above 24 bits, if that's easier.

tybug commented 9 hours ago

I'm working on a native ordering for the IR (wip branch). My current plan is to have a bijective map ir_ordering: (value: IRType) <-> (order: int). The order depends on the kwargs of the node and order = 0 indicates the simplest value for that node.

This will replace some ad-hoc constructs:

node.trivial becomes ir_ordering(node, to="index") == 0
all_children becomes for i in range(compute_max_children(node)): yield ir_ordering(i, to="value")

We can also take advantage of this ordering as a unified representation to work over when convenient, just like the bytestring was. I plan to use this ordering to migrate Optimiser to the IR until/if we add ir-specific mutations, and to replace our shrinker misalignment logic, which currently uses the bytestring as an intermediary.

Two things:

I'm very open to feedback on this structure or plan in general, though I'm also relatively confident in this plan in the absence of any feedback
I have stopped work here at defining the ordering on floats (and likely won't return for a few weeks). This is our opportunity to redefine the ordering on floats, free from any shrinking or byte restrictions. I think there are some things here that don't make sense:

>>> sorted([0.01 * n for n in range(100)], key=float_to_lex)
[0.0, 0.5, 0.75, 0.8300000000000001, 0.54, 0.79, 0.71, 0.96, 0.52, 0.77, 0.56, 0.81, 0.73, 0.98, 0.51, 0.76, 0.72, 0.97, 0.55, 0.8, 0.53, 0.78, 0.74, 0.99, 0.5700000000000001, 0.8200000000000001, 0.59, 0.84, 0.67, 0.92, 0.63, 0.88, 0.61, 0.86, 0.6900000000000001, 0.9400000000000001, 0.65, 0.9, 0.68, 0.93, 0.6, 0.85, 0.64, 0.89, 0.7000000000000001, 0.9500000000000001, 0.62, 0.87, 0.58, 0.66, 0.91, 0.25, 0.27, 0.48, 0.26, 0.28, 0.49, 0.38, 0.36, 0.4, 0.39, 0.37, 0.41000000000000003, 0.42, 0.46, 0.44, 0.43, 0.47000000000000003, 0.45, 0.34, 0.3, 0.32, 0.35000000000000003, 0.31, 0.29, 0.33, 0.24, 0.13, 0.14, 0.19, 0.18, 0.2, 0.21, 0.23, 0.22, 0.17, 0.15, 0.16, 0.12, 0.07, 0.09, 0.1, 0.11, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
>>>

like 0.54 < 0.1, and all of the last ~10 values not being smaller.

I guess a starting point is, how should we order [0, 1)? We don't have to go crazy with this – the ordering only has an impact insofar as the shrinker can intelligently match it[^1], there are just too many floats – but @Zac-HD I'm curious if you've thought about a good ordering before 🙂. I'm tempted to say [0] + [0.1 * n for n in range(10)] + [0.05 * n for n in range(20)] + [0.025 * n for n in range(40)] + ... (ignoring duplicates), but I don't know how this would hold up against exponent/mantissa realities of floats.

[^1]: I'm realizing that our ordering and shrinker are strongly decoupled in the IR, and to see benefits both need to be updated. There's no point to defining an intelligent and complicated ordering on floats if the shrinker never tries (ordering-)smaller floats.

HypothesisWorks / hypothesis