Quansight-Labs / ndindex

A Python library for manipulating indices of ndarrays
https://quansight-labs.github.io/ndindex/
MIT License

Hypothesis not finding failing examples #77

Open asmeurer opened 4 years ago

asmeurer commented 4 years ago

At https://github.com/Quansight/ndindex/pull/76 (particularly the most recent commit, https://github.com/Quansight/ndindex/pull/76/commits/2006911d291f5af0681b64bd9ba6b96eed5d5eec), I implemented the logic for as_subindex wrong. But the hypothesis test for as_subindex just isn't finding a failing example. I even ran it with --max-examples=1000000, which took 2 hours to run, and nothing.

I tried reducing the broad ndindices strategies to strategies that I knew would encompass the bug. I got nothing until I narrowed it down to just

# NB: tuples, slices, and _doesnt_raise here are ndindex's own test helpers
# (see ndindex/tests/helpers.py, linked below); hypothesis.strategies.tuples
# does not take min_size/max_size.
shapes = tuples(integers(0, 10)).filter(
             # numpy gives errors with empty arrays with large shapes.
             # See https://github.com/numpy/numpy/issues/15753
             lambda shape: prod([i for i in shape if i]) < 1000)

_integer_arrays = arrays(intp, shapes, elements=integers(0, 10))

Tuples_of_slices = tuples(one_of(
    slices(),
), min_size=2).filter(_doesnt_raise)

Tuples_of_integerarrays = tuples(one_of(
    _integer_arrays,
), min_size=2, max_size=32).filter(_doesnt_raise)

@given(Tuples_of_slices, Tuples_of_integerarrays, shapes)
def test_as_subindex_hypothesis(idx1, idx2, shape):
    ...

(compare that to the default ndindices and shapes strategies https://github.com/asmeurer/ndindex/blob/2006911d291f5af0681b64bd9ba6b96eed5d5eec/ndindex/tests/helpers.py). And even then I still had to run with --max-examples=1000.

A shrunk example that fails is

    idx1=(slice(None, 1, None), slice(None, 1, None))
    idx2=(array([0]), array([0]))
    shape=(1, 1)

@Zac-HD do you have any suggestions on how I can improve the situation here? I understand that the search space for my strategies is pretty big, but I would expect to at least get something useful out of it after running for a few hours. By the way, if you want to run the above test, you may have to install NumPy from git (sorry).

Is there maybe some way to tell hypothesis that certain things are more important than others? When I look at the verbose output, it seems to be trying a lot of things that aren't really going to make a difference for the most part, like very large integers. It also really likes producing empty tuples. Those are good to test, but they generally represent trivial cases, and in a lot of instances, if an empty tuple is a problem, I would expect a larger example to shrink to it. I'm OK with it testing these things, but watching the verbose output, it feels like it cares about them a little too much. Or should I not focus too much on what hypothesis is doing internally?

I know I can add @example cases, but that seems a lot like I would just be fooling myself into thinking hypothesis is useful, because @example doesn't actually influence the example generation from the strategies (explicit examples also don't shrink, which is kind of annoying). So an @example might cover a case I already know is wrong, but hypothesis still wouldn't find the cases that I don't know about.

Zac-HD commented 4 years ago

Hmm. Unfortunately I don't have any silver-bullet style advice; I think this is just a case where failing examples are a very small fraction of your input space.

If you can work out how to summarise "goodness" of an example as a real number (or several), hypothesis.target() can help. Usually I'd also be excited about some coverage-driven stuff I'm working on, but I don't think it would actually help much in this case.

Otherwise... you can probably make examples somewhat more diverse by complicating your strategies. This usually isn't worth it, but it does kinda work: for example, using s | s.filter(<not an empty tuple>) | <s but with min_size=2> to boost the chance that s generates non-empty tuples. In general this is both harder and less useful than it seems, though; Hypothesis's heuristics are already pretty good.
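
A minimal sketch of that union trick (the names here are made up; shapes stands in for whichever tuple strategy is being biased):

from hypothesis.strategies import integers, lists

# Hypothetical base strategy: shape tuples of small integers.
shapes = lists(integers(0, 10), max_size=5).map(tuple)

# Union the plain strategy with a filtered variant and a variant that forces
# length >= 2, which skews generation toward non-trivial tuples while still
# allowing the empty and one-element cases.
biased_shapes = (
    shapes
    | shapes.filter(lambda t: len(t) > 0)
    | lists(integers(0, 10), min_size=2, max_size=5).map(tuple)
)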

asmeurer commented 4 years ago

Thanks for responding.

Actually, now that I think about it, the problem might not be the empty tuples. I think the issue is that a lot of the different cases for this particular function raise NotImplementedError. I currently just ignore NotImplementedError in the test, so that it can "automatically" work once those cases become implemented. I actually used to assume(False) on NotImplementedError, but I removed that when hypothesis told me there were too many filtered examples. Perhaps I should have heeded that warning. I'm pretty sure that once I implement more cases, the error that I couldn't find here would be found much more easily. Maybe I should make a separate test with a more restrictive set of strategies that I know will be implemented. Is there a way to decorate the same test twice to generate two different tests?
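
For reference, one way to get two tests out of one body (a minimal sketch; check_as_subindex is a hypothetical name, and the strategies are the ones from the snippets above) is to keep the body as a plain function and apply given() to it twice:

from hypothesis import given
from ndindex.tests.helpers import ndindices, shapes

def check_as_subindex(idx1, idx2, shape):
    ...  # the shared test body

# Broad test: the default strategies, which hit many NotImplementedError cases.
test_as_subindex_broad = given(ndindices, ndindices, shapes)(check_as_subindex)

# Narrow test: the restricted strategies defined earlier, which stay within the
# implemented cases.
test_as_subindex_narrow = given(
    Tuples_of_slices, Tuples_of_integerarrays, shapes)(check_as_subindex)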

Another question, which I think will actually be a bigger problem going forward: is there some way to tell hypothesis that two strategies should be "mostly" related to one another in some way? To be more concrete, if you have a boolean array index idx and index an array a with a given shape shape, a[idx] will raise an IndexError unless the shape of idx matches shape. In general, idx might be part of a tuple index, in which case it should match shape at the right position. I want to test the IndexError case too, but in the typical case, the shape generated for idx should be related to shape. idx and shape would be generated from different strategies, like

# The actual strategies used in ndindex are a little bit more complicated than this

# As before, tuples here is ndindex's test helper; arrays is
# hypothesis.extra.numpy.arrays and bool_ is numpy.bool_.
shapes = tuples(integers(0, 10))
boolean_arrays = arrays(bool_, shapes)

@given(boolean_arrays, shapes)
def test_booleanarray(idx, shape):
    ...

Maybe I can use target to come up with a rough measure of how close idx.shape is to shape. Are there any other clever ways to do this?
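
For what it's worth, a sketch of that idea (the closeness metric here is invented: count the mismatched dimensions and negate the total, so that higher is better):

from hypothesis import given, target

@given(boolean_arrays, shapes)   # the strategies defined just above
def test_booleanarray(idx, shape):
    # Steer generation toward examples where idx.shape lines up with shape:
    # the score is 0 on an exact match and more negative the further apart
    # the two shapes are.
    mismatch = abs(len(idx.shape) - len(shape))
    mismatch += sum(a != b for a, b in zip(idx.shape, shape))
    target(-float(mismatch), label="negative shape mismatch")
    ...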

asmeurer commented 4 years ago

I played with using target here https://github.com/Quansight/ndindex/pull/78. You can run the tests with -s to see the target values. I've added an intentional bug to see if it can find it. I first tried checking the ratio of boolean array shapes that matched a subset of the index array shapes, but that seemed to be too fine-grained. So instead I made the target negative when the boolean array shape mismatches (by how much it mismatches), and positive when it matches, scaled by its shape. I tried giving more score to cases with more dimensions and more arrays, to get those cases to appear more often.

I've had mixed results. I eventually did get it to find the intentional bug, but only after many examples. It isn't really clear to me if the targeting made it find the bug any faster (and I'm not sure how to test that, especially now that the failure is in the database). Based on the output of my print statements, it doesn't really seem to be smart enough to increase the target values.

Zac-HD commented 4 years ago

For related shapes, you probably want some combination of st.shared(), npst.broadcastable_shapes(), and npst.mutually_broadcastable_shapes().

Yeah... targeting does tend to increase the score pretty well, but it only kicks in halfway through a run, and arrays are a hard structure for it. Coming up with a metric that correlates well with bugs is often a lot harder than it sounds, too. As for empirically checking the impact: you can run a number of tests with @settings(database=None) (or use the phases setting to skip Phase.reuse) and --hypothesis-seed=0, 1, 2, 3, ... to get reproducible runs, plus --hypothesis-show-statistics to see how long it took to find the bug. It's a manual process, but it does work eventually!
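
Concretely, that might look something like this (a sketch reusing the strategies from the example above; the seed values are arbitrary):

from hypothesis import given, settings

# database=None means previously saved failures aren't replayed, so every
# seeded run has to generate the failing example from scratch.
@settings(database=None, max_examples=1000)
@given(boolean_arrays, shapes)
def test_booleanarray(idx, shape):
    ...

# Then compare runs with, e.g.:
#   pytest --hypothesis-seed=0 --hypothesis-show-statistics
#   pytest --hypothesis-seed=1 --hypothesis-show-statistics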

asmeurer commented 4 years ago

Nice, shared is what I was looking for: I can generate an array shape and then pick a subsequence of it for the boolean array index shape. I'll play with this and see if it does a better job at generating useful examples. As a side note, it might be helpful if that documentation page were categorized by type, so that meta-strategies like shared are in a separate list from the basic strategies.
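
A rough sketch of that idea (the prefix logic is invented, and the real ndindex strategies are more involved):

from hypothesis import given
from hypothesis.strategies import integers, lists, shared
from hypothesis.extra.numpy import arrays
from numpy import bool_

# One shape per example, reused everywhere the shared strategy is drawn.
shared_shapes = shared(lists(integers(0, 10), max_size=4).map(tuple), key="shape")

def prefix_of(shape):
    # The boolean index covers some number of leading dimensions of the shape.
    return integers(0, len(shape)).map(lambda n: shape[:n])

boolean_arrays = arrays(bool_, shared_shapes.flatmap(prefix_of))

@given(boolean_arrays, shared_shapes)
def test_booleanarray(idx, shape):
    ...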

asmeurer commented 4 years ago

I've been using this to test the effectiveness of the strategies (after introducing a bug that would only be found if a good example is generated).

from hypothesis import seed, settings, Phase
from ndindex.tests.test_newshape import test_newshape_hypothesis
import warnings
warnings.filterwarnings('error')

# Skip Phase.reuse so the example database doesn't replay a previously found
# failure; each seeded run has to generate a failing example on its own.
func = settings(phases=[Phase.generate, Phase.shrink], max_examples=10000)(test_newshape_hypothesis)

# Run the test under N different seeds and count how many of them find the bug.
N = 20
failures = 0
for i in range(N):
    print(i)
    try:
        seed(i)(func)()
    except KeyboardInterrupt:
        break
    except BaseException as e:
        print(e)
        failures += 1
print(failures, N, failures/N)

The base run with no shared strategies gets 0 failures (I didn't attempt to see how many runs would be needed until it finds one). With a shared shape I can get it up to 95%, which drops to 75% if I also allow non-shared shapes using one_of. That's still low enough that it probably won't find a bug in a normal test run, but high enough that I can be reasonably confident after a few runs with --max-examples=10000.

asmeurer commented 4 years ago

I'm a little disappointed that hypothesis never found the failing example here https://github.com/Quansight/ndindex/pull/80, even after running millions of examples. I think the issue with the as_subindex tests is that there are so many cases that raise NotImplementedError that if a tuple index contains any index that isn't implemented, the whole test is more or less skipped. I used to filter these out, but I was getting filtering warnings, so I changed it to just ignore NotImplementedError. But I guess I shouldn't have ignored them. Filtering these cases isn't going to be easy, especially in a way that doesn't make hypothesis warn about it. The best solution is to just implement more of the missing cases, so that this isn't a problem any more.

Zac-HD commented 4 years ago

The best solution is to just implement more of the missing cases, so that this isn't a problem any more.

Yep. The usual workaround in the meantime would be to restrict your slices strategy so that it only generates cases you know you can handle :confused:
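
For instance (a sketch with made-up bounds; which slices are actually handled depends on as_subindex), a restricted strategy could look like:

from hypothesis.strategies import builds, integers, just, none, one_of

# Hypothetical restriction: non-negative starts and stops and a step of None
# or 1, i.e. the kind of slices the current implementation handles.
implemented_slices = builds(
    slice,
    one_of(none(), integers(0, 10)),  # start
    one_of(none(), integers(0, 10)),  # stop
    one_of(none(), just(1)),          # step
)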