Test succeeds on travis-ci but fails locally, today

piccolbo commented 6 years ago

I am a bit puzzled by this irreproducible situation. This test passes on travis https://travis-ci.org/piccolbo/autosig and passed locally yesterday, but now doesn't anymore. It doesn't with the normal make test, but it does if I run the test interactively to try and debug it, as if the witness weren't in the example database. I rolled back to the commit that's on travis, to no avail. Now the test is funny, and I think I can fix it, but I feel like it would be a missed learning opportunity.

The test is somewhat complicated but it looks a little like this

@given(x=strategy(), y=strategy())
def test(x,y):
    assert x!=y

Since the range generated by the strategy is huge and I think like a statistician, this should pass on a meager 100 runs with probability 1. But as @DRMacIver himself explained to me in another issue, that's the wrong way to think about strategies and an assumption of independence or uniformity of any sort on the strategies is going to lead to pain. Indeed it looks like the witness has x==y, even if what is printed is not the full value. I fully accept that and I think a well placed assume in there will solve the problem. Nonetheless, the test was passing yesterday on the same commit. I am running the test in a virtual env, same as travis runs. Besides the failures, I noticed today an exorbitant number of invalid examples, only on the failing test.

tests/test_.py::test_decorated_call:

  - 100 passing examples, 0 failing examples, 5 invalid examples
  - Typical runtimes: 14-200 ms
  - Fraction of time spent in data generation: ~ 93%
  - Stopped because settings.max_examples=100

tests/test_.py::test_decorator_fails:

  - 100 passing examples, 0 failing examples, 5 invalid examples
  - Typical runtimes: 24-234 ms
  - Fraction of time spent in data generation: ~ 97%
  - Stopped because settings.max_examples=100

tests/test_.py::test_decorated_call_fails:

  - 14 passing examples, 162 failing examples, 540 invalid examples
  - Typical runtimes: 21-41 ms
  - Fraction of time spent in data generation: ~ 94%
  - Stopped because nothing left to do

The strategies are the same between the three tests. No assume or unique that I used explicitly. Complete file is here: https://github.com/piccolbo/autosig/blob/master/tests/test_.py. Any suggestion to get to the bottom of this would be appreciated. Hypothesis 3.69.12 (I downgraded to 3.30.0 to give it a try, but same results).

piccolbo commented 6 years ago

I learned about reproduce_failure. Investigating.

piccolbo commented 6 years ago

Can reproduce in pdb now. Test fail exactly as I expected, that is x==y in the simplified version above. The questions remain: why now and not before? Why did I need to use reproduce failure with the example database available?

piccolbo commented 6 years ago

Solved with assume. The number of invalid examples remained high though. Then I changed something unrelated elsewhere and it went back down to almost 0.

Zac-HD commented 6 years ago

[x] We should clearly document that generated values are arbitrary but not random, and should not be thought of as having any particular distribution or correlation. Include links to swarm testing, coverage-guided fuzzing etc. as examples of things we may do internally.
I found the data generation time surprisingly high, but it does make sense - you're generating a new class with attrs for each test case and then doing a fairly cheap call to check validation.
[x] The large number of invalid examples is exactly because it's a failing test: these are (mostly) failed attempts to generate a smaller input while shrinking a failing example. We should probably report statistics separately for replaying saved examples, for initial generation and for shrinking so this is visible.
I think the large number of invalid examples were from retrying previously failing examples from the database, which changes in your strategies had since made invalid. Either routine clearing of the DB (invalid examples are deleted) could have finished, or the test changed enough to move to a new DB key. This sounds like it's working as intended!

Unrelated tip: it looks like you're decorating all your tests with the same settings - check out the profiles mechanism, or just set the attribute on the global settings.default object!

Zac-HD commented 6 years ago

With specific notes for the action items, I'm going to close this issue - but remain happy to hear and help with any updates!

piccolbo commented 6 years ago

Fantastic input @Zac-HD, thanks ! The one thing I am still not understanding is why I could get the failure with a make test, then in ipython, same virtual env, I did a %load of the test file, run the failing test and it did succeed, seconds later. Shouldn't the bug witness be in the database at that point? What are the conditions that require using reproduce_failure on the same machine, in the same virtual env, with the example database available?

Two comments:

if randomization is only one of many possible search strategies, maybe randomized testing is not the correct name. Property-based and search based yes, but randomization is not the main component. Probabilities are not so important.
Yes to breaking out those stats as you described.

Zac-HD commented 6 years ago

It should indeed happen automatically, but there are too many rare-but-possible subtle problems to offer a confident diagnosis. My personal guess would be that there was some non-determinism introduced by e.g. hash randomization causing non-reproducible iteration order somewhere, but that's just a guess because it's bitten me before.

If you have examples where we've called it randomized testing in the docs, I agree that we should change them and would appreciate a pointer here or on #1514.
Issue already open 👍

HypothesisWorks / hypothesis

Test succeeds on travis-ci but fails locally, today #1554