elm-community / elm-test

moved to elm-explorations/test
https://github.com/elm-explorations/test
BSD 3-Clause "New" or "Revised" License
340 stars 35 forks source link

Should it be okay for invalid fuzzers to fail mid-test? #243

Open rtfeldman opened 6 years ago

rtfeldman commented 6 years ago

It seems like the two problems with Fuzz.andThen mentioned in https://github.com/elm-community/elm-test/issues/161 boil down to:

  1. Shrinking performance is often bad. However, as https://github.com/elm-community/elm-test/issues/237 notes, Fuzz.andThen is only one of several ways to end up with slow tests when using fuzzers. If we need a general solution to that problem anyway, then removing Fuzz.andThen from the API doesn't solve it.
  2. "Invalid fuzzers" can go unnoticed at runtime if they are hidden behind a Fuzz.andThen. They may only sometimes show up as invalid. See https://github.com/elm-community/elm-test/issues/160

It seems that we have only two functions that can potentially return invalid fuzzers:

  1. Fuzz.oneOf. This doesn't need to use the concept of invalid fuzzers. It can be oneOf : Fuzzer a -> List (Fuzzer a) -> Fuzzer a so that it always has at least one value to work with.
  2. Fuzz.frequency. Even it did the same "one or more" trick as Fuzz.oneOf, it would still have two invalid cases: if any of its weights aren't positive numbers, or if all the weights do not add up to a positive number.

One potential solution would be if we didn't have Fuzz.frequency and instead only had Fuzz.oneOf with the "requires at least one element at compile time" API.

The example in the frequency docs is:

Fuzz.frequency
    [ ( 1, Fuzz.constant Leaf )
    , ( 2, Fuzz.map2 Branch (tree (i - 1)) (tree (i - 1)) )
    ]

This could also be written:

Fuzz.oneOf
    (Fuzz.constant Leaf)
    (List.repeat 2 (Fuzz.map2 Branch (tree (i - 1)) (tree (i - 1)) ))

In my opinion, this is not actually better. It's worse for performance, and List.repeat has the same problem as List.frequency: what happens when you pass it negative numbers? We're just shifting the burden elsewhere.

Realizing this made me rethink whether it's such a problem after all that you can get invalid fuzzers midway through a test run. The whole point of fuzzers is to expose edge cases; by using them at all, we accept that some of our tests will fail on some runs and not others - in fact, we're often hoping for that to happen!

Assuming we fixed the error message in https://github.com/elm-community/elm-test/issues/160 to be nice, how bad would it actually be if sometimes fuzzers turned out to be invalid midway through a run, leading to a failed test? It doesn't seem likely to lead to a whole lot of frustration in practice.

Note: If we did this, and kept andThen, I'd want to change both oneOf and frequency to accept the extra argument, so if you're using them in an andThen it's super clear that you need to provide at least one element in the list. We don't need to bikeshed about that now though!

Thoughts?

rtfeldman commented 6 years ago

cc @mgold @drathier @jwoudenberg @avh4 @Janiczek

drathier commented 6 years ago

Tons of questions, we could split a few over to different issues.

I thought we wanted to remove andThen because of poor performance, mainly in shrinking? If we're sure we'll hit a user-fixable runtime error on the first run if someone makes this mistake, like frequency [], that doesn't seem like a problem to me. Maybe I'm not using large enough test suites for the failed run to matter? Relatedly, why do we have the concept of an invalid fuzzer? Who's using it?

Quick-fix for the out of memory issues could be to increase node heap size with --max_old_space_size.

by using them at all, we accept that some of our tests will fail on some runs and not others - in fact, we're often hoping for that to happen!

I use fuzzers because they're great at finding new bugs. I don't want them to forget about old bugs it found previously. If that means auto-generating tests, manually adding edge-cases or something else, I don't know. I'm using elm-test-tables right now for manually storing edge cases in test suites, but python hypothesis stores inputs (per-test seeds) in a file and always runs the seeds in that file before it starts fuzzing. Thus, you can never get a test suite that once failed CI to go through by retrying. There's still the problem of tests going through on the first attempt, and failing someone else's PR by blowing up later, but that's not fixable with fuzzers. Ah well.

Regarding the slow tests, can we look into choosing the number of runs for each fuzz test based on how large the fuzzer is? Maybe add common edge cases with each fuzzer? Maybe store a weight in each fuzzer of roughly how many interestingly unique values it has, so we don't generate as many bool as (int, int, int)?

Janiczek commented 6 years ago

@drathier I couldn't find any "hanging" fuzzers. Issue #132 is probably relevant here: it says that problem is resolved. So, the reasons for removing Fuzz.andThen are, I think, either based on outdated knowledge ("Fuzz.andThen tests hang when not written carefully") or philosophical rather than performance-related ("Fuzz.andThen can return invalid fuzzer", ie what this issue and #160 is about.).

mgold commented 6 years ago

based on outdated knowledge ("Fuzz.andThen tests hang when not written carefully")

I went back to every Fuzz.andThen issue and I think the current behavior (0.18.12 runner) is acceptable.

62: What used to hang indefinitely now terminates quickly.

105: Shrinking now longer produces a value that couldn't have made the test fail, or something.

132: Fixed by @jwoudenberg in #172.

160: andThen with frequency [] still crashes, but at least we get a nice error message inline with the other test failures.

If we keep andThen, we should add regression tests based on these earlier reports.

The concept of an "invalid fuzzer" was created because it seemed to be coming up in a few different cases, but it's actually just the two that Richard mentioned at the top of the thread. oneOf can be fixed with an API change. So I'm starting to agree with Richard: just have frequency throw if your math is wrong, rather than introduce a whole concept into the codebase.

Meanwhile, it seems that we have a working implementation of andThen and the possibility of using it with frequency to blow up doesn't seem that bad. The biggest argument for ditching it is that a general shrinking algorithm won't do a good of a job as one that's familiar with your data structure. This is true, but I think the general algorithm can do well enough to be worthwhile. (I might be inclined to stick it in a separate file -- the implementation is pretty hairy.)

It's also possible that @drathier's idea of choosing the number of runs for each fuzz test based on how large the fuzzer is can also tie into a smarter shrinking strategy? It seems like all of these ideas are connected and I can't disentangle them yet.

jwoudenberg commented 6 years ago

Overall, at this point in my life I don't feel strongly that andThen should be removed. But also I've stopped using it completely in my code so I haven't had any recent experience with it. I don't see much in the way of issues reported about it though (at least not recently), so that's a good sign.

Shrinking performance is often bad. However, as #237 notes, Fuzz.andThen is only one of several ways to end up with slow tests when using fuzzers. If we need a general solution to that problem anyway, then removing Fuzz.andThen from the API doesn't solve it.

I think this assumes a general solution for fuzzer-releated performance issues exists. Barring such a solution our only recourse is to attack each performance problem separately.

Realizing this made me rethink whether it's such a problem after all that you can get invalid fuzzers midway through a test run. The whole point of fuzzers is to expose edge cases; by using them at all, we accept that some of our tests will fail on some runs and not others - in fact, we're often hoping for that to happen!

I don't completely agree with this. If a test occasionally fails because of a bug of the code under test, that's property testing striking gold. If a test occasionally fails because of a bug in the test, that's test flakiness. Few things are hated as universally as test flakiness 😅.

Regarding the slow tests, can we look into choosing the number of runs for each fuzz test based on how large the fuzzer is?

I think this is a good idea, but I would use it to run more tests if the fuzzer is complicated. As it stands it's weird that we generate run the same 100 test cases when fuzzing a single Int that we do when fuzzing some complicated Model. I think a reasonable expectation users might have is that both the simple and more complicated fuzz tests both get tested equally thoroughly.

This would make test performance worse though.

Maybe add common edge cases with each fuzzer? Maybe store a weight in each fuzzer of roughly how many interestingly unique values it has, so we don't generate as many bool as (int, int, int)?

😻 These are great!

If we're sure we'll hit a user-fixable runtime error on the first run if someone makes this mistake, like frequency [], that doesn't seem like a problem to me.

I think the primary problem of the the runtime error is its poor error message.

The biggest argument for ditching it is that a general shrinking algorithm won't do a good of a job as one that's familiar with your data structure.

Yeah, I think this is at the moment the biggest disadvantage of using andThen. When I originally proposed removing andThen I looked at a number of ways people were using it. Most of the scenario's where andThen was used could be rewritten to use mapN or andMap, which would have resulted in better shrinking results. So for me the argument was: there's a better way to write these tests, let's make it the only way you can write them.

I feel if we keep andThen we should add a note to the documentation advising it is avoided if possible.