rtfeldman opened this issue 6 years ago (status: Open)
cc @mgold @drathier @jwoudenberg @avh4 @Janiczek
Tons of questions; we could split a few off into different issues.

I thought we wanted to remove `andThen` because of poor performance, mainly in shrinking? If we're sure we'll hit a user-fixable runtime error on the first run when someone makes this mistake, like `frequency []`, that doesn't seem like a problem to me. Maybe I'm not using large enough test suites for the failed run to matter? Relatedly, why do we have the concept of an invalid fuzzer? Who's using it?

A quick fix for the out-of-memory issues could be to increase the Node heap size with `--max_old_space_size`.
> by using them at all, we accept that some of our tests will fail on some runs and not others - in fact, we're often hoping for that to happen!
I use fuzzers because they're great at finding new bugs. I don't want them to forget about old bugs they found previously. Whether that means auto-generating tests, manually adding edge cases, or something else, I don't know. I'm using elm-test-tables right now for manually storing edge cases in test suites, but Python's Hypothesis stores inputs (per-test seeds) in a file and always runs the seeds in that file before it starts fuzzing. Thus, you can never get a test suite that once failed CI to pass by retrying. There's still the problem of tests passing on the first attempt and failing someone else's PR by blowing up later, but that's not fixable with fuzzers. Ah well.
Regarding the slow tests, can we look into choosing the number of runs for each fuzz test based on how large the fuzzer is? Maybe add common edge cases with each fuzzer? Maybe store a weight in each fuzzer of roughly how many interestingly unique values it has, so we don't generate as many `bool` values as `(int, int, int)` values?
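That weight idea might look roughly like this (an illustrative Python sketch; the names and the run-count formula are invented for this example, not anything elm-test does):

```python
import math
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Fuzzer:
    generate: Callable[[], Any]
    size: float  # rough count of "interestingly unique" values

bool_fuzzer = Fuzzer(lambda: random.choice([True, False]), size=2)
int_fuzzer = Fuzzer(lambda: random.randint(-(2 ** 31), 2 ** 31 - 1), size=2 ** 32)

def triple(f1, f2, f3):
    """Tuple fuzzer: sizes multiply, so (int, int, int) dwarfs bool."""
    return Fuzzer(
        lambda: (f1.generate(), f2.generate(), f3.generate()),
        f1.size * f2.size * f3.size,
    )

def run_count(fuzzer, base=100, cap=1000):
    """Run tiny fuzzers exhaustively; grow runs only slowly for huge ones."""
    if fuzzer.size <= base:
        return math.ceil(fuzzer.size)
    return min(cap, base + int(math.log2(fuzzer.size)))
```

With this heuristic a `bool` fuzzer gets 2 runs while an `(int, int, int)` fuzzer gets a couple hundred, instead of both getting a flat 100.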
@drathier I couldn't find any "hanging" fuzzers. Issue #132 is probably relevant here: it says that problem is resolved. So the reasons for removing `Fuzz.andThen` are, I think, either based on outdated knowledge ("`Fuzz.andThen` tests hang when not written carefully") or philosophical rather than performance-related ("`Fuzz.andThen` can return an invalid fuzzer", i.e. what this issue and #160 are about).
> based on outdated knowledge ("`Fuzz.andThen` tests hang when not written carefully")
I went back to every `Fuzz.andThen` issue, and I think the current behavior (0.18.12 runner) is acceptable. `andThen` with `frequency []` still crashes, but at least we get a nice error message inline with the other test failures. If we keep `andThen`, we should add regression tests based on these earlier reports.
The concept of an "invalid fuzzer" was created because it seemed to be coming up in a few different cases, but it's actually just the two that Richard mentioned at the top of the thread. `oneOf` can be fixed with an API change. So I'm starting to agree with Richard: just have `frequency` throw if your math is wrong, rather than introduce a whole concept into the codebase.
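Having `frequency` throw on bad weights might look like this (a Python sketch of the validation logic only, not the elm-test implementation):

```python
import random

def frequency(weighted_fuzzers):
    """Pick a generator with probability proportional to its weight.

    Instead of an 'invalid fuzzer' value flowing through the library,
    the two bad cases raise immediately, at construction time.
    """
    if not weighted_fuzzers:
        raise ValueError("frequency requires a non-empty list of (weight, fuzzer) pairs")
    if any(weight <= 0 for weight, _ in weighted_fuzzers):
        raise ValueError("frequency weights must all be positive numbers")
    total = sum(weight for weight, _ in weighted_fuzzers)

    def generate():
        r = random.uniform(0, total)
        for weight, gen in weighted_fuzzers:
            r -= weight
            if r <= 0:
                return gen()
        return weighted_fuzzers[-1][1]()  # guard against float rounding

    return generate
```

The point is that `frequency []` (and negative weights) fail loudly the moment the fuzzer is built, with a message naming the actual mistake, rather than surfacing later mid-run.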
Meanwhile, it seems that we have a working implementation of `andThen`, and the possibility of using it with `frequency` to blow up doesn't seem that bad. The biggest argument for ditching it is that a general shrinking algorithm won't do as good a job as one that's familiar with your data structure. This is true, but I think the general algorithm can do well enough to be worthwhile. (I might be inclined to stick it in a separate file -- the implementation is pretty hairy.)
It's also possible that @drathier's idea of "choosing the number of runs for each fuzz test based on how large the fuzzer is" could tie into a smarter shrinking strategy? It seems like all of these ideas are connected, and I can't disentangle them yet.
Overall, at this point in my life I don't feel strongly that `andThen` should be removed. But I've also stopped using it completely in my own code, so I haven't had any recent experience with it. I don't see much in the way of issues reported about it, though (at least not recently), so that's a good sign.
> Shrinking performance is often bad. However, as #237 notes, `Fuzz.andThen` is only one of several ways to end up with slow tests when using fuzzers. If we need a general solution to that problem anyway, then removing `Fuzz.andThen` from the API doesn't solve it.
I think this assumes a general solution for fuzzer-related performance issues exists. Barring such a solution, our only recourse is to attack each performance problem separately.
> Realizing this made me rethink whether it's such a problem after all that you can get invalid fuzzers midway through a test run. The whole point of fuzzers is to expose edge cases; by using them at all, we accept that some of our tests will fail on some runs and not others - in fact, we're often hoping for that to happen!
I don't completely agree with this. If a test occasionally fails because of a bug in the code under test, that's property testing striking gold. If a test occasionally fails because of a bug in the test, that's test flakiness. Few things are hated as universally as test flakiness 😅.
> Regarding the slow tests, can we look into choosing the number of runs for each fuzz test based on how large the fuzzer is?
I think this is a good idea, but I would use it to run more tests if the fuzzer is complicated. As it stands, it's weird that we run the same 100 test cases when fuzzing a single `Int` as when fuzzing some complicated `Model`. I think a reasonable expectation users might have is that both the simple and the more complicated fuzz tests get tested equally thoroughly. This would make test performance worse, though.
> Maybe add common edge cases with each fuzzer? Maybe store a weight in each fuzzer of roughly how many interestingly unique values it has, so we don't generate as many `bool` as `(int, int, int)`?
😻 These are great!
> If we're sure we'll hit a user-fixable runtime error on the first run if someone makes this mistake, like `frequency []`, that doesn't seem like a problem to me.
I think the primary problem with the runtime error is its poor error message.
> The biggest argument for ditching it is that a general shrinking algorithm won't do as good a job as one that's familiar with your data structure.
Yeah, I think this is at the moment the biggest disadvantage of using `andThen`. When I originally proposed removing `andThen`, I looked at a number of ways people were using it. Most of the scenarios where `andThen` was used could be rewritten to use `mapN` or `andMap`, which would have resulted in better shrinking. So for me the argument was: there's a better way to write these tests, let's make it the only way you can write them.
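The kind of rewrite described here can be sketched with the classic "list whose length is itself fuzzed" example (illustrative Python, not elm-test code; in Elm the second version corresponds roughly to combining a length fuzzer and a list fuzzer with `Fuzz.map2`):

```python
import random

# andThen-style: the second generator's shape depends on the first value,
# so a generic shrinker only sees the final list, not its two ingredients.
def short_list_via_and_then():
    n = random.randint(0, 10)
    return [random.randint(0, 100) for _ in range(n)]

# mapN-style rewrite: generate the parts independently, then combine them
# with a pure function. Each part can now be shrunk on its own.
def short_list_via_map2():
    n = random.randint(0, 10)
    values = [random.randint(0, 100) for _ in range(10)]
    return values[:n]  # pure function of (n, values)
```

Both functions produce the same distribution of lists, but in the second version a shrinker can shrink `n` and the element values independently, which is why the rewrite tends to give better counterexamples.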
I feel that if we keep `andThen`, we should add a note to the documentation advising that it be avoided where possible.
It seems like the two problems with `Fuzz.andThen` mentioned in https://github.com/elm-community/elm-test/issues/161 boil down to:

1. Shrinking performance is often bad. However, as #237 notes, `Fuzz.andThen` is only one of several ways to end up with slow tests when using fuzzers. If we need a general solution to that problem anyway, then removing `Fuzz.andThen` from the API doesn't solve it.
2. Invalid fuzzers can be returned by `Fuzz.andThen`, and they may only sometimes show up as invalid. See https://github.com/elm-community/elm-test/issues/160

It seems that we have only two functions that can potentially return invalid fuzzers:

1. `Fuzz.oneOf`. This doesn't need to use the concept of invalid fuzzers. It can be `oneOf : Fuzzer a -> List (Fuzzer a) -> Fuzzer a` so that it always has at least one value to work with.
2. `Fuzz.frequency`. Even if it did the same "one or more" trick as `Fuzz.oneOf`, it would still have two invalid cases: if any of its weights aren't positive numbers, or if all the weights do not add up to a positive number.

One potential solution would be if we didn't have `Fuzz.frequency` and instead only had `Fuzz.oneOf` with the "requires at least one element at compile time" API. The example in the `frequency` docs could then be rewritten with `oneOf`, using `List.repeat` to duplicate each fuzzer in proportion to its weight. In my opinion, this is not actually better. It's worse for performance, and `List.repeat` has the same problem as `Fuzz.frequency`: what happens when you pass it negative numbers? We're just shifting the burden elsewhere.

Realizing this made me rethink whether it's such a problem after all that you can get invalid fuzzers midway through a test run. The whole point of fuzzers is to expose edge cases; by using them at all, we accept that some of our tests will fail on some runs and not others - in fact, we're often hoping for that to happen!
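The "requires at least one element" trick translates to most languages by making the first alternative a separate, required argument (a Python sketch of the idea; in Elm the `Fuzzer a -> List (Fuzzer a) -> Fuzzer a` signature lets the compiler enforce it):

```python
import random

def one_of(first, rest=()):
    """A oneOf that cannot be empty: the first generator is a required
    positional argument, so the `oneOf []` mistake is unrepresentable."""
    generators = [first, *rest]

    def generate():
        return random.choice(generators)()

    return generate
```

With this shape there is no empty-list case to detect at runtime, which is exactly why `oneOf` wouldn't need the invalid-fuzzer concept at all.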
Assuming we fixed the error message in https://github.com/elm-community/elm-test/issues/160 to be nice, how bad would it actually be if sometimes fuzzers turned out to be invalid midway through a run, leading to a failed test? It doesn't seem likely to lead to a whole lot of frustration in practice.
Thoughts?