elm-explorations / test

Write unit and fuzz tests for Elm code.
https://package.elm-lang.org/packages/elm-explorations/test/latest
BSD 3-Clause "New" or "Revised" License

Integrated shrinking approach #151

Closed · Janiczek closed this 2 years ago

Janiczek commented 4 years ago

fixes #149

Swap the current rose-tree approach for an "integrated shrinking" one.

Based on Hypothesis / Minithesis / elm-microthesis.

Instead of simplifying generated values, we add a layer of indirection: we remember the numbers drawn from the PRNG (called a RandomRun; it can be thought of as a List Int) from which we generated the values, simplify the RandomRun irrespective of the values or fuzzers used, and then try generating values from the shrunk RandomRun again. Fuzzers are constructed in such a way that they synergize nicely with the simplifying strategies (called SimplifyCmds).

Explained in more detail in:

Guide for elm-test developers

Report from trying to run on real-life repos

Draft of migration docs

Janiczek commented 4 years ago

Here are some benchmarks (make sure you're on the #2 sheet) comparing elm-test master against elm-test on this PR with either Skinney/elm-deque or turboMaCk/queue (and the same benchmark in elm-microthesis).

[image: bench]

It's curious to me that even the unit generator (which is as close to a no-op as you can get) has such a perf drop when generating. There might still be something fishy going on.

Also, I'll try to add benchmarks for some real-world test suites. They might tell a different story than isolated fuzzers.

EDIT: it might be a good idea to start talking about whether this PR is even feasible given such benchmark results, or about what you would need to see / which questions you would need answered before making up your mind.

harrysarson commented 4 years ago

what is the difference between elm-test master and elm-test PR?

Janiczek commented 4 years ago

@harrysarson Sorry, I didn't send the links. Basically in my fork of elm-test there are three tags: benchmark-master, benchmark-pr and benchmark-pr-deque.

[image: diff]

elm-test master = tag benchmark-master = current elm-explorations/test master + stop after first failure

elm-test PR = tag benchmark-pr

elm-test PR+deque = tag benchmark-pr-deque

harrysarson commented 4 years ago

ah okay! Makes sense. Do you know why your PR (that adds microthesis to elm-test) is slower than microthesis itself?

Janiczek commented 4 years ago

@harrysarson Probably there is some overhead from elm-test doing more stuff overall compared to microthesis (unit tests, wrapping Expectation instead of a -> Bool, failure diffing, etc.).

Microthesis is basically just a function for running one fuzz test, without a lot of the bells and whistles.

harrysarson commented 4 years ago

Of course :) thanks for explaining it to me. One last question: I see there are some Debug.logs added to the source; they don't slow down the benchmarks, do they?

Janiczek commented 4 years ago

@harrysarson I've removed the Debug.logs when running the benchmarks. If kept in, they would slow the benchmarks down by a factor of 100x-1000x.

harrysarson commented 4 years ago

Brilliant okay, ignore me then!

Janiczek commented 4 years ago

I think I need a bit of external push here.

@drathier @harrysarson @rtfeldman @avh4 What do you feel needs work/answering the most right now? The TODOs in the PR description are everything and the kitchen sink; I feel it will be better to prioritize rather than try to do everything in whatever order.

drathier commented 4 years ago

I'd like to know what the perf difference of e.g. shrinking triple ints comes from, compared to elm-test master. Does µthesis do any fancy shrinking that's likely to result in a significant speedup, e.g. cutting list length in half or cutting numbers in half?
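For reference, "cutting numbers in half" amounts to a binary search toward zero over one recorded draw. A hypothetical Python sketch of that strategy (not the actual µthesis code; the function name is made up):

```python
def shrink_int(n, still_fails):
    """Binary-search toward 0 for the smallest value on which the test
    still fails. Assumes still_fails(n) is True on entry."""
    if still_fails(0):
        return 0
    lo, hi = 0, n  # invariant: still_fails(lo) is False, still_fails(hi) is True
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if still_fails(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Compared to repeatedly decrementing, this reaches the minimal failing value in O(log n) test executions, which is where a speedup over a naive shrinker would come from.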

Janiczek commented 4 years ago

@drathier I've run the master vs microthesis genAndShrink functions from the benchmarks against each other outside of elm-explorations/benchmark, building on these:

And the original elm-test is quite consistently a bit faster. When removing the Debug.logs and running on seeds 0..500 instead of 0..10, I believe I got 2.9s for master and 3.1s for PR when profiling in Chrome.

That's curious; I don't know where the slowdown relative to the benchmarks comes from.

I'll post a writeup with some examples of how the PR shrinks differently from master (shown by comparing Debug.log outputs from within the user test fn), and some interesting cases (e.g. lists and floats, which I feel will be the most interesting and differ the most).

drathier commented 4 years ago

Basically, I'd be happy to merge anything that's within a 2x slowdown for real-world projects. I bet that threshold differs for different people and different projects, but that's my personal line. I'm confident we can find smaller improvements to close that gap over time, probably even before the first release. However, any super-linear slowdown is harder to tackle. I'd recommend you completely disregard any perf diff smaller than +20%, or even +100%, for now. Keep in mind that both cache effects and JIT effects will significantly affect your benchmarks.

Wanna hop on a Slack call sometime this week to discuss this PR?

harrysarson commented 4 years ago

Some personal thoughts regarding performance

This is an experimental package, an exploration if you like, and so I think it is very reasonable to regress performance in service of the greater aim of exploration. In the long term, we will get better performance for testing in Elm by being willing to experiment and continuously improve than by maintaining a super stable library.

The biggest risk for elm-test is that it falls out of maintenance (this repo had only one person doing all the maintenance until last month, and I am the only one maintaining the node-test-runner repository, and only in bare minimum capacity). I would much prefer to see an uptick in interest in this repository driven by interesting changes such as those in this PR --- even if those changes degrade runtime performance --- than for people to be put off contributing to elm-test by a high bar for accepting changes.

If too many people rely on elm-test for us to risk making these explorations (which is possible), then we should be explicit about that and move this library to the elm/* namespace.

Janiczek commented 2 years ago

Here is my investigation into what migrations would be needed between 1.2.2 and 2.0.0 (with this PR included): https://docs.google.com/document/d/12rt3KV287AiFlR4OmOE14HPOeE5Zg_2NkS8puIkjGQM/edit?usp=sharing

TL;DR:

drathier commented 2 years ago

Okay, so how do we go about merging a 2-year-old, 7500-line PR? Could you write a short guide/intro to the new shrinker approach for us other elm-test devs, or if you have already done that, could you (re)link it here?

I've read the document, I like the test suite setup, I'm happy that you found multiple issues in code I've written. We should run that setup on as much of the elm package set as feasible.

How do we go about upgrading the elm package set to this new shrinking approach? This will be a big improvement for the future, I think, but it'll come with quite a cost right now. How can we make upgrading as easy as feasible for people? Is backwards compatibility a good thing, i.e. having a Fuzzer type that just doesn't use the given shrinker? Can we automagically upgrade test suites for people? Do we have a small todo-list for how to upgrade a test suite? Should we add before/after examples showing how we've migrated fuzzers? What's a good story here? Keep in mind there are other changes being released at the same time, many of which are many years old at this point.

Re true/false, they were removed intentionally after much debate. https://github.com/elm-community/elm-test/pull/242 It's too easy to reach for Expect.true when there's usually a better expectation to use. And if there isn't, you can Expect.equal True.

Janiczek commented 2 years ago

Could you write a short guide/intro to the new shrinker approach for us other elm-test devs?

There is no short guide, although this video slightly touches on the main differences between before and after. I will write a guide/tour for elm-test devs though!

We should run that setup on as much of the elm package set as feasible.

I can imagine writing a script that downloads the whole package set from search.json and checks:

And later, if we have an automatic upgrade helper, we could try running it automatically on all these downloaded packages and see if it works everywhere?

(Stating the obvious: it would have to deal with the whole of 1.2.2 -> 2.0.0 upgrade, not just with the new Fuzz API)
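Such a script could start roughly like this Python sketch. The search.json endpoint is the real Elm package index; everything else (function names, the predicate, the download step) is a hypothetical placeholder, since the actual checks haven't been decided yet:

```python
import json
import urllib.request

SEARCH_JSON_URL = "https://package.elm-lang.org/search.json"

def fetch_package_index():
    """Download the Elm package index: a JSON array with one entry
    per package (each entry has at least 'name' and 'version' fields)."""
    with urllib.request.urlopen(SEARCH_JSON_URL) as resp:
        return json.load(resp)

def select_packages(index, predicate):
    """Pick the names of the packages we want to download and check."""
    return [entry["name"] for entry in index if predicate(entry)]
```

Each selected package would then be downloaded (e.g. from its source repository at the listed version tag) and have whatever checks we settle on run against this branch.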

Is backwards compatibility a good thing, i.e. having a Fuzzer type that just doesn't use the given shrinker?

I would rather do what you suggest below (auto-upgrade test suites) and have a no-compromise, clean API than keep a compatibility layer around.

Can we automagically upgrade test suites for people?

I'd love this approach the most. Let's check with @avh4 on whether elm-upgrade could be repurposed for at least the easy changes, like Expect.true "foo" -> Expect.equal True and Fuzz.tuple (a,b) -> Fuzz.pair a b?

I don't expect this to be doable for the larger changes (Fuzz.custom someGenerator whateverShrinker -> someFuzzer, with the generator being translated into the fuzzer); sometimes the generator is part of the codebase proper and shouldn't be touched. The fuzzer could be written next to it, though, without touching the generator :shrug: Anyway, I have no idea whether elm-upgrade would be able to do this. I could provide the rules for the Random->Fuzz translation.

Do we have a small todo-list for how to upgrade a test suite? Should we add before/after examples showing how we've migrated fuzzers? What's a good story here? Keep in mind there's other changes being released at the same time, many of which are many years old at this point.

I think this is a must. We don't have it, but let's write one. I can write the Fuzz translation examples, the table of changes, and some general "how to think about these in 2.0.0" prose, but I don't know enough about the other changes (e.g. the idiomatic way to get rid of Expect.true), so let's collaborate 😇

Re true/false, they were removed intentionally after much debate. elm-community/elm-test#242 It's too easy to reach for Expect.true when there's usually a better expectation to use. And if there isn't, you can Expect.equal True.

Ah, now that I've checked #242, I see that you can have Expect.equal True AND keep the helpful message with Expect.onError! I didn't realize that.

Janiczek commented 2 years ago

@drathier @harrysarson The high-level guide for elm-test developers/maintainers: elm-test-shrinking.pdf. Please ask any questions that come up! I'll be glad to answer. (Also, to hop on a Zoom/Discord/... call if that helps.)

Janiczek commented 2 years ago

Here's a draft of the translation / migration notes. Based on elm diff for the microthesis branch. https://gist.github.com/Janiczek/2e5cf91694851866fda9089d649baad9

Janiczek commented 2 years ago

Thanks @harrysarson ! Let's work on the 2.0.0 preparations then 🙂 I'll probably start with finishing the release docs draft and then/meanwhile we could get the experimental flag for test runners going that you mentioned on Discord?