elm-explorations / test

Write unit and fuzz tests for Elm code.
https://package.elm-lang.org/packages/elm-explorations/test/latest
BSD 3-Clause "New" or "Revised" License

Use integrated shrinking instead of type-based shrinking #149

Closed Janiczek closed 2 years ago

Janiczek commented 4 years ago

TL;DR: I have working Elm code for fuzzers that shrink automatically (no user-defined shrinkers needed) and allow for non-problematic andThen. I'd like elm-explorations/test to use it instead of the current QuickCheck-like shrinking.

Also: https://discourse.elm-lang.org/t/elm-minithesis-shrinking-without-compromises/6071

Also for differences between the two types of shrinking: https://hypothesis.works/articles/integrated-shrinking/
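
For readers unfamiliar with the approach, the core trick can be sketched in a few lines of Elm. This is an illustrative sketch only (the type and function names are not the elm-minithesis API): a generator consumes a recorded sequence of random choices, so shrinking operates on that sequence and simply re-runs the generator, which is why andThen needs no special support:

    type alias Choices =
        List Int

    type Generator a
        = Generator (Choices -> Maybe ( a, Choices ))

    -- andThen just runs the second generator on the choices the first
    -- one left over; no per-type shrinker has to be composed.
    andThen : (a -> Generator b) -> Generator a -> Generator b
    andThen fn (Generator genA) =
        Generator <|
            \choices ->
                case genA choices of
                    Just ( a, rest ) ->
                        let
                            (Generator genB) =
                                fn a
                        in
                        genB rest

                    Nothing ->
                        Nothing

Shrinking a failing input then means shrinking the Choices list (deleting chunks, lowering numbers) and re-running the generator, so any value the shrinker produces is by construction one the generator could have made.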


Hello!

Recently @drmaciver created a minimalistic version of Hypothesis called Minithesis.

(As you might know, Hypothesis is a property-based testing library that uses a different approach to shrinking, which allows it to use andThen without issues, etc.)

I've ported it to Elm: elm-minithesis.

My question is: would you be open to having it, or your own implementation of it, in elm-explorations/test? IMHO there are benefits to it; I still have to benchmark / test / compare elm-minithesis against elm-explorations/test to find out where the drawbacks are.


PS: Having one testing library instead of many is, I think, a nice property of the Elm ecosystem, so even if you didn't want to incorporate this approach into elm-explorations/test, I most likely won't create a new CLI tool from the ground up; instead, I'll just publish it as an Elm package that interops with elm-explorations/test as seamlessly as possible.

harrysarson commented 4 years ago

Just to flag: we have a major change on master renaming "Shrinking" to "Simplifying". Presumably we would want to do the rename on minithesis-style fuzzers too if we like them. (I like them!)

drathier commented 4 years ago

I had the same thought when I read your code. I'd say let's wait a bit and see how well minithesis works first, and then take a look. It feels a bit odd to have both, and any fuzzers people have written already would have to be manually ported. If it works as well as I think, I'd be more than happy to switch over to it.

mgold commented 4 years ago

This looks like some great work, and it should be incorporated into the library, with, as Harry mentioned, the change from shrinking to simplifying.

The current implementation is workable, but it definitely has issues and I'm not worried about replacing it outright. It sounds like David R. MacIver, who wrote the Hypothesis article, has given this matter a lot of thought; I can believe him when he says to do it his way. Also, I've been drifting out of the community (nothing wrong with you folks or the language, "it's not you, it's me"), so passing the torch sounds like a good idea regardless.

You should know, @Janiczek, that you're signing up to be the person who maintains all of this. I can't give you write access to this repo, but it's probably time for that to happen.

Janiczek commented 4 years ago

@mgold Thanks for the kind words!

Who are the current maintainers of this library, then? Was it just you, or also @drathier / @rtfeldman / etc.? I have no visibility into the structure, and there is no MAINTAINERS file, humans.txt, CODEOWNERS, or any similar file to learn this from.

Also who has the admin/write rights then? Considering the elm-explorations org this is under, probably @evancz?


I think if I'm about to take over, I would like to share the maintainer responsibility with someone. My thought was to write this generating+shrinking logic again from scratch with somebody, explaining what I've learned from porting that library along the way. (The core of the current Janiczek/elm-minithesis code is largely done in a "port as closely as possible" way; I'm sure we could find nicer / more idiomatic ways to do things.)

Seems also like a good way to increase bus factor. Would any of the current maintainers / folks related to this repo be interested in doing this with me?

harrysarson commented 4 years ago

pick me :angel:

drathier commented 4 years ago

I think I'm the only person with write-access who's actively looking at issues here atm, and I'd be happy to continue doing that. I think myself, @harrysarson , @Janiczek and @rtfeldman would be a great team going forward. I'd be more than happy to take a more coordinating role; we've kinda lacked leadership and a way of working for a while now.

Janiczek commented 4 years ago

Sounds great! @rtfeldman can you confirm your status and thoughts here?


So, would the next step forward with this be to start a PR? Or do we need to answer some questions first, like benchmarks etc?

(If PR :heavy_check_mark: , would you like to pair with me on this or should I do this alone and focus on clear readable commits?)

rtfeldman commented 4 years ago

@Janiczek invited! Also @harrysarson, considering how long you've been a contributor to node-test-runner, I added you while I was at it.

I think the next step is benchmarks. Basically I'd want to avoid the scenario where we do a release with improvements to functionality but various people saying "ahhhhh it's so slow now!" 😄

Welcome aboard!

harrysarson commented 4 years ago

Thanks Richard!

Janiczek commented 4 years ago

Thanks Richard!

Sneak peek: for some nested fuzzers like tuple3 (tuple string oneOf) (tuple string oneOf) (tuple string oneOf), minithesis fully shrunk the value in 0.3s, where elm-test did not finish in time (shrinking the bottommost string took ~2 seconds per try; it seems there is some weird asymptotic behaviour going on with the rose tree).
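
For reference, a hedged reconstruction of that fuzzer using the elm-test 1.x Fuzz API (the oneOf alternatives below are illustrative placeholders, not the ones from the actual benchmark):

    nested : Fuzzer ( ( String, Int ), ( String, Int ), ( String, Int ) )
    nested =
        let
            -- placeholder alternatives; the original oneOf is not shown above
            someOneOf =
                Fuzz.oneOf [ Fuzz.int, Fuzz.intRange 0 10 ]

            pair =
                Fuzz.tuple ( Fuzz.string, someOneOf )
        in
        Fuzz.tuple3 ( pair, pair, pair )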

https://streamable.com/mef4nw

So I have big big hopes for some O(...) improvements :)

Janiczek commented 4 years ago

I have a benchmark spreadsheet in progress here: https://docs.google.com/spreadsheets/d/1Yk2smqvaT6KfnNBDdMclBTIsjAnoWnwedI31JmPOrdE/edit?usp=drivesdk

But before I fill it with more examples, could you please look at the benchmark files to check that I'm not e.g. comparing apples to oranges? https://github.com/Janiczek/elm-minithesis/tree/master/benchmarks/vs-elm-test

My main worry is that I'm getting better numbers because elm-minithesis with runs = 100 stops after the first failure, whereas elm-test with runs = 100 probably keeps generating all 100 runs even if it found a failing example on the first one? Can somebody confirm this hunch?

(If it's correct, I should first make the two libraries behave the same - either way)

harrysarson commented 4 years ago

If current elm-test does keep generating samples even after it finds one that fails, then surely users still have to wait for all the samples to be generated before the test run completes. If minithesis skips those runs, isn't that a performance win in and of itself?

Janiczek commented 4 years ago

There wasn't much rationale to find in the code: https://github.com/elm-explorations/test/blob/master/src/Test/Fuzz.elm#L60

So it's probably "just the way it is", unless @mgold remembers more from the time of implementing this?

Anyway, I'll try to change elm-test to stop after the first failure to get a more apples-to-apples comparison, and try to average these benchmarks over multiple seeds, so that one implementation doesn't "get lucky".

Janiczek commented 4 years ago

Good news everyone!

I present to you the benchmark spreadsheet. Data collected on these branches:

gen1 gen2


Those were averaged over 5 seeds, but I'm a bit squeamish about the variance, especially in the list-of-ints test. The results vary wildly based on how "lucky" they get with the seed (11k runs/s at seed 0, 298k runs/s at seed 4 :shrug:), and I'm not sure the average is the right choice here. (Median, maybe? Dunno.)

I could increase the number of seeds I run this over, but I feel like there's a sensible limit to how many numbers I'm willing to retype into the spreadsheet. Maybe if we could control elm-explorations/benchmark from within Elm :heart_eyes: we could try with as many seeds as we'd want.

Any ideas? Statistics aren't my strongest suit, so I'm not sure about my hunch that the median is the better choice.
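
For what it's worth, the median is the usual choice for noisy, heavy-tailed benchmark numbers, since one lucky seed can't drag it around the way it drags the mean. A minimal Elm sketch of such a helper (illustrative, not part of any of the benchmarked code):

    -- median of a list of runs/s measurements; Nothing for an empty list
    median : List Float -> Maybe Float
    median xs =
        let
            sorted =
                List.sort xs

            n =
                List.length sorted
        in
        if n == 0 then
            Nothing

        else if modBy 2 n == 1 then
            sorted |> List.drop (n // 2) |> List.head

        else
            case sorted |> List.drop (n // 2 - 1) |> List.take 2 of
                [ a, b ] ->
                    Just ((a + b) / 2)

                _ ->
                    Nothing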


Also, what do you think about the choice of (always False) / (always (Expect.fail "x")) for the user test function? That makes the shrinker run for all generated values until it gets to some minimal value; I'm not sure whether that's a good idea.

Perhaps we could have another set of benchmarks for when the user test fn is actually costly? I think more efficient shrinking might shine there (running the user test fn less often).


As you can see, I branched elm-minithesis off into elm-microthesis, which is written in a more "from the ground up", idiomatic Elm style, and the numbers show it. A lot of weird Result wrappers and unnecessary custom types got thrown away, so there's less busywork. I'm quite excited about it.


Any suggestions for a real-world fuzzer test suite that I could try to port over to microthesis, for a more, well, real-world example in the benchmark suite?


My interpretation of the benchmarks:

  1. Generation of values in elm-microthesis(queue) is on par with elm-test, which was a surprising result to me after living a few weeks with the elm-minithesis 3x slowdown -- so I'm glad it turned out that way, not complaining! You can see in the unit case (which is just a constant and the least work in all implementations, with no randomness and no shrinking) that elm-test still beats elm-microthesis there: elm-microthesis has more housekeeping costs once you remove all the randomness handling.

  2. Finding all 100 examples was indeed what slowed elm-test down so much -- when I changed it to stop after the first found failure, it suddenly became the leader of the pack for shrinking on the fuzzers I tested. (But, as seen in the message above, there are some more complicated fuzzers it doesn't even finish on.)

  3. Unsure what the triple-ints shrinking slowdown is about for Microthesis. My hunch is that it's less efficient to do

    map3 : (a -> b -> c -> d) -> Generator a -> Generator b -> Generator c -> Generator d
    map3 fn a b c =
        constant fn
            |> andMap a
            |> andMap b
            |> andMap c

    than to expand and thread it fully like

    map3 : (a -> b -> c -> d) -> Generator a -> Generator b -> Generator c -> Generator d
    map3 fn (Generator generatorA) (Generator generatorB) (Generator generatorC) =
        Generator <|
            \prng ->
                case generatorA prng of
                    Generated a ->
                        case generatorB a.prng of
                            Generated b ->
                                case generatorC b.prng of
                                    Generated c ->
                                        Generated
                                            { value = fn a.value b.value c.value
                                            , prng = c.prng
                                            }

                                    Rejected r ->
                                        Rejected r

                            Rejected r ->
                                Rejected r

                    Rejected r ->
                        Rejected r

    I'll benchmark :grin:


EDIT: roads not travelled:


Any feedback on benchmarks?

Do you think it's a good time to try to replace the current fuzz/shrink code of elm-test with the microthesis(queue) one on a branch / in my elm-test fork?

Janiczek commented 4 years ago

After reimplementing map3: generation got better (quite a lot); shrinking is mostly the same.

gen1

(I didn't commit these changes so as not to mess with the tags and branches for the benchmarks above, but it's literally just that map3 reimplementation. I suppose the other mapN functions would also benefit from the same thing.)

Janiczek commented 4 years ago

Based on comments by @turboMaCk and @Skinney I tried a deque implementation. Seems worse than queue for generation but better for shrinking.

elm-minithesis@benchmark-deque

deque

robinheghan commented 4 years ago

@Janiczek I'm pretty certain I can significantly improve the performance of several deque functions (like left, dropLeft), as well as introduce a few functions (like get, set, compare) that would be beneficial to you. I can probably get it done in about a week.

Would that be of interest?

Janiczek commented 4 years ago

@Skinney I would certainly use them if they were present! :heart:

Don't want to put any pressure on you though. Swapping the implementations is pretty easy, so we can go with the queue one for now and then re-benchmark and switch after any new deque release. So this issue IMHO isn't blocked by it, strictly speaking.

robinheghan commented 4 years ago

@Janiczek No pressure at all! I get a kick out of stuff like this. Just wanted to make sure you hadn't already made up your mind about not using a deque.

Quick question: it seems generation (as opposed to gen+shrink) only makes use of pushBack, is that right? If so, only the gen+shrink case can be improved, as pushFront and pushBack are already as fast as they can reasonably be.

Janiczek commented 4 years ago

@Skinney yes, in the tweet I mistakenly attributed popFront to generation but it's only used in shrinking.

Shrinking makes use of popFront, left, and either dropLeft or right (right seemed more performant when I compared the two, so I kept it), and would benefit from the (non-existent) repeat, get, set, and perhaps slice.
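
To make the shape of those shrinking operations concrete, here is a hypothetical deletion pass over the recorded choice sequence, written against the deque operations discussed here (left, dropLeft, append); the function itself is illustrative, not actual elm-minithesis code:

    -- try to remove `size` choices starting at `start`, keeping the
    -- prefix before the chunk and the suffix after it
    deleteChunk : Int -> Int -> Deque Int -> Deque Int
    deleteChunk start size choices =
        Deque.append
            (Deque.left start choices)
            (Deque.dropLeft (start + size) choices)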

turboMaCk commented 4 years ago

@Janiczek Just in case you feel it might make more sense to stick with FIFO, let me know if I should extend the API. I'm not sure I'd be able to optimize the implementation further, since it's been a while since I last touched that project, but feel free to open issues or ping me directly.

Janiczek commented 4 years ago

@turboMaCk what I was missing in the queue implementation was slice, left / take, right / dropLeft, concat / append, get, and set. But I'm not sure those are even idiomatic operations for the data structure (the same applies to deque, I feel).

Essentially you can look at the code and search for the Queue.toList escape hatch.
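
The escape hatch in question looks roughly like this (getAt is a hypothetical example of such an emulated operation; the point is that every call pays for a full O(n) conversion to a list):

    getAt : Int -> Queue a -> Maybe a
    getAt index queue =
        queue
            |> Queue.toList
            |> List.drop index
            |> List.head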

turboMaCk commented 4 years ago

I think it would at least make sense to provide enough so that you can define all of these without needing toList, which is quite an expensive operation.

robinheghan commented 4 years ago

@Janiczek Yeah. Ideally you'd want an RRBT implementation of Array, which I sadly never got around to implementing (I have four unmerged PRs in elm/core that are a year old now, so I wouldn't count on it happening any time soon either).

robinheghan commented 4 years ago

@Janiczek Here you go: https://package.elm-lang.org/packages/Skinney/elm-deque/latest/

left, right, dropLeft and dropRight are now 4x faster

Introduced initialize, repeat and range functions for faster construction of deques. They should also be 4-5x faster than List.repeat >> Deque.fromList.

I didn't get around to get and set, that will have to come at a later time.

I explored for a long while how to make left, right, dropLeft, dropRight as fast as append, and while I think I know how, it's going to take a while.

In any case, it should be really interesting to see how this latest version affects your benchmark :)

Janiczek commented 4 years ago

@Skinney Wow! I'm gonna try that, will post results here.

turboMaCk commented 4 years ago

I think @Skinney's implementation is the way to go, btw. For a case as complex as this, I think it is better suited than my queue, which performs re-balancing. I would recommend putting effort into that version.

Janiczek commented 4 years ago

@Skinney here are the results. bench

Looks like deque and queue have about the same performance now! No clear wins.

Now what I'm worried about is the performance of either one against elm-test master :grimacing: