Closed Janiczek closed 2 years ago
Just to flag: we have a major change on master renaming "Shrinking" to "Simplifying". Presumably we would want to do the rename on minithesis style fuzzers too if we like them. (I like them!)
I had the same thought when I read your code. I'd say let's wait a bit and see how well minithesis works first, and then take a look. It feels a bit odd to have both, and any fuzzers people have written already would have to be manually ported. If it works as well as I think, I'd be more than happy to switch over to it.
This looks like some great work and it should be incorporated into the library. With, as Harry mentioned, the change from shrinking to simplifying.
The current implementation is workable, but it definitely has issues and I'm not worried about replacing it outright. It sounds like David McIver, who wrote the Hypothesis article, has done a lot of thought on this matter I can believe him when he says to do it his way. Also, I've been dropping out of the community (nothing wrong with you folks or the language, "it's not you it's me") so passing the torch sounds like a good idea regardless.
You should know, @Janiczek, that you're signing up to be the person who maintains all of this. I can't give you write access this repo but it's probably time for that to happen.
@mgold Thanks for the kind words!
Who are the current maintainers of this library then? Was it just you, or also @drathier / @rtfeldman / etc? I don't see into the structure, and there is no MAINTAINERS
file or humans.txt
or CODEOWNERS
or any similar file to learn this from.
Also who has the admin/write rights then? Considering the elm-explorations
org this is under, probably @evancz?
I think if I'm about to take over, I would like to share the maintainer responsibility with someone. My thought was to write this generating+shrinking logic again from scratch with somebody, explaining what I've learned when porting that library along the way. (The core of the current Janiczek/elm-minithesis
code is largely done in a "port as closely as possible" way, I'm sure we could find nicer / more idiomatic ways to do things.)
Seems also like a good way to increase bus factor. Would any of the current maintainers / folks related to this repo be interested in doing this with me?
:angel:
I think I'm the only person with write-access who's actively looking at issues here atm, and I'd be happy to continue doing that. I think myself, @harrysarson , @Janiczek and @rtfeldman would be a great team going forward. I'd be more than happy to take a more coordinating role; we've kinda lacked leadership and a way of working for a while now.
Sounds great! @rtfeldman can you confirm your status and thoughts here?
So, would the next step forward with this be to start a PR? Or do we need to answer some questions first, like benchmarks etc?
(If PR :heavy_check_mark: , would you like to pair with me on this or should I do this alone and focus on clear readable commits?)
@Janiczek invited! Also @harrysarson considering how long you've been a contributor to node-test-runner
, I added you while I was at it.
I think the next step is benchmarks. Basically I'd want to avoid the scenario where we do a release with improvements to functionality but various people saying "ahhhhh it's so slow now!" 😄
Welcome aboard!
Thanks Richard!
Thanks Richard!
Sneak peek: for some nested fuzzers like tuple3 (tuple string oneOf) (tuple string oneOf) (tuple string oneOf)
minithesis fully shrunk it in 0.3s where elm-test did not finish in time (shrinking the bottommost string took ~2 seconds for every try; it seems there is some weird asymptotic behaviour going on with the rose tree).
So I have big big hopes for some O(...)
improvements :)
I have a benchmark spreadsheet in progress here: https://docs.google.com/spreadsheets/d/1Yk2smqvaT6KfnNBDdMclBTIsjAnoWnwedI31JmPOrdE/edit?usp=drivesdk
But before I fill it with more examples, could you please look at the benchmark files to see if I'm eg. not comparing apples to oranges? https://github.com/Janiczek/elm-minithesis/tree/master/benchmarks/vs-elm-test
My main worry is that I'm getting better numbers because elm-minithesis with runs = 100
stops after first failure, whereas elm-test with runs = 100
probably tries generating 100 times even if it found a failing example in the first one? Can somebody confirm this hunch?
(If it's correct, I should first make the two libraries behave the same - either way)
If current elm test does keep generating samples even if it finds one that fails then surely this will still mean that users have to wait for all the samples to be generated before the test run completes. If monithesis skips those runs then that is a performance win in and of its self?
There wasn't much rationale to find in the code: https://github.com/elm-explorations/test/blob/master/src/Test/Fuzz.elm#L60
So it's probably "just the way it is", unless @mgold remembers more from the time of implementing this?
Anyways I'll try to change elm-test to only do one failure to get more apples-to-apples comparison, and try to average these benchmarks over multiple seeds, so that one implementation doesn't "get lucky".
Good news everyone!
I present to you the benchmark spreadsheet. Data collected on these branches:
List Int
Array Int
- here is the changeQueue Int
- here is the change
Those were averaged over 5 seeds, but I'm a bit squeamish about the variance especially in the list of ints
test. The results vary wildly based on how "lucky" they get with the seed (11k runs/s at seed 0, 298k runs/s at seed 4 :shrug: ), and I'm not sure average is the right choice here. (Median maybe? Dunno)
I could increase the number of seeds I run this over, but I feel like there's a sensible limit to how many numbers I'm willing to retype into the spreadsheet. Maybe if we could control elm-explorations/benchmark
from within Elm :heart_eyes: we could try with as many seeds as we'd want.
Any ideas? Statistics aren't my strongest suite so I'm not sure about my hunch about the median being a better choice.
Also what do you think about the choice of (always False)
/ (always (Expect.fail "x"))
for the user test function? That makes the shrinker run for all generated values until it gets to some minimal value; I'm not sure if that's a good idea or not.
Perhaps we could have another set of benchmarks for when the user test fn is actually costly? I think more efficient shrinking might shine there (running the user test fn less often).
As you can see, I branched elm-minithesis off into elm-microthesis, which is written in a more "from the ground up", idiomatic Elm style, and the numbers show it. A lot of weird Result wrappers and unnecessary custom types got thrown away, so there's less busywork. I'm quite excited about it.
Any suggestions for a real-world fuzzer test suite that I could try to port over to microthesis, for a more, well, real-world example in the benchmark suite?
My interpretation of the benchmarks:
generation of values in elm-microthesis(queue)
is on par with elm-test
. Which was a surprising result to me after living a few weeks knowing about the elm-minithesis
3x slowdown -- so I'm glad it turned out that way, not complaining! You can see in the unit
case (which is just constant
and least work in all implementations, with no randomness and no shrinking) that elm-test
still beats elm-microthesis
there: elm-microthesis
has more housekeeping costs if you remove all the stuff dealing with randomness.
finding all 100 examples was indeed what slowed down elm-test so much - when I changed it to stop after first found failure, it suddenly became the leader of the pack for shrinking for the fuzzers I tested it on. (But, as seen in the message above, there are some more complicated fuzzers it doesn't even finish on.)
Unsure what the triple ints
slowdown for shrinking is about for Microthesis. My hunch is that it's less efficient to do
map3 : (a -> b -> c -> d) -> Generator a -> Generator b -> Generator c -> Generator d
map3 fn a b c =
constant fn
|> andMap a
|> andMap b
|> andMap c
than to expand and thread it fully like
map3 : (a -> b -> c -> d) -> Generator a -> Generator b -> Generator c -> Generator d
map3 fn (Generator generatorA) (Generator generatorB) (Generator generatorC) =
Generator <|
\prng ->
case generatorA prng of
Generated a ->
case generatorB a.prng of
Generated b ->
case generatorC b.prng of
Generated c ->
Generated
{ value = fn a.value b.value c.value
, prng = c.prng
}
Rejected r ->
Rejected r
Rejected r ->
Rejected r
Rejected r ->
Rejected r
I'll benchmark :grin:
EDIT: roads not travelled:
Any feedback on benchmarks?
Do you think it's a good time to try to replace the current fuzz/shrink code of elm-test with the microthesis(queue) one on a branch / in my elm-test fork?
After reimplementing map3
: generation got better (quite a lot), shrinking is mostly the same
(I didn't commit these changes to not mess with the tags and branches for benchmarks above, but it's literally just that map3
reimplementation. I suppose other mapN
functions would also benefit from the same thing.)
Based on comments by @turboMaCk and @Skinney I tried a deque implementation. Seems worse than queue for generation but better for shrinking.
@Janiczek I'm pretty certain I can significantly improve the performance of several deque
functions (like left
, dropLeft
), as well introduce a few functions (like get
, set
, compare
) that would be beneficial to you. I can probably get it done in about a week.
Would that be of interest?
@Skinney I would certainly use them if they were present! :heart:
Don't want to put any pressure on you though. Swapping the implementations is pretty easy to do so we can go with queue
one for now and then re-benchmark+switch after any new deque
release. So this issue IMHO isn't blocked by it, strictly speaking.
@Janiczek No pressure at all! I get a kick out of stuff like this. Just wanted to make sure you hadn't already made up your mind about not using a deque.
Quick question. It seems generation (as opposed to gen+shrink) only makes use of pushBack
, is that right? If so, only the gen+shrink
case can be improved as pushFront
and pushBack
are as fast as they can reasonably be.
@Skinney yes, in the tweet I mistakenly attributed popFront
to generation but it's only used in shrinking.
Shrinking makes use of popFront
, left
, either dropLeft
or right
(right
seemed more performant when I compared the two so I kept it), and would benefit from (non-existing) repeat
, get
, set
and perhaps slice
.
@Janiczek Just in case you feel it might make more sense to stick to FIFO let me know if I should extend the API. I'm not sure if I would be able to optimize the implementation further since It's been a while since I opened that project but feel free to just open issues or ping me directly.
@turboMaCk what I was missing in the queue implementation was slice
, left
/ take
, right
/ dropLeft
and concat
/ append
, get
and set
. But I'm not sure those are even idiomatic operations for the data structure (same thing applies to deque, I feel).
Essentially you can look at the code and search for the Queue.toList
escape hatch.
I think it would at least make sense to provide enough so you can define all of these without need for toList
which is quite expensive operation.
@Janiczek Yeah. Ideally, you'd want an RRBT implementation of Array, which I sadly never got around to implement (I have four unmerged PRs in elm/core that are a year old now, so I wouldn't count on it happening any time soon either).
@Janiczek Here you go: https://package.elm-lang.org/packages/Skinney/elm-deque/latest/
left
, right
, dropLeft
and dropRight
are now 4x faster
Introduced initialize
,repeate
and range
functions for faster construction of deques. Should also be 4-5x faster than List.repeat >> Deque.fromList
.
I didn't get around to get
and set
, that will have to come at a later time.
I explored for a long while how to make left
, right
, dropLeft
, dropRight
as fast as append
, and while I think I know how, it's going to take a while.
In any case, it should be really interesting to see how this latest version affects your benchmark :)
@Skinney Wow! I'm gonna try that, will post results here.
I think @Skinney's implementation is way to go btw. For case as complex as this I think it is better suited than my queue that performs re-balancing. I would recommend putting effort to that version.
@Skinney here are the results.
Looks like deque
and queue
have about the same performance now! No clear wins.
deque
is better at generating than queue
for the elm-test PR :heavy_check_mark:, but not for microthesis :x:deque
is worse at full runs than queue
for elm-test :x:, but actually better for microthesis :heavy_check_mark: Now what I'm worried about is performance of either against elm-test master
:grimacing:
TL;DR: I have a working Elm code for fuzzers that shrink automatically (no user-defined shrinkers needed) and allow for non-problematic
andThen
. I'd like to getelm-explorations/test
to use it instead of the current QuickCheck-like shrinking.Also: https://discourse.elm-lang.org/t/elm-minithesis-shrinking-without-compromises/6071
Also for differences between the two types of shrinking: https://hypothesis.works/articles/integrated-shrinking/
Hello!
Recently @'drmaciver has created a minimalistic version of Hypothesis called Minithesis.
(As you might know, Hypothesis is a property-based testing library that uses a different approach to shrinking which allows it to use
andThen
without issues etc.)I've ported it to Elm: elm-minithesis.
My question is: would you be up to having it or your own implementation of it in
elm-explorations/test
? IMHO there are benefits to it; I still have to benchmark / test / compareelm-minithesis
againstelm-explorations/test
to realize where the drawbacks are.PS: Having one testing library instead of many is, I think, a nice property of the Elm ecosystem, so even if you didn't want to incorporate this approach into
elm-explorations/test
, I most likely won't be creating a new CLI tool from the ground up, and instead will just publish it as an Elm package which will interop withelm-explorations/test
as seamlessly as possible.