elm-explorations / test

Write unit and fuzz tests for Elm code.
https://package.elm-lang.org/packages/elm-explorations/test/latest
BSD 3-Clause "New" or "Revised" License

Skip values (moreso, RandomRuns) we have already tested #207

Open · Janiczek opened this pull request 1 year ago

Janiczek commented 1 year ago

WIP

Closes #11 without needing a major version bump, since we don't add a new Failure variant.

Allows FuzzOptions.runs * 2 skips of values already seen. This, ironically, speeds up our fuzzer distribution machinery as well.
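
To illustrate the idea, here is a rough sketch with made-up names (it is not the actual RandomRun-based implementation): values we have already run are remembered, a bounded skip budget is spent on regenerating instead of re-testing duplicates, and once the budget is exhausted duplicates are simply run again so the worst case stays bounded.

```elm
module SkipSketch exposing (runFuzzLoop)

import Set exposing (Set)


{-| Hypothetical sketch of the skipping idea, NOT the actual implementation
(which works on RandomRuns): keep a Set of inputs we have already tested and
allow a bounded number of skips -- regenerations -- before giving up on
finding a fresh value and re-running a duplicate.
-}
runFuzzLoop :
    { runs : Int, maxSkips : Int }
    -> (Int -> comparable) -- hypothetical generator, seeded by an Int
    -> (comparable -> Bool) -- the property under test
    -> Bool
runFuzzLoop options generate property =
    let
        go seed runsLeft skipsLeft seen =
            if runsLeft <= 0 then
                -- all runs passed
                True

            else
                let
                    value =
                        generate seed
                in
                if Set.member value seen && skipsLeft > 0 then
                    -- already tested this value: skip it instead of re-running,
                    -- spending one unit of the skip budget (runs * 2 in this PR)
                    go (seed + 1) runsLeft (skipsLeft - 1) seen

                else if property value then
                    go (seed + 1) (runsLeft - 1) skipsLeft (Set.insert value seen)

                else
                    -- found a counterexample
                    False
    in
    go 0 options.runs options.maxSkips Set.empty
```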

Exhaustive checking (#188) would help us save even more unneeded work (we'd know we can stop skipping) -- and I want to do that -- but I'm not trying that in this PR. The skipping works reasonably well even without it.


Should be working already! Now just measuring whether this has any noticeable impact, with tests/randomized-tests.sh.

Janiczek commented 1 year ago

> Do we have any concerns about memory use, particularly for large run count scenarios?

I can try to measure that.

For now, here are the runtime numbers. I've tried this on multiple repos and the result is roughly the same everywhere: skipping doesn't give any performance benefit (if anything, it adds overhead); all it does is give you more diverse inputs.

(Screenshots: runtime comparison charts, taken 2022-10-14.)
gampleman commented 1 year ago

What number of runs are those charts for?

Janiczek commented 1 year ago

@gampleman 50 runs per configuration

gampleman commented 1 year ago

I wonder if 50 is too low to:

a) manifest all that much duplication
b) exert significant memory pressure from potentially large sets of generated values

Could we also test with say 10,000 runs?

Janiczek commented 1 year ago

> Could we also test with say 10,000 runs?

Each run of the elm-test test suite (where run counts vary between fuzz tests: some use runs=100, some runs=10000) takes about 10s on average. Getting 10k test-suite runs would take me 27 hours straight (unless somehow parallelized).

EDIT: sorry if I previously led you to believe the number 50 was the runs=... configuration! Basically, what the charts above show is: I run something akin to `for SEED in {1..50}; do time elm-test --seed=$SEED; done` and collect the numbers into a table. I do this for various configurations (master code, this PR with multiplier 1, 2, 5, 10) to compare the effect on runtime across a large test suite.
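
For reference, per-test run counts like the runs=100 / runs=10000 mentioned above are configured with Test.fuzzWith. A minimal sketch, assuming the 2.x shape of FuzzOptions (which also takes a distribution field; on 1.x it is just `{ runs = 10000 }`):

```elm
module RunCountExample exposing (suite)

-- Sketch assuming elm-explorations/test 2.x, where FuzzOptions also
-- carries a distribution field alongside the run count.

import Expect
import Fuzz
import Test exposing (Test)


suite : Test
suite =
    Test.describe "per-test run counts"
        [ -- The default run count (100) applies when using Test.fuzz.
          Test.fuzz Fuzz.int "singleton list reversal round-trips" <|
            \n -> List.reverse [ n ] |> Expect.equal [ n ]
        , -- Explicit run count; the skip budget from this PR would be runs * 2 = 20000.
          Test.fuzzWith
            { runs = 10000
            , distribution = Test.noDistribution
            }
            (Fuzz.intRange 0 100)
            "addition is commutative"
            (\a -> a + 1 |> Expect.equal (1 + a))
        ]
```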

gampleman commented 1 year ago

Ah OK, that makes sense. I was a bit surprised you were doing fewer than the default 100...

Janiczek commented 1 year ago

Out of curiosity I'll try another set of 50 seeds × {master, skip01, skip02, skip05, skip10} with the --fuzz=10000 option active, to see whether there is some difference there. But I'd only expect performance savings where the overhead of generating and skipping extra values is lower than that of running the test on the would-be-skipped values -- this PR is not about short-circuiting 🙂

Janiczek commented 1 year ago

Tried the same thing with defaultRuns = 10000, as described above.

(Screenshots: runtime comparison charts for defaultRuns = 10000, taken 2022-10-14.)

So e.g. in test-suite runs taking ~18s, the skip x2 approach added ~1.5s to the runtime.

gampleman commented 1 year ago

Yeah, I think we would need some nice test of the F-metric to be able to also see the upside of a PR like this. FWIW I don't think that's too bad a perf degradation, and it ultimately helps with the mission of actually finding bugs, so 👍

Janiczek commented 1 year ago

> F-metric

I think I vaguely recall this being mentioned in our discussion of quasirandom numbers? I'm a stats noob; could you please share some Wikipedia/paper links for its definition etc.? My googling returns F-score (not sure that's what you're talking about) and F-metric as some kind of web-development thing (that definitely isn't it) :)

gampleman commented 1 year ago

It's a horrible name, since it's almost impossible to google for. I mentioned it [here]():

> [The F-metric is] defined as the number of test cases the test system needs to generate before a defect is uncovered.

So I suspect we could make a benchmark of fuzz tests that fail in various circumstances, run them with a high number of runs, and report how many runs it took to find each failure condition. I'd expect this PR to show a potentially substantial improvement on such a benchmark.

You could also report not the number of runs but the time it took to find each bug. That would be a fairly sensible performance metric, since it's in some ways more meaningful than how long it takes to generate and run some arbitrary number of cases.
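
A hedged sketch of what one such benchmark case might look like (the module name, test name, and the planted failing range are invented for illustration): a property that fails only on a narrow slice of the input space, so the number of generated cases needed before elm-test reports the failure approximates the F-metric for that bug.

```elm
module FMetricBenchmark exposing (suite)

import Expect
import Fuzz
import Test exposing (Test)


suite : Test
suite =
    Test.describe "F-metric benchmark candidates"
        [ Test.fuzz (Fuzz.intRange 0 100000) "fails only on a narrow range" <|
            \n ->
                -- The planted "bug": the property is violated for 7 values out
                -- of 100001. The F-metric for this case is the number of
                -- generated inputs needed before one of them is hit; skipping
                -- duplicates spends the run budget on distinct candidates.
                (n < 42000 || n > 42006)
                    |> Expect.equal True
        ]
```

Running such a suite under master and under this PR with the same seeds, and comparing either the run counts or the wall-clock time to first failure, would give the two variants of the metric discussed above.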