Janiczek opened 1 year ago
Do we have any concerns about memory use, particularly for large run count scenarios?
I can try to measure that.
For now, here are the runtime numbers. I've tried over multiple repos and the result is roughly the same everywhere: skipping doesn't give any performance benefit (it adds overhead, if anything); all it does is give you more diverse inputs.
What number of runs are those charts for?
@gampleman 50 runs per configuration
I wonder if 50 is too low to:

a) manifest all that much duplication
b) exert significant memory pressure from potentially large sets of generated values
Could we also test with say 10,000 runs?
> Could we also test with say 10,000 runs?
Each run of the elm-test test suite (where various fuzzers differ: some have `runs=100`, some have `runs=10000`) takes about 10s on average. Getting 10k test-suite runs would take me 27 hours straight (unless somehow parallelized).
EDIT: sorry if I previously led you to believe the number 50 was for the `runs=...` configuration!
Basically, what the charts above show is: I run something akin to `for SEED in {1..50}; do time elm-test --seed=$SEED; done` and collect the numbers into a table. I do this for various configurations (`master` code, this PR with multiplier 1, 2, 5, 10) to compare the effect on runtime across a large test suite.
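The measurement loop above can be sketched as a small Python harness. This is an illustrative sketch only: the real benchmark shells out to `elm-test --seed=...` per configuration, but here a no-op Python process stands in for `elm-test` so the sketch is runnable anywhere.

```python
import statistics
import subprocess
import sys
import time

def time_runs(cmd_for_seed, seeds):
    """Time one command invocation per seed; return per-seed wall-clock times."""
    times = []
    for seed in seeds:
        start = time.perf_counter()
        subprocess.run(cmd_for_seed(seed), check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times

# For the real benchmark this would be, per configuration:
#   cmd = lambda seed: ["elm-test", f"--seed={seed}"]
# Demonstrated here with a no-op process instead of elm-test:
cmd = lambda seed: [sys.executable, "-c", "pass"]
times = time_runs(cmd, range(1, 6))
print(round(statistics.mean(times), 3))
```

The per-seed times would then be tabulated per configuration (`master`, multiplier 1/2/5/10) and compared.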
Ah OK, that makes sense. I was a bit surprised you were doing fewer than the default 100...
Out of curiosity I'll try another set of 50 seeds × {master, skip01, skip02, skip05, skip10} with the `--fuzz=10000` option active, to see whether perhaps there is some difference there. But I'd only expect performance savings where the overhead of generating+skipping extra values is lower than that of running the test on the would-be-skipped values -- this PR is not about short-circuiting 🙂
Tried the same thing with `defaultRuns = 10000`, as said above.
So e.g. in test suite runs taking ~18s, the skip x2 approach added ~1.5s to the runtime.
Yeah, I think we would need some nice test for the F-metric to be able to also see the upside of a PR like this. FWIW I don't think that's too bad a perf degradation, and it ultimately helps with the mission of actually finding bugs, so 👍
> F-metric
I think I vaguely recall this being mentioned in our discussion of quasirandom numbers? I'm a stats noob; could you please share some Wikipedia/paper links for its definition etc.? My googling returns F-score (not sure that's what you're talking about) and F-metric being some kind of web-development thing (that definitely isn't it) :)
It's a horrible name, since it's almost impossible to google for. I mentioned it [here]():
> [F-metric is] defined as the number of test cases the test system needs to generate before a defect is uncovered
So I suspect we could make a benchmark of fuzz tests that fail in various circumstances, run them with a high number of `runs`, and report how many runs it took to find each failure condition. I suspect this PR would then show a potentially substantial improvement on such a benchmark.
You could also report not the number of `runs` but the time it took to find each bug. That would be a fairly sensible performance metric, since it's in some ways more meaningful than how long it takes to generate and run some arbitrary number of cases.
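The benchmark described above could be sketched like this (Python for illustration; the generator, the "defect" predicate, and all names here are hypothetical, not part of elm-test):

```python
import random
import time

def f_metric(generate, is_failure, max_runs=10_000, seed=0):
    """Count how many generated cases it takes to uncover a defect.

    Returns (runs_needed, elapsed_seconds), or (None, elapsed) if the
    defect was never hit within max_runs.
    """
    rng = random.Random(seed)
    start = time.perf_counter()
    for run in range(1, max_runs + 1):
        case = generate(rng)
        if is_failure(case):
            return run, time.perf_counter() - start
    return None, time.perf_counter() - start

# Hypothetical "defect": the code under test breaks for multiples of 997.
gen = lambda rng: rng.randrange(1_000_000)
bug = lambda n: n % 997 == 0

runs, elapsed = f_metric(gen, bug)
print(runs, elapsed)
```

Running this per fuzzer configuration (master vs. the skip variants) and comparing the run counts and/or elapsed times would be the F-metric comparison proposed above.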
WIP
Closes #11 without needing a major version, since we don't add a new `Failure` variant.
Allows `FuzzOptions.runs * 2` skips of values already seen. This, ironically, speeds up our fuzzer distribution machinery as well. Exhaustive checking (#188) would help us save even more unneeded work (we'd know when we can stop skipping) -- and I want to do that -- but I'm not trying it in this PR. The skipping works reasonably well even without it.
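The core idea, retrying generation for already-seen values with a cap of `runs * 2` total skips, can be sketched as follows. This is a Python illustration only; the actual implementation lives in the Elm fuzzer internals and differs in detail:

```python
import random

def distinct_cases(generate, runs, seed=0):
    """Yield `runs` fuzz cases, skipping duplicates where possible.

    At most runs * 2 duplicate values are skipped in total; once the
    budget is exhausted, duplicates are yielded anyway rather than
    looping forever on a small value domain.
    """
    rng = random.Random(seed)
    seen = set()
    skips_left = runs * 2
    produced = 0
    while produced < runs:
        value = generate(rng)
        if value in seen and skips_left > 0:
            skips_left -= 1
            continue  # skip this duplicate and generate a fresh value
        seen.add(value)
        yield value
        produced += 1

cases = list(distinct_cases(lambda rng: rng.randrange(5), runs=5))
print(cases)
```

The fixed skip budget is what keeps the overhead bounded: in the worst case (a generator with very few distinct values) the loop does at most `3 * runs` generation steps before giving up on uniqueness.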
Should be working already! Now just measuring whether this has any noticeable impact, with `tests/randomized-tests.sh`.