gampleman opened 2 years ago
I like point 1) - in fact, in addition to saying "run for N runs" and "run for N seconds", I'd like to be able to say "run indefinitely", although that could probably be emulated with a high enough number of runs or seconds.
Re 2) - for integers the space is finite but huge, so this could work there, but what about lists, strings, and other collections? Would we arbitrarily decide "the input space is all lists below 50 elements"?
Re 3) I'm not completely sure these are related.
The number of tests needed grows as the real distribution of the label nears the wanted distribution. (In Hughes' talk the numbers might be made up, but anyway: with distributions 4.231% and 5% it took 51200 generated values to verify it will never reach 5%, and with distributions 4.123% and 5% it took 102400 generated values.)
EDIT: actually, now that I read this, it's backwards. We should look into the Haskell code implementing this to understand the relationship better.
This seems different from verification (with some probability p) that the test will never fail. Again I don't know how we'd find the needed number of tests.
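To make the "number of generated values" idea concrete, here is a rough Python sketch of the sequential-testing intuition: keep doubling the sample count until a confidence interval around the observed label rate excludes the target rate. This is only an approximation of the idea, not the Haskell implementation - the Wilson score interval and the use of the expected count `true_rate * n` as a stand-in for actual sampling are my assumptions.

```python
import math

def wilson_interval(successes, n, z=2.576):  # z ~ 99% confidence (assumption)
    """Wilson score interval for a binomial proportion."""
    phat = successes / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return center - half, center + half

def runs_needed(true_rate, target, start=100):
    """Double the sample count until the interval around the observed rate
    excludes the target, i.e. we can conclude it will never reach it.
    Uses the expected count true_rate * n instead of actually sampling."""
    n = start
    while True:
        lo, hi = wilson_interval(true_rate * n, n)
        if hi < target or lo > target:
            return n
        n *= 2

print(runs_needed(0.04231, 0.05))
print(runs_needed(0.04123, 0.05))
```

Note how the closer rate (4.231%) needs at least as many samples as the farther one (4.123%), which is the relationship the EDIT above points at.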
The one metric that I believe would be able to tell us whether we've tested the program enough is code-coverage-guided generation (like AFL does), perhaps with some symbolic execution sprinkled in. If you went through all the meaningfully different paths (`if x < 5 then path1 else path2` only splits the values you need to check into `n | n < 5` and `n | n >= 5`, and inside these categories any value will do), then perhaps you could say "OK, we can stop fuzzing, we will not find anything new" with some certainty.
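A minimal Python sketch of that stopping rule: generate inputs, track which paths have been seen (a toy stand-in for the compile-time coverage instrumentation AFL uses), and stop once nothing new has shown up for a while. All names here are hypothetical.

```python
import random

def program_path(x):
    """Toy program under test; returns the branch taken, standing in for
    real coverage instrumentation."""
    return "path1" if x < 5 else "path2"

def coverage_guided_fuzz(max_stale=1000, seed=0):
    """Generate inputs until `max_stale` consecutive inputs discover no new
    path - a rough 'we will not find anything new' signal."""
    rng = random.Random(seed)
    seen, stale, tried = set(), 0, 0
    while stale < max_stale:
        x = rng.randint(-1000, 1000)
        tried += 1
        path = program_path(x)
        if path not in seen:
            seen.add(path)
            stale = 0
        else:
            stale += 1
    return seen, tried

paths, tried = coverage_guided_fuzz()
print(paths)  # both branches of `x < 5` get discovered quickly
```

Real coverage-guided fuzzers also mutate inputs that found new coverage rather than sampling blindly, but the termination idea is the same.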
Yeah, that would work only in addition to specifying some memory limit your application has to fit in. If you specify that your application has to fit into, e.g., 100 MB of RAM, then all data structures are finite.
A gripe I've had for a while is that the default of 100 runs is way too low to get decent coverage in most scenarios, and in my experience is too low for most test suites.
I usually recommend about 10,000 runs as a base number, then adjust based on desired run time.
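A quick back-of-the-envelope for why 100 runs is low: the chance of hitting an input class that occurs with probability p at least once in N independent runs is 1 - (1 - p)^N. A small sketch (the helper name is mine):

```python
def hit_probability(p, runs):
    """Chance that at least one of `runs` independent samples lands in an
    input class that occurs with probability p."""
    return 1 - (1 - p) ** runs

# A bug triggered by ~1% of generated inputs:
print(round(hit_probability(0.01, 100), 3))    # ~0.634 with the default 100 runs
print(round(hit_probability(0.01, 10_000), 3)) # effectively 1.0 with 10,000 runs
```

So the default has roughly a one-in-three chance of missing a 1%-of-inputs bug entirely on any given run of the suite.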
I think from a DX perspective, having it specified as a number of runs should really be considered an abstraction leak. Property tests assert that a condition holds for all inputs meeting some criteria; the implementation detail of verifying that assertion is generating a certain number of samples, but the user doesn't necessarily have a great mental model of how many samples there should be (and indeed, understanding this requires some fairly non-trivial statistics, as well as knowledge of implementation details of the fuzzers, etc.).
Here are some practical suggestions on how to improve this:
A separate issue that could be resolved much more quickly (and is also breaking) is that `Test.fuzzWith` expects an absolute number of runs. I think this is unergonomic, since it's a value one needs to keep messing with. A nicer design would be a multiplier of the globally configured value. This could be used both for "this test is super slow, so let's not waste too much time testing it" and "this test has highly variable behaviour, so let's spend a lot of our time exploring the input space", while still letting the test runner influence the total number of tests to run.