elm-explorations / test

Write unit and fuzz tests for Elm code.
https://package.elm-lang.org/packages/elm-explorations/test/latest
BSD 3-Clause "New" or "Revised" License

Default runs too low #190

Open gampleman opened 2 years ago

gampleman commented 2 years ago

A gripe I've had for a while is that the default of 100 runs is way too low to get decent coverage in most scenarios, and in my experience it is too low for most test suites.

I usually recommend about 10,000 runs as a base number, then adjusting based on the desired run time.

I think that, from a DX perspective, specifying the number of runs should really be considered an abstraction leak. Property tests assert that a condition holds for all inputs meeting some criteria; the implementation detail of verifying that assertion is generating a certain number of samples. But the user doesn't necessarily have a good mental model of how many samples that should be (and indeed, working this out requires some fairly non-trivial statistics, as well as knowledge of implementation details of the fuzzers, etc.).
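To make the statistics concrete: if a bug is triggered by some fraction p of the input space and samples are drawn independently, the chance that a whole fuzz run misses it is (1 - p)^runs. A quick sketch (the numbers are illustrative only, not tied to elm-test's actual fuzzers):

```python
def miss_probability(p: float, runs: int) -> float:
    """Chance that `runs` independent samples all avoid a buggy
    region that a fraction p of inputs would trigger."""
    return (1 - p) ** runs

# a bug hit by 0.1% of inputs:
print(miss_probability(0.001, 100))     # roughly 0.905: 100 runs usually miss it
print(miss_probability(0.001, 10_000))  # roughly 4.5e-05: 10,000 runs almost never do
```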

Here are some practical suggestions on how to improve this:

  1. Let the user specify the (wall-clock) time they want the test suite to run for. This is nice since, for instance, in watch mode we might want to prioritise fast iteration time, while in CI we often have other jobs running in parallel, so we have a pretty good idea of how much "spare" time our tests can take.
  2. Specify a minimum coverage as a percentage (this would make more sense with #188), i.e. we want to validate a certain percentage of the available input space. (Ergonomically this might be nicer to specify in some smaller unit, like 1/1,000,000 or some such.) This is nice in the sense that it directly specifies our certainty of not having a bug :)
  3. Have labelling (#94) and run enough tests to achieve the desired distribution on each label.
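Suggestion 1 could amount to a runner-side loop like the following sketch (the names `check`, `generate`, and the budget parameter are hypothetical; nothing like this exists in elm-test today):

```python
import random
import time

def run_for_budget(check, generate, budget_seconds):
    """Keep generating and checking cases until the wall-clock budget
    is spent; return how many runs were completed."""
    deadline = time.monotonic() + budget_seconds
    runs = 0
    while time.monotonic() < deadline:
        case = generate()
        assert check(case), f"counterexample: {case!r}"
        runs += 1
    return runs

# e.g. spend 50 ms checking that reversing a list twice is the identity
runs = run_for_budget(
    lambda xs: list(reversed(list(reversed(xs)))) == xs,
    lambda: [random.randint(-100, 100) for _ in range(random.randint(0, 20))],
    0.05,
)
```

With this shape, watch mode could pass a small budget and CI a large one, without the test author ever choosing a run count.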

A separate issue that could be resolved much more quickly (and is also a breaking change) is that Test.fuzzWith expects an absolute number of runs. I think this is un-ergonomic, since it's a value one needs to keep fiddling with. A nicer design would be a multiplier of the globally configured value. This covers both "this test is super slow, so let's not waste too much time testing it" and "this test has highly variable behaviour, so let's spend a lot of our time exploring its input space", while still letting the test runner influence the total number of tests to run.
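The multiplier idea boils down to one small computation on the runner side (a sketch; the function name and defaults are made up):

```python
def effective_runs(global_runs, multiplier=1.0):
    """Per-test run count expressed as a multiplier of the globally
    configured value, so the runner keeps control of the total budget."""
    return max(1, round(global_runs * multiplier))

assert effective_runs(10_000) == 10_000        # default: inherit the global setting
assert effective_runs(10_000, 0.1) == 1_000    # slow test: spend less time on it
assert effective_runs(10_000, 5.0) == 50_000   # tricky input space: spend more
```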

Janiczek commented 2 years ago

I like point 1); in fact, in addition to saying "run for N runs" and "run for N seconds", I'd like to be able to say "run indefinitely", although that could probably be emulated with a high enough number of runs or seconds.

Re 2) - for integers the space is finite but huge, so this could work there, but what about lists and strings and other collections? Would we arbitrarily decide that "the input space is all lists below 50 elements"?

Re 3) I'm not completely sure these are related.

The number of tests needed grows as the real distribution of the label nears the wanted distribution. (In Hughes' talk the numbers might be made up, but anyway: with distributions 4.231% and 5% it took 51,200 generated values to verify it will never reach 5%, and with distributions 4.123% and 5% it took 102,400 generated values.)

EDIT: actually, now that I read this, it's backwards. We should look into the Haskell code implementing this to understand the relationship better.
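For a rough sense of why the needed sample count grows as the real and wanted distributions get closer, here is a back-of-the-envelope normal-approximation sketch (my own, not the QuickCheck implementation); under it, the closer pair (4.231% vs 5%) does need more samples than the farther one (4.123% vs 5%):

```python
def samples_to_distinguish(p_real, p_target, z=3.0):
    """Very rough sample count for the observed frequency of a label
    (true proportion p_real) to sit z standard errors away from a
    target proportion p_target. Grows like 1 / gap^2."""
    gap = abs(p_real - p_target)
    return int(z * z * p_real * (1 - p_real) / (gap * gap)) + 1

close = samples_to_distinguish(0.04231, 0.05)  # closer to 5%: needs more samples
far = samples_to_distinguish(0.04123, 0.05)    # farther from 5%: needs fewer
```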

This seems different from verification (with some probability p) that the test will never fail. Again I don't know how we'd find the needed number of tests.

The one metric that I believe could tell us whether we've tested the program enough is coverage-guided generation (like AFL does), perhaps with some symbolic execution sprinkled in. If you went through all the meaningfully different paths (if x < 5 then path1 else path2 only splits the values you need to check into n | n < 5 and n | n >= 5, and inside these categories any value will do), then perhaps you could say, with some certainty, "OK, we can stop fuzzing, we will not find anything new".
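A toy sketch of that stopping rule (nothing like AFL's real scheduling; `program` returning a set of branch ids is a stand-in for instrumented coverage):

```python
import random

def fuzz_until_plateau(program, generate, patience=1000):
    """Toy coverage-guided stopping rule: keep fuzzing, and stop once
    `patience` consecutive inputs exercise no new branch."""
    covered = set()
    stale = 0
    runs = 0
    while stale < patience:
        branches = program(generate())
        runs += 1
        if branches - covered:
            covered |= branches
            stale = 0
        else:
            stale += 1
    return covered, runs

# the "if x < 5" example: only two meaningfully different paths
def toy_program(x):
    return {"path1"} if x < 5 else {"path2"}

covered, runs = fuzz_until_plateau(toy_program, lambda: random.randint(-10, 10))
```

Once both paths have been seen, no input ever adds coverage, so the loop gives up after `patience` fruitless runs instead of burning a fixed run count.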

gampleman commented 2 years ago

Yeah, that would only work in combination with specifying some memory limit your application has to fit in. If you specify that your application has to fit into, e.g., 100 MB of RAM, then all data structures are finite.