gasche opened this issue 10 years ago
I really like this nice summary of the literature. I believe that sticking to the restricted framework is ok for this work, and that part of this discussion could make it into the related work discussion (saying: this is a hard problem, and we do not attempt to solve it here).
I also think that this could make a good gagallium blog post.
Thomas
My task this weekend was to provide the necessary functions to do property-based testing, in the style of QuickCheck.
I have experimented with various things (again), but the problem is quite tricky and I'm finally converging towards the idea of keeping things simple and specific. Below is an explanation of the issues at hand, and then a compilation of (a part of) the relevant bibliography, which more or less confirms that the state of the art (with one notable exception) has no consensual general solution to these problems.
Property language: general or specific?
PL researchers like well-specified, general and expressive description languages. In the case of property-based testing, the most natural idea is to accept any first-order logic formula (true/false, and/or/not, forall/exists) as a property to check. The general question is whether we can give a satisfying semantics to the complete language, or if there is a well-defined subclass we should restrict ourselves to.
General language
There are various immediate problems with the idea of taking the full first-order logic. For example, it is not exactly clear what it means to "test" a property of the form ∃(x:t). P(x). Should you generate 100 elements at type t, and declare that no such x exists if none of them satisfies P? The usual idea about property-based testing is that if the tester returns "no", then there must be a bug in the program (while a "yes" is no proof, only added confidence); this interpretation of the existential would break that reading, as a "no" would only mean we haven't tested long enough.

Even if you rule out existentials, "or" is still a source of trouble: if you test (forall(x:t) P[x]) \/ Q and you don't find a counter-example on the left-hand side, does it mean that the whole formula should evaluate to "true", or should one also try the right-hand side?

This suggests that to do things properly, one should take into account, among other ingredients, whether the quantification domain can be exhaustively enumerated: you can decide ∃(b:bool) P[b] by checking both booleans, but you cannot decide ∃(n:int) P[n] that way.

I started implementing those subtleties and, unsurprisingly, it grew long and overly complicated. I think it is interesting to push further, but it is not the main point of Articheck, and not reasonable in our time frame.
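To make the asymmetry concrete, here is a minimal OCaml sketch of what "testing" an existential by random sampling would look like (the name naive_exists and its interface are made up for illustration, not part of Articheck):

```ocaml
(* Hypothetical sketch: "test" an existential by drawing [tries] random
   values and reporting whether any of them satisfies [p]. *)
let naive_exists ?(tries = 100) (gen : unit -> 'a) (p : 'a -> bool) : bool =
  let rec loop i = i < tries && (p (gen ()) || loop (i + 1)) in
  loop 0
```

A true answer comes with an actual witness and is conclusive; a false answer only says that the draws were unlucky. This is exactly the reverse of the usual property-testing contract, where the conclusive answer is the negative one (a counter-example).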
Simple restricted language: generator + precondition + test
The simplest restricted language that comes to mind is to restrict to properties defined as a triple of a list of generators/variables, a precondition, and a quantifier-free, implication-free "test" formula (just a boolean expression):

forall (x1:t1) ... (xn:tn), precondition[x1,...,xn] => test[x1,...,xn]
It is important to single out the precondition, rather than encoding it as a plain implication inside the test formula, because the usual boolean interpretation of implication is not satisfying: if a random draw of the variables fails the precondition, we should not count it as a passed test, because it does not actually increase our confidence in the general result. (Consider well-typed[x] => reduces-to-value[x]: it is easy to randomly generate 100 x that are not well-typed, but that tells you nothing about the property.) Such a draw is instead counted as "discarded", and we should keep track of discarded draws so as to alert the user if we make too many of them without being able to run the test: "please reformulate the property in a more efficient way". The user is free to use implications in the test part, but they won't have this specific discarding semantics.

Perhaps surprisingly, this is in fact the restriction used by most of the related work I've seen. I suggest that we stick to it.
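As a concrete illustration, here is a rough OCaml sketch of this restricted language and of the discarding semantics. The names (property, check, the counters and limits) are made up for this note and are not meant to match Articheck's actual interface:

```ocaml
(* Hypothetical sketch of the restricted property language:
   a generator, a precondition, and a quantifier-free boolean test. *)
type 'a property = {
  gen : unit -> 'a;       (* random generator for the quantified variables *)
  precond : 'a -> bool;   (* draws failing this are discarded, not "passed" *)
  test : 'a -> bool;      (* the boolean test proper *)
}

(* Run up to [max_tests] tests; give up with a warning when too many draws
   are discarded, so the user knows to reformulate the property. *)
let check ?(max_tests = 100) ?(max_discards = 1000)
    (show : 'a -> string) (prop : 'a property) =
  let rec loop passed discarded =
    if passed >= max_tests then
      Printf.printf "OK: %d tests passed (%d draws discarded)\n"
        passed discarded
    else if discarded >= max_discards then
      Printf.printf
        "Gave up: %d draws discarded, only %d tests run; \
         please reformulate the property\n" discarded passed
    else
      let x = prop.gen () in
      if not (prop.precond x) then loop passed (discarded + 1)
      else if prop.test x then loop (passed + 1) discarded
      else Printf.printf "Counter-example found: %s\n" (show x)
  in
  loop 0 0
```

A property over several variables is represented by generating the tuple of all of them at once; the crucial point is that a draw failing the precondition increments the discarded counter rather than the passed one.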
Relaxed restriction: nested preconditions
Admitting that positive connectives (existentials and "or") are awkward in a testing setting, could we at least let the user interleave forall-quantification and implications? There are situations where one may want to write

forall (t:term), well-scoped[t] => forall (σ:type), well-typed[t,σ] => test[t,σ]

instead of

forall (t:term) (σ:type), (well-scoped[t] && well-typed[t,σ]) => test[t,σ]

If we know that most randomly drawn terms are not well-scoped, it is silly to still draw a type each time before discarding the test anyway.
While it seems reasonable, this relaxed sublanguage also comes with its share of non-trivial questions. Suppose you have drawn a term t that is well-scoped, but the type you then draw is not a correct type for t. Should you keep the same term and try to draw a new type, or start from the top level again? If it is very hard to find a well-scoped term, throwing away this good candidate is silly. But reusing it too much affects the probability space: what if this term actually cannot be typed?

It is quite clear that there is a notion of "fairness" that should determine an exploration order for the state space. For example, if we consider that we draw elements in the cartesian product set term × type, there are known techniques (diagonal traversal, etc.) to enumerate pairs of this set in a "fair" way, instead of, for example, picking a single t first and returning all of {(t,σ) | σ:type}.
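For reference, here is a small OCaml sketch of such a diagonal traversal over two finite arrays of candidates (the function name and the choice of arrays are illustrative only):

```ocaml
(* Enumerate the pairs of xs × ys diagonal by diagonal: the d-th diagonal
   contains the pairs (xs.(i), ys.(d - i)), so no single xs.(i) can
   monopolize the beginning of the enumeration. *)
let diagonal_pairs (xs : 'a array) (ys : 'b array) : ('a * 'b) list =
  let nx = Array.length xs and ny = Array.length ys in
  List.concat_map
    (fun d ->
       List.filter_map
         (fun i ->
            let j = d - i in
            if i < nx && 0 <= j && j < ny then Some (xs.(i), ys.(j)) else None)
         (List.init (d + 1) (fun i -> i)))
    (List.init (max 0 (nx + ny - 1)) (fun d -> d))
```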
One possible way would be, for example, to determine a starting amount of "fuel" for the whole formula. Given N units of fuel, a quantifier (random generator) would pick a number of tries t and (uniformly) sub-fuel amounts n1,...,nt such that the sum of all the sub-fuels is N; it would then draw t different values randomly, and give the i-th subformula the amount of fuel ni to itself draw values and run tests.
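A rough OCaml sketch of that idea follows; all names, as well as the particular fuel-splitting and tries-picking strategies, are made up for illustration:

```ocaml
(* Split [fuel] into [tries] non-negative amounts summing to [fuel], by
   drawing (tries - 1) random cut points and taking consecutive differences. *)
let split_fuel ~tries ~fuel =
  let cuts = List.init (tries - 1) (fun _ -> Random.int (fuel + 1)) in
  let cuts = List.sort compare ((0 :: cuts) @ [fuel]) in
  let rec diffs = function
    | a :: (b :: _ as rest) -> (b - a) :: diffs rest
    | _ -> []
  in
  diffs cuts

(* A (sub-)formula is something we can check with a given fuel budget;
   it returns [false] when a counter-example was found. *)
type prop = fuel:int -> bool

(* Quantifier node: pick a number of tries, split the fuel among them,
   then draw one value per try and check the body with its sub-fuel. *)
let forall (gen : unit -> 'a) (body : 'a -> prop) : prop =
 fun ~fuel ->
  let tries = max 1 (min fuel (1 + Random.int 10)) in
  let subfuels = split_fuel ~tries ~fuel in
  List.for_all (fun n -> body (gen ()) ~fuel:n) subfuels

(* Leaf node: a plain boolean test, which ignores its fuel. *)
let test (p : unit -> bool) : prop = fun ~fuel:_ -> p ()
```

Nested properties such as the well-scoped/well-typed example above would then be written by nesting forall nodes, each level spending a share of the fuel it received.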
Synthesis

It seems unrealistic to capture the full first-order logic as a property description language. The safe choice is the restricted language, with a specific semantics for the precondition implication. A relaxation of the restricted language with nested implications and forall-quantifiers would be nice, but would require non-trivial work.
When I realized the issue, I decided to look into the literature for guidance on the problem. Surely someone had solved it by proposing a good semantics for testing of full first-order formulas? I include my synthesis of what I looked at below, but the short answer is "no, except maybe Quickcheck/Isabelle".
Related work
Functional Testing in the Focal Environment
Matthieu Carlier, Catherine Dubois, 2008
Properties are split into "elementary properties", restricted to a plain structure of the form
forall variables, preconditions => test
Finding Counter Examples in Induction Proofs
Koen Claessen, Hans Svensson, 2008
Domain-specific notion of "adaptation" to generate elements that may be counter-examples (that is, elements likely to satisfy the preconditions but falsify the test).
Relevant quote:
Ki Yung Ahn, Ewen Denney, 2010 (journal version 2013)
An interesting toolchain:
Stefan Berghofer, Tobias Nipkow, 2004
(If you decide to look at one of those articles, pick this one.)
Details a way to generate random generators from inductive type definitions, such that the length of generated values is a uniform random variable -- they were careful about that.
In Isabelle, functions are often defined relationally, so the article details how to use a mode analysis (which assigns an "input" and "output" role to the various arguments of the relation) to generate computable ML code from an inductive predicate.
A surprising and interesting idea is to model logic formulas (those we want to test) also as inductive predicates -- interestingly, Christine Paulin-Mohring is credited for first using this idea in the context of proof assistants. This suggests that free variables in a formula may be given modes, instead of all being considered as inputs.
Traditionally we see a formula F[x,y,z] as meaning "please give me randomly generated x, y and z, and I'll return you True or False": the variables are all inputs, the output is a boolean. But we may also consider the meaning: generate x and y and, assuming that the formula returns False, please give me a corresponding z (if you want, a counter-example of F[x,y,?] = True).

This point of view helps with the "precondition" problem: why is it that we want to give a specific status to the "precond" part of formulas of the form precond[x] => test? When precond[x] is false, we want to re-generate another x, but we don't want to count the draw as a successful test (consider again well-typed[x] => reduces-to-value[x]). Berghofer and Nipkow suggest changing the mode of precond[x] to reflect this point of view: you should consider that x is the output of the query precond[x], which must be satisfied, not its input.

(Remark: Nipkow later improved over this Quickcheck-in-Isabelle by designing Nitpick, which uses a sophisticated model finder to generate counter-examples instead of random generation. I think the Nitpick paper ("Nitpick: A Counterexample Generator for Higher-Order Logic Based on a Relational Model Finder", by Jasmin Christian Blanchette and Tobias Nipkow, 2010) is however less relevant to Articheck as it exists today, as it is mainly about massaging higher-order logic into a form that a SAT solver can understand -- including coinductives, etc.)
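To make the mode idea concrete outside of Isabelle, here is a toy OCaml contrast; gen_even and the "is even" precondition are made-up examples, not taken from the paper:

```ocaml
(* Input mode: x is an input of precond; draw blindly, then filter.
   When the precondition is rarely satisfied, most draws are wasted. *)
let rec draw_satisfying (gen : unit -> 'a) (precond : 'a -> bool) : 'a =
  let x = gen () in
  if precond x then x else draw_satisfying gen precond

(* Output mode: x is an output of the query precond[x]; the generator
   itself only produces satisfying values, e.g. for precond = "is even",
   generate even numbers directly instead of filtering odd ones out. *)
let gen_even () = 2 * Random.int 1000
```

Generating directly from the precondition is, roughly, what the mode analysis buys when the precondition is defined as an inductive predicate.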