gasche opened this issue 10 years ago
I really like this nice summary of the literature. I believe that sticking to the restricted framework is ok for this work, and that part of this discussion could make it into the related work discussion (saying: this is a hard problem, and we do not attempt to solve it here).
I also think that this could make a good gagallium blog post.
Thomas
My task this weekend was to provide the necessary functions to do property-based testing, in the style of QuickCheck.
I have experimented with various things (again), but the problem is quite tricky and I'm finally converging towards the idea of keeping things simple and specific. Below is an explanation of the issues at hand, and then a compilation of (a part of) the relevant bibliography, which more or less confirms that the state of the art (with one notable exception) has no consensual general solution to these problems.
Property language: general or specific?
PL researchers like well-specified, general and expressive description languages. In the case of property-based testing, the most natural idea is to accept any first-order logic formula (true/false, and/or/not, forall/exists) as a property to check. The general question is whether we can give a satisfying semantics to the complete language, or if there is a well-defined subclass we should restrict ourselves to.
General language
There are various immediate problems with the idea of taking the full first-order logic. For example, it is not exactly clear what it means to "test" a property of the form ∃(x:t). P(x). Should you generate 100 elements at type t, and declare that no such x exists if none of them satisfies P? The usual idea about property-based testing is that if the tester returns "no", then there must be a bug in the program (while a "yes" is no proof, only added confidence); this interpretation of the existential would break that reading, as a "no" would only mean we haven't tested long enough.

Even if you rule out existentials, "or" is still a source of trouble: if you test (forall(x:t) P[x]) \/ Q and you don't find a counter-example on the left-hand side, does it mean that the whole formula should evaluate to "true", or should one also try the right-hand side?

This suggests that to do things properly, one should take into account, among other ingredients, whether the quantification domain can be exhaustively enumerated: you can decide ∃(b:bool) P[b] by checking both booleans, but you cannot decide ∃(n:int) P[n] that way.

I started implementing those subtleties and, unsurprisingly, it grew long and overly complicated. I think it is interesting to push further, but it is not the main point of Articheck, and not reasonable in our time frame.
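To make the asymmetry concrete, here is a minimal OCaml sketch of what "testing" an existential by random sampling would look like (the name naive_exists and its interface are made up for illustration, not part of Articheck):

```ocaml
(* Hypothetical sketch: "test" an existential by drawing [tries] random
   values and reporting whether any of them satisfies [p]. *)
let naive_exists ?(tries = 100) (gen : unit -> 'a) (p : 'a -> bool) : bool =
  let rec loop i = i < tries && (p (gen ()) || loop (i + 1)) in
  loop 0
```

A true answer comes with an actual witness and is conclusive; a false answer only says that the draws were unlucky. This is exactly the reverse of the usual property-testing contract, where the conclusive answer is the negative one (a counter-example).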
Simple restricted language: generator + precondition + test
The simplest restricted language that comes to mind is to restrict to properties defined as a triple of a list of generators/variables, a precondition, and a quantifier-free, implication-free "test" formula (just a boolean expression):

forall (x1:t1) ... (xn:tn), precondition[x1,...,xn] => test[x1,...,xn]
It is important to single out the precondition, rather than encoding it as a plain implication inside the test formula, because the usual boolean interpretation of implication is not satisfying: if a random draw of the variables fails the precondition, we should not count it as a passed test, because it does not actually increase our confidence in the general result. (Consider well-typed[x] => reduces-to-value[x]: it is easy to randomly generate 100 x that are not well-typed, but that tells you nothing about the property.) Such a draw is instead counted as "discarded", and we should keep track of discarded draws so as to alert the user if we make too many of them without being able to run the test: "please reformulate the property in a more efficient way". The user is free to use implications in the test part, but they won't have this specific discarding semantics.

Perhaps surprisingly, this is in fact the restriction used by most of the related work I've seen. I suggest that we stick to it.
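As a concrete illustration, here is a rough OCaml sketch of this restricted language and of the discarding semantics. The names (property, check, the counters and limits) are made up for this note and are not meant to match Articheck's actual interface:

```ocaml
(* Hypothetical sketch of the restricted property language:
   a generator, a precondition, and a quantifier-free boolean test. *)
type 'a property = {
  gen : unit -> 'a;       (* random generator for the quantified variables *)
  precond : 'a -> bool;   (* draws failing this are discarded, not "passed" *)
  test : 'a -> bool;      (* the boolean test proper *)
}

(* Run up to [max_tests] tests; give up with a warning when too many draws
   are discarded, so the user knows to reformulate the property. *)
let check ?(max_tests = 100) ?(max_discards = 1000)
    (show : 'a -> string) (prop : 'a property) =
  let rec loop passed discarded =
    if passed >= max_tests then
      Printf.printf "OK: %d tests passed (%d draws discarded)\n"
        passed discarded
    else if discarded >= max_discards then
      Printf.printf
        "Gave up: %d draws discarded, only %d tests run; \
         please reformulate the property\n" discarded passed
    else
      let x = prop.gen () in
      if not (prop.precond x) then loop passed (discarded + 1)
      else if prop.test x then loop (passed + 1) discarded
      else Printf.printf "Counter-example found: %s\n" (show x)
  in
  loop 0 0
```

A property over several variables is represented by generating the tuple of all of them at once; the crucial point is that a draw failing the precondition increments the discarded counter rather than the passed one.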
Relaxed restriction: nested preconditions
Admitting that positive connectives (existentials and "or") are awkward in a testing setting, could we at least let the user interleave forall-quantification and implications? There are situations where one may want to write

forall (t:term), well-scoped[t] => forall (σ:type), well-typed[t,σ] => test[t,σ]

instead of

forall (t:term) (σ:type), (well-scoped[t] && well-typed[t,σ]) => test[t,σ]

If we know that most randomly drawn terms are not well-scoped, it is silly to still draw a type each time before discarding the test anyway.
While it seems reasonable, this relaxed sublanguage also comes with its share of non-trivial questions. Suppose you have drawn a term t that is well-scoped, but the type you then draw is not a correct type for t. Should you keep the same term and try to draw a new type, or start from the top level again? If it is very hard to find a well-scoped term, throwing away this good candidate is silly. But reusing it too much affects the probability space: what if this term actually cannot be typed?

It is quite clear that there is a notion of "fairness" that should determine an exploration order for the state space. For example, if we consider that we draw elements in the cartesian product set term × type, there are known techniques (diagonal traversal, etc.) to enumerate pairs of this set in a "fair" way, instead of, for example, picking a single t first and returning all of {(t,σ) | σ:type}.
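For reference, here is a small OCaml sketch of such a diagonal traversal over two finite arrays of candidates (the function name and the choice of arrays are illustrative only):

```ocaml
(* Enumerate the pairs of xs × ys diagonal by diagonal: the d-th diagonal
   contains the pairs (xs.(i), ys.(d - i)), so no single xs.(i) can
   monopolize the beginning of the enumeration. *)
let diagonal_pairs (xs : 'a array) (ys : 'b array) : ('a * 'b) list =
  let nx = Array.length xs and ny = Array.length ys in
  List.concat_map
    (fun d ->
       List.filter_map
         (fun i ->
            let j = d - i in
            if i < nx && 0 <= j && j < ny then Some (xs.(i), ys.(j)) else None)
         (List.init (d + 1) (fun i -> i)))
    (List.init (max 0 (nx + ny - 1)) (fun d -> d))
```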
One possible way would be, for example, to determine a starting amount of "fuel" for the whole formula. Given N units of fuel, a quantifier (random generator) would pick a number of tries t and (uniformly) sub-fuel amounts n1,...,nt such that the sum of all the sub-fuels is N; it would then draw t different values randomly, and give the i-th subformula the amount of fuel ni to itself draw values and run tests.
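A rough OCaml sketch of that idea follows; all names, as well as the particular fuel-splitting and tries-picking strategies, are made up for illustration:

```ocaml
(* Split [fuel] into [tries] non-negative amounts summing to [fuel], by
   drawing (tries - 1) random cut points and taking consecutive differences. *)
let split_fuel ~tries ~fuel =
  let cuts = List.init (tries - 1) (fun _ -> Random.int (fuel + 1)) in
  let cuts = List.sort compare ((0 :: cuts) @ [fuel]) in
  let rec diffs = function
    | a :: (b :: _ as rest) -> (b - a) :: diffs rest
    | _ -> []
  in
  diffs cuts

(* A (sub-)formula is something we can check with a given fuel budget;
   it returns [false] when a counter-example was found. *)
type prop = fuel:int -> bool

(* Quantifier node: pick a number of tries, split the fuel among them,
   then draw one value per try and check the body with its sub-fuel. *)
let forall (gen : unit -> 'a) (body : 'a -> prop) : prop =
 fun ~fuel ->
  let tries = max 1 (min fuel (1 + Random.int 10)) in
  let subfuels = split_fuel ~tries ~fuel in
  List.for_all (fun n -> body (gen ()) ~fuel:n) subfuels

(* Leaf node: a plain boolean test, which ignores its fuel. *)
let test (p : unit -> bool) : prop = fun ~fuel:_ -> p ()
```

Nested properties such as the well-scoped/well-typed example above would then be written by nesting forall nodes, each level spending a share of the fuel it received.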
Synthesis

It seems unrealistic to capture the full first-order logic as a property description language. The safe choice is the restricted language, with a specific semantics for the precondition implication. A relaxation of the restricted language with nested implications and forall-quantifiers would be nice, but would require non-trivial work.
When I realized the issue, I decided to look into the literature for guidance on the problem. Surely someone had solved it by proposing a good semantics for testing of full first-order formulas? I include my synthesis of what I looked at below, but the short answer is "no, except maybe Quickcheck/Isabelle".
Related work
Functional Testing in the Focal Environment
Matthieu Carlier, Catherine Dubois, 2008
Properties are split into "elementary properties", restricted to a plain structure of the form
forall variables, preconditions => test
Finding Counter Examples in Induction Proofs
Koen Claessen, Hans Svensson, 2008
Domain-specific notion of "adaptation" to generate elements that may be counter-examples (that is, elements likely to satisfy the preconditions but falsify the test).
Relevant quote:
Ki Yung Ahn, Ewen Denney, 2010 (journal version 2013)
An interesting toolchain:
Stefan Berghofer, Tobias Nipkow, 2004
(If you decide to look at one of those articles, pick this one.)
Details a way to generate random generators from inductive type definitions, such that the length of generated values is a uniform random variable -- they were careful about that.
In Isabelle, functions are often defined relationally, so the article details how to use a mode analysis (which assigns an "input" and "output" role to the various arguments of the relation) to generate computable ML code from an inductive predicate.
A surprising and interesting idea is to model logic formulas (those we want to test) also as inductive predicates -- interestingly, Christine Paulin-Mohring is credited for first using this idea in the context of proof assistants. This suggests that free variables in a formula may be given modes, instead of all being considered as inputs.
Traditionally we see a formula F[x,y,z] as meaning "please give me randomly generated x, y and z, and I'll return you True or False": the variables are all inputs, the output is a boolean. But we may also consider the meaning: generate x and y and, assuming that the formula returns False, please give me a corresponding z (if you want, a counter-example of F[x,y,?] = True).

This point of view helps with the "precondition" problem: why is it that we want to give a specific status to the "precond" part of formulas of the form precond[x] => test? When precond[x] is false, we want to re-generate another x, but we don't want to count the draw as a successful test (consider again well-typed[x] => reduces-to-value[x]). Berghofer and Nipkow suggest changing the mode of precond[x] to reflect this point of view: you should consider that x is the output of the query precond[x], which must be satisfied, not its input.

(Remark: Nipkow later improved over this Quickcheck-in-Isabelle by designing Nitpick, which uses a sophisticated model finder to generate counter-examples instead of random generation. I think the Nitpick paper ("Nitpick: A Counterexample Generator for Higher-Order Logic Based on a Relational Model Finder", by Jasmin Christian Blanchette and Tobias Nipkow, 2010) is however less relevant to Articheck as it exists today, as it is mainly about massaging higher-order logic into a form that a SAT solver can understand -- including coinductives, etc.)
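To make the mode idea concrete outside of Isabelle, here is a toy OCaml contrast; gen_even and the "is even" precondition are made-up examples, not taken from the paper:

```ocaml
(* Input mode: x is an input of precond; draw blindly, then filter.
   When the precondition is rarely satisfied, most draws are wasted. *)
let rec draw_satisfying (gen : unit -> 'a) (precond : 'a -> bool) : 'a =
  let x = gen () in
  if precond x then x else draw_satisfying gen precond

(* Output mode: x is an output of the query precond[x]; the generator
   itself only produces satisfying values, e.g. for precond = "is even",
   generate even numbers directly instead of filtering odd ones out. *)
let gen_even () = 2 * Random.int 1000
```

Generating directly from the precondition is, roughly, what the mode analysis buys when the precondition is defined as an inductive predicate.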