Shinmera / parachute

An extensible and cross-compatible testing framework.
https://shinmera.github.io/parachute
zlib License

Improve generated and random testing support #18

Open Shinmera opened 4 years ago

Shinmera commented 4 years ago

Parachute should better support tests that are generated at runtime (possibly randomly). In particular, this would involve:

Shinmera commented 1 year ago

Added largescale report to address point 1.
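
Presumably it is selected like the other report classes, e.g.:

(parachute:test :my-package :report 'parachute:largescale)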

fstamour commented 1 year ago

Ah! I came here to create this issue :P

I'll come back to add details, ideas and a plan.


OK, I took a quick look at the readme. I think a new type of test makes sense (I'll confirm once I've looked at the implementation).

fstamour commented 1 year ago

Here's the current behaviour, to help highlight what needs improvement:

(defpackage #:randomized-testing
  (:documentation "Scratch file to explore randomized tests with parachute")
  (:use #:cl)
  (:import-from #:parachute
                #:define-test
                #:define-test+run
                #:is
                #:true
                #:false))

(in-package #:randomized-testing)

;; Define a dummy function to test
(defun buggy (x)
  "Like cl:identity, but returns the wrong value for 2."
  (if (= 2 x) 3 x))

;; Define a dummy PRNG
(defun make-fake-prng ()
  (let ((x 0))
    (lambda ()
      (incf x))))

;; This is what a randomized test would look like _with the current implementation_
(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10                 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (is = x (buggy x)))))

;; Here's what the output looks like:

#|
? RANDOMIZED-TESTING::BUGGY
0.000 ✔   (is = x (buggy x))
0.000 ✘   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))         ; These are not useful
0.000 ✔   (is = x (buggy x))         ; They are not descriptive either
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.000 ✔   (is = x (buggy x))
0.004 ✘ RANDOMIZED-TESTING::BUGGY

;; Summary:
Passed:     9                        ;  Should we count them differently?
Failed:     1
Skipped:    0

;; Failures:
1/  10 tests failed in RANDOMIZED-TESTING::BUGGY  ; What about this count?
The test form   (buggy x)            ; For more complex tests, this would not be
evaluated to    3                    ; enough to know which inputs failed the test
when            2
was expected to be equal under =.

#<PARACHUTE:PLAIN 11, FAILED results>
((IS = X (BUGGY X)))
|#

The user can improve the situation a bit by supplying a description for the test result:

(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
-     :do (is = x (buggy x)))))
+     :do (is = x (buggy x) "Failed with input ~a" x))))

Then the output would be slightly improved:

[...]

;; Summary:
Passed:     9
Failed:     1
Skipped:    0

;; Failures:
1/  10 tests failed in RANDOMIZED-TESTING::BUGGY
The test form   (buggy x)
evaluated to    3
when            2
was expected to be equal under =.
+Failed with input 2

[...]

Edit based on Shinmera's comment about uax-9

If we expand the is macro, the code looks like this:

(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (eval-in-context *context*
                 (make-instance 'comparison-result
                                :expression '(is = x (buggy x))
                                :value-form '(buggy x)
                                :body (lambda () (buggy x))
                                :expected x
                                :comparison '=)))))

And if we tweak it just a little bit:

(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (eval-in-context *context*
                 (make-instance 'comparison-result
-                               :expression '(is = x (buggy x))
-                               :value-form '(buggy x)
+                               :expression `(is = x (buggy ,x))
+                               :value-form `(buggy ,x)
                                :body (lambda () (buggy x))
                                :expected x
                                :comparison '=)))))

We get much more useful output:

? RANDOMIZED-TESTING::BUGGY
  0.000 ✔   (is = x (buggy 1))
  0.000 ✘   (is = x (buggy 2))
  0.000 ✔   (is = x (buggy 3))
  0.000 ✔   (is = x (buggy 4))
  0.000 ✔   (is = x (buggy 5))
  0.000 ✔   (is = x (buggy 6))
  0.000 ✔   (is = x (buggy 7))
  0.000 ✔   (is = x (buggy 8))
  0.000 ✔   (is = x (buggy 9))
  0.000 ✔   (is = x (buggy 10))
  0.004 ✘ RANDOMIZED-TESTING::BUGGY

;; Summary:
Passed:     9
Failed:     1
Skipped:    0

;; Failures:
   1/  10 tests failed in RANDOMIZED-TESTING::BUGGY
The test form   (buggy 2)
evaluated to    3
when            2
was expected to be equal under =.

Still too verbose for now.

Shinmera commented 1 year ago

In uax-9 I manually construct test results to avoid the eager capture of the default macros: https://github.com/Shinmera/uax-9/blob/master/test.lisp#L74

This leads to much better reports, but is obviously quite unwieldy.

Shinmera commented 1 year ago

My current thought is that we need some way to communicate variable parts in a test form to the test macro, so that it knows which parts to take from the runtime environment, and which parts to take from the expression. That would at least solve the reports issue. How exactly that could be worked out, I'm not sure. It is very likely though that we'll need separate analogues of is, etc. for this purpose.
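
A minimal sketch of what such an analogue could look like, reusing the manual comparison-result construction from the expansion above; the is* name, its argument convention, and the sublis trick are all hypothetical:

(defmacro is* ((&rest variables) comp expected form)
  "Like IS, but substitutes the runtime values of VARIABLES into the
reported expression, so failure reports show the concrete inputs."
  (let ((alist `(list ,@(loop :for v :in variables
                              :collect `(cons ',v ,v)))))
    `(eval-in-context
      *context*
      (make-instance 'comparison-result
                     ;; Replace each variable symbol with its runtime
                     ;; value in the quoted forms used for reporting.
                     :expression (sublis ,alist '(is ,comp ,expected ,form))
                     :value-form (sublis ,alist ',form)
                     :body (lambda () ,form)
                     :expected ,expected
                     :comparison ',comp))))

;; The earlier loop would then read
;;   :do (is* (x) = x (buggy x))
;; and report (is = 2 (buggy 2)) instead of (is = x (buggy x)).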

Shinmera commented 1 year ago

Another thought: for instance, I would like to randomly generate trees, then perform a random sequence of interactions on the tree, checking an invariant at each step. Ideally the test report on failure would say something like:

The test form   (is-consistent tree)
evaluated to    NIL
when            T
was expected to be equal under geq
after operation (frob 5 tree)
with arguments
  5
  (a b (c d (e)) (((h))))
returned
  (a b c d . e)

Meaning: it captures the inputs and outputs of each operation, and then stores them in the failure report so we know how to replicate the results.

In cases where global state is modified we might also need to instruct it to capture that global state, too.

Shinmera commented 1 year ago

Generalising this: it seems useful to be able to attach some contextual operation to another test form. Maybe something like (true* (is-consistent tree) :after (frob 5 tree)) which could be generated from a macro like (is-invariant (is-consistent tree) (frob 1 tree) (frob 2 tree) ...)
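
A sketch of how such an is-invariant macro could expand into those true* forms (both names are hypothetical, taken from the comment above):

(defmacro is-invariant (invariant &rest operations)
  "Check INVARIANT after each of OPERATIONS, attaching the
operation to the result as context."
  `(progn
     ,@(loop :for op :in operations
             :collect `(true* ,invariant :after ,op))))

;; (is-invariant (is-consistent tree) (frob 1 tree) (frob 2 tree))
;; would expand into
;; (progn
;;   (true* (is-consistent tree) :after (frob 1 tree))
;;   (true* (is-consistent tree) :after (frob 2 tree)))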

Shinmera commented 1 year ago

To avoid macroising it too much, how about a call-context storage object which keeps the form, arguments, and result, and a macro (capture-call-context form context) to capture that data into such an object, which can then be passed as an argument to a normal result construction.
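
A rough sketch of that shape; the class and macro names come from the comment above, everything else is made up:

(defclass call-context ()
  ((form :initarg :form :reader call-context-form)
   (arguments :initarg :arguments :reader call-context-arguments)
   (result :initarg :result :reader call-context-result)))

(defmacro capture-call-context ((op &rest args) context)
  "Evaluate (OP . ARGS), storing the original form, the evaluated
arguments, and the result into the place CONTEXT, then return the result."
  (let ((arg-vars (loop :for a :in args :collect (gensym "ARG")))
        (result (gensym "RESULT")))
    `(let* (,@(mapcar #'list arg-vars args)
            (,result (,op ,@arg-vars)))
       (setf ,context (make-instance 'call-context
                                     :form '(,op ,@args)
                                     :arguments (list ,@arg-vars)
                                     :result ,result))
       ,result)))

The captured object could then be passed as an extra argument when constructing a result, and printed in the failure report like the example report above.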

fstamour commented 1 year ago

Again, I'll come back later to add details, but I was able to hack something up (as in "it works, but the code is very horrible and I didn't think about all the implications yet"):

(define-test+run buggy
  (randomize ((x (gen-integer 0 3)))
    (is = x (identity x))
    (is = x (buggy x))))

And it outputs something like this:

;; Summary:
Passed:     0
Failed:     1
Skipped:    0

;; Failures:
   1/   1 tests failed in RANDOMIZED-TESTING::BUGGY
Randomized test failed after 11 passed and 0 skipped tests.
Seed: 16819773938075529869
With 
  X = 2
The test form   (buggy x)
evaluated to    3
when            2
was expected to be equal under =.

Note to myself: also print the failed result's expression.

fstamour commented 1 year ago

Here are the things I've tried:

breeze/scratch-files/randomized-testing.lisp

It's low quality, because I just wanted to try things out and because I needed to learn a bit more about how parachute works. The code is in my project "breeze" because the whole point of that project is to try stuff.

So, I would like some feedback before I try making something cleaner.

Oh! I also thought about the DSL for generating more complex data (like trees), and I think it might not be that hard: use a function that looks like an eval, but that returns a function that takes a PRNG as input... something along those lines.
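
A sketch of that idea, where a "generator" is a function of a PRNG and an eval-like function compiles a data description into one (hypothetical throughout; gen-integer matches the one used in the randomize example above):

;; A generator is a function that takes a PRNG and returns a value.
(defun gen-integer (low high)
  (lambda (prng)
    (+ low (mod (funcall prng) (1+ (- high low))))))

(defun gen-list (element-gen length-gen)
  (lambda (prng)
    (loop :repeat (funcall length-gen prng)
          :collect (funcall element-gen prng))))

;; The eval-like part: compile a data description into a generator.
(defun compile-generator (spec)
  (if (atom spec)
      (constantly spec)
      (ecase (first spec)
        (integer (apply #'gen-integer (rest spec)))
        (list (gen-list (compile-generator (second spec))
                        (compile-generator (third spec)))))))

;; (funcall (compile-generator '(list (integer 0 9) (integer 1 5)))
;;          (make-fake-prng))
;; => (2 3)

Nesting list specs would give trees.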

Shinmera commented 1 year ago

Sorry, I'm having trouble gleaning your thinking from the lisp file, so I'm not entirely sure what, concretely, you're proposing. Let me try to outline my current understanding, and please correct me if I'm wrong or missing something:

We want to be able to do three things:

  1. Provide context to a result, so that the report can tell us more about what was going on leading up to the failing test form. It should be convenient for the user to capture this "context", whether that be a single prior call, or a trace/sequence of calls leading up to the failure.
  2. Remove noise from the report somehow, either by not recording certain results at all, or by just not printing them when they're not relevant.
  3. Provide some way to generate randomised operations.

I'm not talking about data at all here, as 3. seems like a much more general and powerful concept than explicitly generating data structures. If we can support randomising operations, we can use that to generate the random data, too, and also test protocols that don't primarily rely on data structures, or rely on data structures that are highly private and annoying to construct via anything but these operations we want to test, anyhow.

I think 3. could also be left as a separate issue, as with 1. we can also leave it up to the user to decide how to "generate" the operations, as long as the system can take care of capturing the context we care about to analyse the failure.

Finally, 2. should be rather trivial and more a question of taste. I can see either approach working, the latter probably just with some kinda flag to mark results as "unimportant", which are then skipped in the report.
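
For the flag variant, the marker itself could be as small as a mixin; the filtering would then live in the report (hypothetical, no such class exists in parachute):

(defclass unimportant-mixin () ()
  (:documentation "Mark a result so reports may skip it when it passes."))

A report would then check (typep result 'unimportant-mixin) before printing a passing result.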

fstamour commented 1 year ago

Sorry, I'm having trouble gleaning your thinking from the lisp file

That's totally fine and I was pretty much expecting it.

We're on the same page about the 3 points.

But I think there's one thing where we might not be exactly on the same page: I'm trying to implement two kinds of randomized tests, PBT and MBT (more on that in a bit), whereas you seem (to me) to be more focused on the MBT kind. I'm also trying to make sure that everything is reproducible by using PRNGs that can be seeded, and that it will be possible to add shrinking later on.

Some background info

PBT (property-based testing) was popularized by a Haskell library named QuickCheck, and now there are "quick/fast/rapid" test libraries everywhere xD (see the Wikipedia page). I'm trying to use the same vocabulary where it makes sense.

What I tried so far

Reporting and PBT

I tried two things to help with reporting:

Furthermore, the macro I use to rebind the *context* and *parent* looks like a let:

(randomize ((x (gen-integer 0 3)))
  (is = x (identity x)))

It expands into a loop that generates the data for the tests.
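
Roughly the shape of that expansion (heavily simplified and hypothetical; the actual prototype also rebinds *context* and *parent* and records the seed):

(let ((prng (make-fake-prng))   ; really: a seedable PRNG
      (gen (gen-integer 0 3)))
  (loop :repeat 100             ; some configurable number of runs
        :for x = (funcall gen prng)
        :do (is = x (identity x))))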

Implementation aside, I think I quite like this syntax. What about you?

I almost forgot the thing I like most about it: it's easy to modify existing tests to use generated data.

MBT

For model-based testing, I did a very quick and very, very dirty prototype.

Finally, there's a loop that

My questions so far

  1. What do you think about showing the randomized-result as only one result, even though it actually ran many more tests?
  2. What do you think about stopping each randomized test on the first failure?
  3. What do you think about the shenanigans I did with the randomized-result and randomized-context? Is that the right direction, in your opinion?
  4. I think I should be able to get away with only one class that does the job of both randomized-result and randomized-context. Do you see anything wrong with that?

I have more, but let's start with that.

Roadmap

Here's my high-level plan:

  1. Make the reporting not horrible when generating thousands of tests
  2. Make it easy for people to generate data
  3. Make it easy to do MBT

I swear I didn't mean to write a novel, but I just thought of something (that is probably obvious): I could split the randomized macro in two:

Both parts would be bundled together in one randomized macro.

P.S. I didn't re-read all of this... Knowing myself, it's probably full of typos and weird syntax. I'll try to re-read it later when I have slept :sweat_smile:

Good night!

fstamour commented 12 months ago

More food for thought: https://lobste.rs/s/xk9mhh/run_property_tests_until_coverage_stops