Shinmera opened this issue 4 years ago
Added largescale report to address point 1.
Ah! I came here to create this issue :P
I'll come back to add details, ideas and a plan.
Ok, quick look at the README. I think a new type of test makes sense (I'll confirm once I've looked at the implementation.)
Here's the current behaviour, to help highlight what needs improvement:
(defpackage #:randomized-testing
  (:documentation "Scratch file to explore randomized tests with parachute")
  (:use #:cl)
  (:import-from #:parachute
                #:define-test
                #:define-test+run
                #:is
                #:true
                #:false))
(in-package #:randomized-testing)
;; Define a dummy function to test
(defun buggy (x)
  "Like cl:identity, but returns the wrong value for 2."
  (if (= 2 x) 3 x))
;; Define a dummy PRNG
(defun make-fake-prng ()
  (let ((x 0))
    (lambda ()
      (incf x))))
;; Here's what a randomized test would look like _with the current implementation_
(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (is = x (buggy x)))))
;; Here's what the output looks like:
#|
? RANDOMIZED-TESTING::BUGGY
0.000 ✔ (is = x (buggy x))
0.000 ✘ (is = x (buggy x))
0.000 ✔ (is = x (buggy x))
0.000 ✔ (is = x (buggy x))
0.000 ✔ (is = x (buggy x)) ; These are not useful
0.000 ✔ (is = x (buggy x)) ; They are not descriptive either
0.000 ✔ (is = x (buggy x))
0.000 ✔ (is = x (buggy x))
0.000 ✔ (is = x (buggy x))
0.000 ✔ (is = x (buggy x))
0.004 ✘ RANDOMIZED-TESTING::BUGGY
;; Summary:
Passed: 9 ; Should we count them differently?
Failed: 1
Skipped: 0
;; Failures:
1/ 10 tests failed in RANDOMIZED-TESTING::BUGGY ; What about this count?
The test form (buggy x) ; For more complex tests, this would not be
evaluated to 3 ; enough to know which inputs failed the test
when 2
was expected to be equal under =.
#<PARACHUTE:PLAIN 11, FAILED results>
((IS = X (BUGGY X)))
|#
The user can improve the situation a bit by supplying a description to the test result:
(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
-     :do (is = x (buggy x)))))
+     :do (is = x (buggy x) "Failed with input ~a" x))))
Then the output would be slightly improved:
[...]
;; Summary:
Passed: 9
Failed: 1
Skipped: 0
;; Failures:
1/ 10 tests failed in RANDOMIZED-TESTING::BUGGY
The test form (buggy x)
evaluated to 3
when 2
was expected to be equal under =.
+Failed with input 2
[...]
Edit based on Shinmera's comment about uax-9
If we expand the is macro, the code looks like this:
(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (eval-in-context *context*
                           (make-instance 'comparison-result
                                          :expression '(is = x (buggy x))
                                          :value-form '(buggy x)
                                          :body (lambda () (buggy x))
                                          :expected x
                                          :comparison '=)))))
And if we tweak it just a little bit:
(define-test+run buggy
  (let ((prng (make-fake-prng)))
    (loop
      :repeat 10 ; we run the randomized test 10 times
      :for x = (funcall prng)
      :do (eval-in-context *context*
                           (make-instance 'comparison-result
-                                         :expression '(is = x (buggy x))
-                                         :value-form '(buggy x)
+                                         :expression `(is = x (buggy ,x))
+                                         :value-form `(buggy ,x)
                                          :body (lambda () (buggy x))
                                          :expected x
                                          :comparison '=)))))
We get a much more useful output:
? RANDOMIZED-TESTING::BUGGY
0.000 ✔ (is = x (buggy 1))
0.000 ✘ (is = x (buggy 2))
0.000 ✔ (is = x (buggy 3))
0.000 ✔ (is = x (buggy 4))
0.000 ✔ (is = x (buggy 5))
0.000 ✔ (is = x (buggy 6))
0.000 ✔ (is = x (buggy 7))
0.000 ✔ (is = x (buggy 8))
0.000 ✔ (is = x (buggy 9))
0.000 ✔ (is = x (buggy 10))
0.004 ✘ RANDOMIZED-TESTING::BUGGY
;; Summary:
Passed: 9
Failed: 1
Skipped: 0
;; Failures:
1/ 10 tests failed in RANDOMIZED-TESTING::BUGGY
The test form (buggy 2)
evaluated to 3
when 2
was expected to be equal under =.
Still too verbose for now.
In uax-9 I manually construct test results to avoid the eager capture of the default macros: https://github.com/Shinmera/uax-9/blob/master/test.lisp#L74
This leads to much better reports, but is obviously quite unwieldy.
My current thought is that we need some way to communicate variable parts in a test form to the test macro, so that it knows which parts to take from the runtime environment, and which parts to take from the expression. That would at least solve the reports issue. How exactly that could be worked out, I'm not sure. It is very likely though that we'll need separate analogues of is, etc. for this purpose.
Another thought: for instance, I would like to randomly generate trees, then perform a random sequence of interactions on the tree, checking an invariant at each step. Ideally the test report on failure would say something like:
The test form (is-consistent tree)
evaluated to NIL
when T
was expected to be equal under geq
after operation (frob 5 tree)
with arguments
5
(a b (c d (e)) (((h))))
returned
(a b c d . e)
Meaning: it captures the i/o on each operation, and then stores them in the failure report so we know how to replicate the results.
In cases where global state is modified we might also need to instruct it to capture that global state, too.
Generalising this: it seems useful to be able to attach some contextual operation to another test form. Maybe something like (true* (is-consistent tree) :after (frob 5 tree)), which could be generated from a macro like (is-invariant (is-consistent tree) (frob 1 tree) (frob 2 tree) ...)
To avoid macroising it too much, how about a call-context storage object which keeps the form, arguments, and result, and a macro (capture-call-context form context) to capture that data into such an object, which can then be passed as an argument to a normal result construction.
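For illustration, here is a minimal sketch of what such a call-context object and capture macro might look like. Everything below is hypothetical (the class, its slots, and the macro are my own reading of the suggestion above, not existing Parachute code):

```lisp
;; Hypothetical sketch -- none of this exists in Parachute.
(defclass call-context ()
  ((form :initarg :form :reader call-form)             ; the literal operation form
   (arguments :initarg :arguments :reader call-arguments) ; evaluated argument values
   (result :initarg :result :reader call-result)))     ; what the call returned

(defmacro capture-call-context ((op &rest args) context-var)
  "Evaluate (OP ARGS...), store form, argument values, and result in CONTEXT-VAR."
  (let ((arg-vars (loop repeat (length args) collect (gensym "ARG")))
        (res (gensym "RESULT")))
    `(let* (,@(mapcar #'list arg-vars args)
            (,res (,op ,@arg-vars)))
       (setf ,context-var
             (make-instance 'call-context
                            :form '(,op ,@args)
                            :arguments (list ,@arg-vars)
                            :result ,res))
       ,res)))
```

A result constructor could then accept such an object, e.g. (let (ctx) (capture-call-context (frob 5 tree) ctx) ...), and the reporter could print the captured form, arguments, and result on failure.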
Again, I'll come back later to add details, but I was able to hack something up (as in "it works, but the code is very horrible and I didn't think about all the implications yet"):
(define-test+run buggy
  (randomize ((x (gen-integer 0 3)))
    (is = x (identity x))
    (is = x (buggy x))))
And it outputs something like this:
;; Summary:
Passed: 0
Failed: 1
Skipped: 0
;; Failures:
1/ 1 tests failed in RANDOMIZED-TESTING::BUGGY
Randomized test failed after 11 passed and 0 skipped tests.
Seed: 16819773938075529869
With
X = 2
The test form (buggy x)
evaluated to 3
when 2
was expected to be equal under =.
Note to myself: also print the failed result's expression.
Here are the things I've tried:
breeze/scratch-files/randomized-testing.lisp
It's low quality, because I just wanted to try anything and because I needed to learn a bit more about how parachute works. Lastly, the code is in my project "breeze" because the whole point of this project is to try stuff.
So, I would like some feedback before I try making something cleaner.
Oh! I also thought about the DSL for generating more complex data (like trees), and I think it might not be that hard. Using a function that looks like an eval, but that returns a function that takes a PRNG as input... Something along those lines.
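One way to read that idea, as a sketch under my own assumptions (these combinators and their calling convention are not the prototype's code): a generator is just a function from a PRNG to a value, and combinators compose generators into bigger ones.

```lisp
;; Sketch: generators as functions of a PRNG (hypothetical signatures).
(defun gen-integer (low high)
  "Return a generator: a function of a PRNG producing integers in [LOW, HIGH]."
  (lambda (prng)
    (+ low (mod (funcall prng) (1+ (- high low))))))

(defun gen-list-of (element-gen length-gen)
  "Return a generator producing a list whose length and elements
are themselves drawn from generators."
  (lambda (prng)
    (loop repeat (funcall length-gen prng)
          collect (funcall element-gen prng))))
```

Usage would look like (funcall (gen-list-of (gen-integer 0 9) (gen-integer 1 5)) prng); tree generators could be built the same way, recursing on a depth generator.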
Sorry, I'm having trouble gleaning your thinking from the lisp file, so I'm not entirely sure what, concretely, you're proposing. Let me try to outline my current understanding, and please correct me if I'm wrong or missing something:
We want to be able to do three things:
I'm not talking about data at all here, as 3. seems like a much more general and powerful concept than explicitly generating data structures. If we can support randomising operations, we can use that to generate the random data, too, and also test protocols that don't primarily rely on data structures, or rely on data structures that are highly private and annoying to construct via anything but these operations we want to test, anyhow.
I think 3. could also be left as a separate issue, as with 1. we can also leave it up to the user to decide how to "generate" the operations, as long as the system can take care of capturing the context we care about to analyse the failure.
Finally, 2. should be rather trivial and more a question of taste. I can see either approach working, the latter probably just with some kinda flag to mark results as "unimportant", which are then skipped in the report.
Sorry, I'm having trouble gleaning your thinking from the lisp file
That's totally fine and I was pretty much expecting it.
We're on the same page about the 3 points.
But I think there's one thing we might not be exactly on the same page about: I'm trying to implement 2 kinds of randomized tests, PBT and MBT (more on that in a bit), whereas you seem (to me) to be more focused on the MBT kind. I'm also trying to make sure that everything is reproducible by using PRNGs that can be seeded, and trying to make sure that it's going to be possible to add shrinking later on.
PBT was popularized by a Haskell library named QuickCheck, and now there are "quick/fast/rapid" test libraries everywhere xD (see the Wikipedia page). I'm trying to use the same vocabulary where it makes sense.
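As a concrete illustration of the reproducibility point above: a seedable PRNG can be as simple as a linear congruential generator, where replaying the same seed replays the exact same test inputs. This is only a sketch (the prototype's actual PRNG is not shown in this thread):

```lisp
(defun make-seeded-prng (seed)
  "Return a deterministic PRNG closure; the same SEED replays the same sequence.
Uses the Numerical Recipes 32-bit LCG parameters."
  (let ((state seed))
    (lambda ()
      (setf state (mod (+ (* 1664525 state) 1013904223) (expt 2 32))))))
```

Printing the seed in the failure report (as in the output above) is then enough to reproduce a failing run exactly.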
I tried two things to help with reporting:
1. Extending the is macro to recognize some special syntax, and generate several comparison-result instances with the right expression.
2. Wrapping the check macros (is, true, etc.) with a macro that overrides *parent* and *context*.
For the second approach, I added:
- a randomized-context that prevents tests from being reported (to reduce the noise);
- a randomized-result that keeps track of the actual number of test results; when reporting a randomized-result, it prints out the seed and the generated values, and then prints the actual test result (a comparison-result in my prototype).
Furthermore, the macro I use to rebind *context* and *parent* looks like a let:
(randomize ((x (gen-integer 0 3)))
  (is = x (identity x)))
It expands into a loop that generates the data for the tests.
Implementation aside, I think I quite like this syntax. What about you?
Almost forgot the thing I like the most about it: it's easy to modify existing tests to use generated data.
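For illustration only, here is roughly what such a randomize expansion could look like. Every name below (make-seeded-prng, *randomize-repetitions*, and gen-integer's generator-returning convention) is my assumption, not the prototype's actual expansion:

```lisp
;; Hypothetical expansion sketch of:
;;   (randomize ((x (gen-integer 0 3)))
;;     (is = x (identity x)))
(let* ((seed (random (expt 2 64)))       ; recorded so a failing run can be replayed
       (prng (make-seeded-prng seed)))   ; assumed seedable PRNG constructor
  (loop repeat *randomize-repetitions*   ; assumed special variable for the run count
        for x = (funcall (gen-integer 0 3) prng)
        do (is = x (identity x))))       ; body re-run with each generated binding
```

The real macro would additionally rebind *context* and *parent* around the body so the per-iteration results are collected into a single randomized-result, as described above.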
For model-based testing, I did a very quick and very, very dirty prototype.
The system under test is a list ([...] in the code...). The operations are:
- push, always applicable
- pop, only applicable when the length of the list is positive
- length, always applicable
Finally, there's a loop that [...] (e.g. the length of the list should be = to the model).
Some questions:
- Should we count the randomized-results as only 1, even though it actually ran many more?
- [...] randomized-result and randomized-context? Is that the right direction, in your opinion?
- [...] randomized-result and randomized-context. Do you see anything wrong with that?
I have more, but let's start with that.
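The loop described above can be sketched roughly like this (hypothetical code under my own assumptions; the prototype itself is not shown in this thread):

```lisp
;; Model-based testing sketch: a CL list tested against a simple model
;; that tracks the expected length.
(defun run-model-test (prng steps)
  (let ((subject '())  ; system under test: a plain list
        (model 0))     ; model: the length we expect the list to have
    (dotimes (i steps)
      (case (mod (funcall prng) 3)
        (0 (push (funcall prng) subject)   ; push: always applicable
           (incf model))
        (1 (when (plusp (length subject))  ; pop: only applicable when non-empty
             (pop subject)
             (decf model)))
        (2 nil))                           ; length: exercised by the check below
      ;; invariant: the length of the list should be = to the model
      (assert (= (length subject) model)))))
```

In the real thing, the assert would be a Parachute check, and each operation's form, arguments, and result would be captured for the failure report, as discussed earlier.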
Here's my high-level plan:
I swear I didn't mean to write a novel, but I just thought about something (that is super-evident): I could split the randomized macro in 2:
Both parts would be bundled together in one randomized macro.
P.S. I didn't re-read all of this... Knowing myself, it's probably full of typos and weird syntax. I'll try to re-read it later when I have slept :sweat_smile:
Good night!
More food for thought: https://lobste.rs/s/xk9mhh/run_property_tests_until_coverage_stops
Parachute should better support testing where the tests are generated at runtime (possibly randomly). In particular, this would involve: