Currently in NEAT tests we use De Bruijn indices to represent variables. This has the advantage of playing well with the whole lazy-search machinery; however, we lose some coverage because of it, since we're not testing name shadowing.
Can we instead generate terms with regular TyNames and Names (modulo alpha)?
Should we test open terms too? E.g. type normalization has to handle them.
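For example (readable syntax, hypothetical term), a shadowing term like the one below can only be produced if binders carry real names; with De Bruijn indices both binders are anonymous and the body is just an index, so shadowing is never exercised:

    \(x : integer) -> \(x : bool) -> x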
Currently we only generate a single constant per built-in type in NEAT tests. An excerpt from a previous discussion (as a code snippet, because Jira hasn't yet reached the technological maturity required for handling the advanced concept of copy-pasting):
Here's how the stats look on master currently, with no changes:

        181800
      typed CK vs untyped CEK produce the same output:
        181800
And here's how the stats look if

    convertTermConstant (TmIntegerG i) = Some $ ValueOf DefaultUniInteger i

is changed to

    convertTermConstant (TmIntegerG i) = i `seq` (Some $ ValueOf DefaultUniInteger i)

        189908
      typed CK vs untyped CEK produce the same output:
        189908
+8k tests! And the same happens if we make TmIntegerG strict:

    TmIntegerG !Integer
So how many integer constants do we actually generate on master? Only one constant: 0. Here's how to check that: changing

    check (TyBuiltinG TyIntegerG) (TmIntegerG _) = true

to

    check (TyBuiltinG TyIntegerG) (TmIntegerG i) = toCool $ i == 0

gives us exactly the same number of generated tests as with no changes on master at all:

        181800
      typed CK vs untyped CEK produce the same output:
        181800
And making it

    check (TyBuiltinG TyIntegerG) (TmIntegerG i) = toCool $ i > 2^32

so that integers don't get generated at all gives us this:

        178294
      typed CK vs untyped CEK produce the same output: OK (21.48s)
        178294
By the way, making TmStringG strict results in

    Exception: Cannot decode byte '\x80': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

i.e. the generator for Text is not correct, which can be fixed by making it

    enumerate = share $ fmap Text.pack access
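For reference, here is a hedged sketch of what the fixed instance could look like, assuming the Control.Enumerable interface (share, access) that the snippet above already uses; enumerating a String and packing it cannot produce an invalid UTF-8 stream, unlike enumerating the underlying bytes directly:

    import Control.Enumerable (Enumerable (..), access, share)
    import Data.Text (Text)
    import qualified Data.Text as Text

    -- Enumerate Strings and pack them, so every generated value is valid UTF-8.
    instance Enumerable Text where
        enumerate = share $ fmap Text.pack access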
Finally, making all constructors of TermConstantG strict gives us the following stats:

        193007
      typed CK vs untyped CEK produce the same output:
        193007
So, summarizing:
The AST of types used in the NEAT generators represents type application like this:

    data TypeG n
        = <...>
        | TyAppG (TypeG n) (TypeG n) (Kind ())
This is so that no inference needs to happen: the generated type only gets checked against a kind.
Generating those Kinds in TyAppG is most likely cheap; however, not generating them would be a step towards resolving the "Can we use our regular AST for NEAT tests? Should we?" part below, and that latter change is expected to give us great performance benefits.
Would something like this (pseudocode)

    inferKind (TyAppG fun arg) = do
        domToCod <- inferKind fun
        case domToCod of
            KindArrow () dom cod -> cod <$ checkKind arg dom
            _                    -> throwError "blah"

    checkKind (TyAppG fun arg) cod = do
        dom <- inferKind arg
        checkKind fun $ KindArrow () dom cod

work?
Should we throw errors impurely for better laziness and more eager pruning, and then force the nodes of the inferred kind/type in a check function and catch those thrown errors?
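For illustration, here is a toy, self-contained sketch of the impure-errors idea (not the NEAT code): the "inference" fails via a pure call to error buried deep inside its lazily produced result, and the check forces the result and converts the thrown exception back into a Bool, so a failure is only paid for when the offending node is actually demanded:

    import Control.Exception (ErrorCall (..), evaluate, try)

    -- Toy "inference": the failure is hidden inside the lazily produced result.
    inferToy :: Int -> [Int]
    inferToy n = [ if k == 3 then error "blah" else k | k <- [0 .. n] ]

    -- The check forces the result and turns a thrown error back into False.
    checksToy :: Int -> IO Bool
    checksToy n =
        either (\(ErrorCall _) -> False) (const True)
            <$> try (evaluate (sum (inferToy n)))

In the NEAT setting the thrown value would presumably be a dedicated exception type rather than ErrorCall, and the forcing and catching would live in the top-level check function.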
Currently we have an entire new AST of the whole Plutus Core language just so that we can run the NEAT tests over it, plus the separate type normalizer, type checker, etc. that come with it. Is this really necessary?
There's a good reason why the NEAT machinery suggests you should do it this way: say you generate an application f x; then you don't want to enumerate all possible fs if it's impossible to generate even a single suitable x. Instead you want to do a BFS: first ensure there's at least one suitable f, then ensure there's at least one suitable x, and only then proceed to generating all suitable pairs.
But in our case there’s always a type of a given kind and a term of a given type. So are we actually saving anything by using a BFS-compatible AST? Well, we can check it quickly:
These are the stats with the OF strategy (optimal short-circuiting and fairness):

    NEAT
      normalization commutes with conversion from generated types: OK (9.40s)
        248588
      normal types cannot reduce: OK (5.90s)
        602127
      type preservation - CK: OK (14.80s)
        181800
      typed CK vs untyped CEK produce the same output: OK (15.78s)
        181800
These are the stats with the D strategy (sequential):

    NEAT
      normalization commutes with conversion from generated types: OK (8.95s)
        248588
      normal types cannot reduce: OK (5.51s)
        602127
      type preservation - CK: OK (147.08s)
        181800
      typed CK vs untyped CEK produce the same output: OK (149.52s)
        181800
We can see that there’s no difference when it comes to testing type-level stuff and there’s a whopping 10x difference when it comes to testing term-level stuff.
Is this accidental or inherent? Can we make D perform as well as OF? If so, should we bother? OF is more flexible, perhaps we shouldn’t drive ourselves into a corner? On the other hand if we could set up NEAT tests to work with our normal AST, that would be a huge simplification of the machinery and it would automatically speed everything up, because our normal type checker is fairly efficient.
Can we use Arbitrary with NEAT tests for generating constants? Should we?

The NEAT machinery insists on exhaustively generating all values satisfying a predicate, but do we really want to enumerate all the constants of built-in types? From previous experiments it's clear that generating all the constants, as opposed to generating only one per built-in type, gives us +6.2% in the number of total test cases, which I think is fine. However, James ran into some problems when adding support for lists and pairs (the number of test cases blowing up), so maybe we should consider somehow integrating QuickCheck's Arbitrary into our NEAT tests? Although I'm not sure what problems James ran into; maybe it wasn't about constants, but rather type checking or whatever. In any case we should consider using Arbitrary within NEAT, because I doubt we need to exhaustively traverse the whole space of constants of built-in types.
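As a rough illustration of the idea (hypothetical code, not the existing NEAT API): NEAT would keep enumerating which built-in type a constant has, while the value itself is sampled by QuickCheck, so the space of, say, Integers is sampled rather than exhaustively traversed:

    import Test.QuickCheck (Gen, arbitrary)

    -- Simplified stand-in for NEAT's constant type (only the integer case).
    newtype TermConstantG = TmIntegerG Integer
        deriving (Show)

    -- The shape comes from the NEAT enumeration; the value comes from QuickCheck.
    fillConstant :: TermConstantG -> Gen TermConstantG
    fillConstant (TmIntegerG _) = TmIntegerG <$> arbitrary

Whether this interacts sensibly with the lazy-search machinery (which expects to enumerate, not sample) is exactly the open question.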
The NEAT generators are not very optimized, but is it a bottleneck? Maybe getting the inputs takes a fraction of the time that we spend on running the actual properties. We should investigate that.
Currently we only generate terms of type Unit in NEAT tests, which restricts the shapes of the terms that can be generated at all. In particular, no built-in function returns a Unit, so I guess most terms that we get are of these three forms (in the "readable" syntax):
    (\(x : <some_type>) -> Unit) <some_term>
    (/\(x :: <some_kind>) -> Unit) {<some_type>}
    unwrap (wrap <one_of_the_above>)
And note how the second one always compiles to the same untyped Core, so we get even less coverage in the end. And the last one always compiles to one of the first two, so even less coverage again.
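To see why the second form adds so little coverage, here is a sketch of its erasure, assuming the usual encoding where type abstraction erases to delay and type instantiation erases to force in untyped Plutus Core:

    (/\(x :: <some_kind>) -> Unit) {<some_type>}
    -- erases to the same untyped term regardless of which <some_type> was generated:
    force (delay Unit)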
We need to fix the NEAT generators, so that they produce terms of completely arbitrary types.
We have this in the NEAT tests:
    defaultFunTypes = Map.fromList
        [ ( TyFunG (TyBuiltinG TyIntegerG) (TyFunG (TyBuiltinG TyIntegerG) (TyBuiltinG TyIntegerG))
          , [AddInteger, SubtractInteger, MultiplyInteger, DivideInteger, QuotientInteger, RemainderInteger, ModInteger]
          )
        , ( TyFunG (TyBuiltinG TyIntegerG) (TyFunG (TyBuiltinG TyIntegerG) (TyBuiltinG TyBoolG))
          , [LessThanInteger, LessThanEqualsInteger, EqualsInteger]
          )
        , <...>
        ]
This is verbose and it’s super easy to miss a builtin in that pile. It shouldn’t be hard to generate all of those automatically from DefaultFun using the usual enumerate + toBuiltinMeaning business.
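A rough sketch of what that could look like (builtinTypeG is a hypothetical helper that renders a builtin's signature, as obtained from toBuiltinMeaning, into the NEAT type AST; imports of the Plutus-side types are omitted):

    import qualified Data.Map as Map

    -- Hypothetical helper: convert the signature of a builtin into a TypeG.
    builtinTypeG :: DefaultFun -> TypeG n
    builtinTypeG = error "sketch only"

    -- Derive the whole map by enumerating DefaultFun and grouping by type.
    defaultFunTypes :: Ord (TypeG n) => Map.Map (TypeG n) [DefaultFun]
    defaultFunTypes =
        Map.fromListWith (++)
            [ (builtinTypeG fun, [fun]) | fun <- [minBound .. maxBound] ]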
The NEAT machinery uses small-step semantics for normalizing types. That is inefficient, as the normalizer has to traverse the whole type over and over again, each time diving slightly deeper to reach the next redex. This makes the coverage of the NEAT tests smaller than it otherwise could be, but it's not clear by what margin; maybe it's not a big deal, maybe it is. It would be good to have a more optimized type normalizer in any case.
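For contrast, here is a toy, self-contained sketch (not the NEAT code) of a big-step, environment-based normalizer (normalization by evaluation) for a minimal type language with De Bruijn indices; it visits each subterm once instead of rescanning the whole type for the next redex after every step:

    -- Toy type syntax.
    data Ty = TVar Int | TLam Ty | TApp Ty Ty
        deriving (Show)

    -- Semantic values: closures for lambdas, neutrals for stuck applications.
    data Val = VLam [Val] Ty | VNe Ne
    data Ne  = NVar Int | NApp Ne Val

    -- Evaluate in an environment of values for the bound variables.
    eval :: [Val] -> Ty -> Val
    eval env (TVar i)   = env !! i
    eval env (TLam b)   = VLam env b
    eval env (TApp f a) =
        case eval env f of
            VLam env' b -> eval (eval env a : env') b
            VNe n       -> VNe (NApp n (eval env a))

    -- Read a value back into normal-form syntax; 'lvl' counts the binders passed.
    quote :: Int -> Val -> Ty
    quote lvl (VLam env b) = TLam (quote (lvl + 1) (eval (VNe (NVar lvl) : env) b))
    quote lvl (VNe n)      = quoteNe lvl n

    quoteNe :: Int -> Ne -> Ty
    quoteNe lvl (NVar k)   = TVar (lvl - k - 1)  -- convert a level back to an index
    quoteNe lvl (NApp n v) = TApp (quoteNe lvl n) (quote lvl v)

    -- Normalize a closed type.
    normalize :: Ty -> Ty
    normalize = quote 0 . eval []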
As a Plutus developer
I want to understand the testing strategy being used for comparing the Plutus evaluator with the Agda reference model.
So that I can check that the NEAT implementation is correct.
I doubt we're going to address these within the next several years or ever at all, so I'm closing the issue.
NEAT tests are flawed in quite a few ways, but given their rather low priority and the fact that probably nobody is actually going to be working on them I'll post all of the original GitHub tickets in this thread instead of creating an issue for each ticket, just so that I don't flood the issues list.