IntersectMBO / plutus

The Plutus language implementation and tools

[Epic] NEAT tests #6075

Closed: effectfully closed this issue 1 month ago

effectfully commented 5 months ago

NEAT tests are flawed in quite a few ways, but given their rather low priority and the fact that probably nobody is actually going to be working on them, I'll post all of the original GitHub tickets in this thread instead of creating an issue for each ticket, just so that I don't flood the issues list.

effectfully commented 5 months ago

Currently in NEAT tests we use De Bruijn indices to represent variables. This has the advantage of playing well with the whole lazy-search machinery; however, we do lose some coverage because of it, since we never test name shadowing.

Can we instead generate terms with regular TyNames and Names (modulo alpha)?
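
As a concrete illustration of the lost coverage, here's a toy sketch (not the NEAT AST): the two named terms below differ exactly in whether shadowing occurs, yet they convert to the same De Bruijn term, so an enumerator working on De Bruijn terms can never tell them apart.

import Data.List (elemIndex)

-- Toy named and De Bruijn lambda terms, purely for illustration.
data Named    = NVar String | NLam String Named deriving Show
data DeBruijn = DVar Int    | DLam DeBruijn     deriving (Eq, Show)

-- Standard conversion: a variable becomes its distance to its binder.
toDeBruijn :: Named -> DeBruijn
toDeBruijn = go []
  where
    go env (NVar x)   = maybe (error "free variable") DVar (elemIndex x env)
    go env (NLam x b) = DLam (go (x : env) b)

-- One term shadows, the other doesn't, yet both convert to
-- DLam (DLam (DVar 0)), so De Bruijn enumeration conflates them:
shadowing, nonShadowing :: Named
shadowing    = NLam "x" (NLam "x" (NVar "x"))
nonShadowing = NLam "x" (NLam "y" (NVar "y"))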

Should we test open terms too? E.g. type normalization has to handle them.

effectfully commented 5 months ago

Currently we only generate a single constant per built-in type in NEAT tests. An excerpt from a previous discussion (reproduced as a code snippet, because Jira hasn't yet reached the technological maturity required for handling the advanced concept of copy-pasting):

Here's how the stats look on master currently with no changes:

      181800
    typed CK vs untyped CEK produce the same output:
      181800

And here's how the stats look if

convertTermConstant (TmIntegerG i) = Some $ ValueOf DefaultUniInteger i

is changed to:

convertTermConstant (TmIntegerG i) = i `seq` (Some $ ValueOf DefaultUniInteger i)

      189908
    typed CK vs untyped CEK produce the same output:
      189908

+8k tests! And the same happens if we make TmIntegerG strict:

TmIntegerG !Integer

So how many integer constants do we actually generate on master? Only one constant: 0. Here's how to check that: changing

check (TyBuiltinG TyIntegerG) (TmIntegerG _) = true

to

check (TyBuiltinG TyIntegerG) (TmIntegerG i) = toCool $ i == 0

gives us exactly the same number of generated tests as with no changes on master at all:

      181800
    typed CK vs untyped CEK produce the same output:
      181800

And making it

check (TyBuiltinG TyIntegerG) (TmIntegerG i) = toCool $ i > 2^32

so that integers don't get generated at all gives us this:

      178294
    typed CK vs untyped CEK produce the same output:              OK (21.48s)
      178294

By the way, making TmStringG strict results in

      Exception: Cannot decode byte '\x80': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

That is, the enumerator for Text is not correct: it produces Text by decoding arbitrary bytes, which fails on invalid UTF-8. This can be fixed by building the Text from a String instead:

  enumerate = share $ fmap Text.pack access

Finally, making all constructors of TermConstantG strict gives us the following stats:

      193007
    typed CK vs untyped CEK produce the same output:
      193007

So, summarizing:

  1. we only generate one constant for each built-in type
  2. fixing that gives us 11207 new test cases, i.e. +6.2%
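
My understanding of the mechanism behind these numbers: the lazy-search machinery only refines the parts of a candidate that the predicate actually forces, so if nothing ever demands the Integer inside TmIntegerG, a single undefined placeholder passes the check and no concrete integers get enumerated. A minimal self-contained illustration (toy types, not the NEAT ones):

import Control.Exception (SomeException, evaluate, try)

data LazyConst   = LazyConst   Integer  -- field never forced by a match
data StrictConst = StrictConst !Integer -- the bang forces the field

-- Neither predicate inspects the Integer itself.
checkLazy :: LazyConst -> Bool
checkLazy (LazyConst _) = True

checkStrict :: StrictConst -> Bool
checkStrict (StrictConst _) = True

main :: IO ()
main = do
  -- The lazy field is never demanded, so one bottom-valued candidate
  -- already satisfies the predicate and the search stops refining it:
  print (checkLazy (LazyConst undefined))                    -- True
  -- The strict field is demanded by the pattern match, so the search
  -- has to supply an actual Integer, i.e. enumerate concrete values:
  r <- try (evaluate (checkStrict (StrictConst undefined)))
  print (r :: Either SomeException Bool)                     -- Left ...
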
effectfully commented 5 months ago

The type AST used in the NEAT generators represents type application like this:

data TypeG n
    = <...>
    | TyAppG (TypeG n) (TypeG n) (Kind ())

This is so that no inference needs to happen: the generated type only gets checked against a kind.

Generating those Kinds in TyAppG is most likely cheap; however, not generating them would be a step towards resolving the "Can we use our regular AST for NEAT tests? Should we?" question below, and that is supposed to give us great performance benefits.

Would something like this (pseudocode)

inferKind (TyAppG fun arg) = do
    domToCod <- inferKind fun
    case domToCod of
        KindArrow () dom cod -> cod <$ checkKind arg dom
        _                    -> throwError "expected a kind arrow"

checkKind (TyAppG fun arg) cod = do
    dom <- inferKind arg
    checkKind fun $ KindArrow () dom cod

work?

Should we throw errors impurely for better laziness and more eager pruning, and then force the nodes of the inferred kind/type in a check function and catch the thrown errors?
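
For concreteness, here's a runnable rendering of that bidirectional scheme over a deliberately simplified type AST (variables carry their kind directly, which the real generator would instead get from a context); it's only meant to show that the infer/check pair above hangs together:

import Control.Monad (unless)

data Kind = Star | KindArrow Kind Kind deriving (Eq, Show)

-- Simplified: no Kind stored in the application node.
data Ty
  = TyVar Kind   -- stand-in for a variable whose kind we can look up
  | TyFun Ty Ty
  | TyApp Ty Ty
  deriving Show

inferKind :: Ty -> Either String Kind
inferKind (TyVar k)   = Right k
inferKind (TyFun a b) = Star <$ (checkKind a Star *> checkKind b Star)
inferKind (TyApp f a) = do
  funKind <- inferKind f
  case funKind of
    KindArrow dom cod -> cod <$ checkKind a dom
    Star              -> Left "applied a type of kind *"

checkKind :: Ty -> Kind -> Either String ()
-- The interesting clause: check an application against its codomain by
-- inferring the argument's kind and pushing the arrow inwards.
checkKind (TyApp f a) cod = do
  dom <- inferKind a
  checkKind f (KindArrow dom cod)
checkKind ty expected = do
  actual <- inferKind ty
  unless (actual == expected) $
    Left ("kind mismatch: " ++ show actual ++ " vs " ++ show expected)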

effectfully commented 5 months ago

Currently we have an entire new AST for the whole Plutus Core language just so that we can run the NEAT tests over it, plus the separate type normalizer, type checker, etc. that come with it. Is this really necessary?

There’s a good reason why the NEAT machinery suggests you should do it this way: say you generate an application f x; then you don’t want to enumerate all possible fs if it’s impossible to generate a single suitable x. Instead you want to do a BFS: first ensure there’s at least one suitable f, then ensure there’s at least one suitable x, and only then proceed to generating all suitable pairs.

But in our case there’s always a type of a given kind and a term of a given type. So are we actually saving anything by using a BFS-compatible AST? Well, we can check it quickly:

These are the stats with the OF strategy (optimal short-circuiting and fairness):

  NEAT
    normalization commutes with conversion from generated types:  OK (9.40s)
      248588
    normal types cannot reduce:                                   OK (5.90s)
      602127
    type preservation - CK:                                       OK (14.80s)
      181800
    typed CK vs untyped CEK produce the same output:              OK (15.78s)
      181800

These are the stats with the D strategy (sequential):

  NEAT
    normalization commutes with conversion from generated types:  OK (8.95s)
      248588
    normal types cannot reduce:                                   OK (5.51s)
      602127
    type preservation - CK:                                       OK (147.08s)
      181800
    typed CK vs untyped CEK produce the same output:              OK (149.52s)
      181800

We can see that there’s no difference when it comes to testing type-level stuff and there’s a whopping 10x difference when it comes to testing term-level stuff.

Is this accidental or inherent? Can we make D perform as well as OF? If so, should we even bother? OF is more flexible, so perhaps we shouldn’t drive ourselves into a corner. On the other hand, if we could set up the NEAT tests to work with our normal AST, that would be a huge simplification of the machinery, and it would automatically speed everything up, because our normal type checker is fairly efficient.

effectfully commented 5 months ago

The NEAT machinery insists on exhaustively generating all values satisfying a predicate, but do we really want to enumerate all the constants of built-in types? From previous experiments it’s clear that generating all the constants, as opposed to generating only one per built-in type, gives us +6.2% in the total number of test cases, which I think is fine. However, James ran into some problems when adding support for lists and pairs (the number of test cases blowing up), so maybe we should consider somehow integrating QuickCheck's Arbitrary into our NEAT tests. Although I’m not sure what problems James actually ran into; maybe it wasn’t about constants, but rather type checking or whatever. In any case, we should consider using Arbitrary within NEAT, because I doubt we need to exhaustively traverse the whole space of constants of built-in types.

effectfully commented 5 months ago

The NEAT generators are not very optimized, but are they a bottleneck? Maybe getting the inputs takes only a fraction of the time we spend running the actual properties. We should investigate that.

effectfully commented 5 months ago

Currently we only generate terms of type Unit in the NEAT tests, which restricts the shapes of the terms that get generated at all. In particular, no built-in function returns a Unit, so I’d guess most terms that we get are of these three forms (in the "readable" syntax):

(\(x : <some_type>) -> Unit) <some_term>
(/\(x :: <some_type>) -> Unit) {<some_type>}
unwrap (wrap <one_of_the_above>)

And note how the second one always compiles to the same untyped Core regardless of the type argument (type abstraction erases to delay and type instantiation to force), so we get even less coverage in the end. And the last one always compiles to one of the first two, since wrap and unwrap are erased, so even less coverage again.

We need to fix the NEAT generators, so that they produce terms of completely arbitrary types.

effectfully commented 5 months ago

We have this in the NEAT tests:

defaultFunTypes = Map.fromList [(TyFunG (TyBuiltinG TyIntegerG) (TyFunG (TyBuiltinG TyIntegerG) (TyBuiltinG TyIntegerG))
                   ,[AddInteger,SubtractInteger,MultiplyInteger,DivideInteger,QuotientInteger,RemainderInteger,ModInteger])
                  ,(TyFunG (TyBuiltinG TyIntegerG) (TyFunG (TyBuiltinG TyIntegerG) (TyBuiltinG TyBoolG))
                   ,[LessThanInteger,LessThanEqualsInteger,EqualsInteger])
                  <...>
                  ]

This is verbose, and it’s super easy to miss a builtin in that pile. It shouldn’t be hard to generate all of those automatically from DefaultFun using the usual enumerate + toBuiltinMeaning business; a sketch follows.
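
A hedged sketch of what that could look like; builtinTypeG is a hypothetical helper standing in for whatever part of the toBuiltinMeaning machinery yields the TypeG-shaped signature of a builtin, and the map key relies on the Ord instance for TypeG that the hand-written table already uses:

import qualified Data.Map.Strict as Map

-- Hypothetical: compute the generator-level signature of a builtin.
builtinTypeG :: DefaultFun -> TypeG n
builtinTypeG = error "sketch only"

-- Enumerate every DefaultFun (it has Enum and Bounded instances) and
-- group the builtins by their signature, so that none can be missed.
defaultFunTypes :: Ord n => Map.Map (TypeG n) [DefaultFun]
defaultFunTypes =
  Map.fromListWith (++)
    [ (builtinTypeG fun, [fun]) | fun <- [minBound .. maxBound] ]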

effectfully commented 5 months ago

The NEAT machinery uses small-step semantics for normalizing types. That is inefficient, as the normalizer has to traverse the whole type over and over again, each time diving slightly deeper to reach the next redex. This makes the coverage of the NEAT tests smaller than it otherwise could be, but it’s not clear by what margin; maybe it’s not a big deal, maybe it is. It would be good to have a more optimized type normalizer in any case.
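
One standard alternative is a normalization-by-evaluation style normalizer, which traverses the type once with an environment instead of repeatedly rescanning for the next redex. A hedged toy sketch over an untyped lambda fragment (the real PLC type normalizer would of course be kinded and work over the actual type AST):

-- Toy types: variables, lambdas and applications only.
data Ty = TVar String | TLam String Ty | TApp Ty Ty deriving Show

-- Semantic values: functions become actual Haskell functions; stuck
-- applications headed by a variable stay behind as neutrals.
data Val     = VLam String (Val -> Val) | VNeutral Neutral
data Neutral = NVar String | NApp Neutral Val

eval :: [(String, Val)] -> Ty -> Val
eval env (TVar x)   = maybe (VNeutral (NVar x)) id (lookup x env)
eval env (TLam x b) = VLam x (\v -> eval ((x, v) : env) b)
eval env (TApp f a) = case eval env f of
  VLam _ g   -> g (eval env a)                -- beta-reduce on the fly
  VNeutral n -> VNeutral (NApp n (eval env a))

-- Read a value back into normal-form syntax; the counter is a crude
-- fresh-name supply, good enough for a sketch.
quote :: Int -> Val -> Ty
quote i (VLam x g)   =
  let x' = x ++ "_" ++ show i
  in TLam x' (quote (i + 1) (g (VNeutral (NVar x'))))
quote i (VNeutral n) = quoteNe n
  where
    quoteNe (NVar x)   = TVar x
    quoteNe (NApp m v) = TApp (quoteNe m) (quote i v)

normalizeTy :: Ty -> Ty
normalizeTy = quote 0 . eval []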

effectfully commented 4 months ago

As a Plutus developer

I want to understand the testing strategy being used for comparing the Plutus evaluator with the Agda reference model.

So that I can check that the NEAT implementation is correct.

effectfully commented 1 month ago

I doubt we're going to address these within the next several years or ever at all, so I'm closing the issue.