gallais / idris-tparsec

TParsec - Total Parser Combinators in Idris
https://gallais.github.io/idris-tparsec/
GNU General Public License v3.0
93 stars 10 forks source link

Error reporting #6

Closed clayrat closed 6 years ago

clayrat commented 6 years ago

This currently seems to be the biggest feature missing from TParsec. Do you have any suggestions on how to implement this better? Maybe by changing the Success structure to be a variant type Result = Error | Success?

gallais commented 6 years ago

I gave some ideas in gallais/agdarsec#8 but never got around to trying them out.

clayrat commented 6 years ago

I've started the port of parameters branch.

gallais commented 6 years ago

With hindsight, I would tackle the problem in two steps:

The introduction of Parameters already raises some issues: I ended up using Subset Char P.Tok in Text.Parser.Char for instance because that made instance resolution easier.

clayrat commented 6 years ago

Yes, that's the conclusion I've also reached after I started porting the examples :) In Idris the problem is not only with widening combinators and introducing Subsets, but also with types and implicits for running and constructing concrete parsers, e.g., for now I could only make everything work with code like this:

parseMaybe : (str : String) -> 
             (MonadRun mn, Instrumented p mn, Tokenizer (Tok p), 
              Subset (SizedList (Tok p) (length $ tokenize {tok = Tok p} str)) (Toks p n)) =>
             (All (Parser p mn a)) -> Maybe a
parseMaybe @{str} @{mr} @{is} @{to} @{su} par =
  let 
    input = MkSizedList $ tokenize @{to} str 
    result = runParser par lteRefl (into @{su} input)
    valid  = \s => toMaybe (Size s == Z) (Value s)
    in
  traverse valid (runMonad result) >>= head'
PAR' : All (Parser' ())
PAR' = fix (Parser' ()) $ \rec => 
        alts [ cmap () ((exact {p=Params} {mn=Maybe} LPAR `andm` rec) `land`
                        (exact {p=Params} {mn=Maybe} RPAR `andm` rec))
             , cmap () ((exact {p=Params} {mn=Maybe} LCUR `andm` rec) `land`
                        (exact {p=Params} {mn=Maybe} RCUR `andm` rec))
             , cmap () ((exact {p=Params} {mn=Maybe} LSQU `andm` rec) `land`
                        (exact {p=Params} {mn=Maybe} RSQU `andm` rec))
             ]
clayrat commented 6 years ago

I'm actually inclined to try moving tok/toks back to the Parser type and leaving only the annotation in Parameters, this might help a lot with the resolution.

gallais commented 6 years ago

This is strange: in PAR' all the ps and mn should be solved by unification! :/

gallais commented 6 years ago

As for the runner, I think you could legitimately demand that the user provides a function of type: (ts : List Tok) -> Toks (lenght ts).

Or you could introduce an ISubset:

record ISubset (S T : i -> Type) where
  iinto : {j : i} -> S j -> T j

and make the constraint more general: ISubset (SizedList (Tok p)) (Toks p). This way you don't have to tokenize the string in the type which is... scary.

clayrat commented 6 years ago

This is strange: in PAR' all the ps and mn should be solved by unification! :/

Ok, turns out mn is redundant indeed, but you still need to match on p. The problem is, AFAIU, that Idris doesn't expand the definition of records when running unification, thus Tok p doesn't evaluate to PAR because it doesn't want to look at what p is.

gallais commented 6 years ago

Unification not valid modulo beta-conversion? This sounds like a bug.

I have to go run some errands but I'll try to have a look later tonight.

clayrat commented 6 years ago

The funny thing is that if you do something like this

PAR' : All (Parser' ())
PAR' = fix (Parser' ()) $ \rec => cmap () $ exact ?wat

inspecting the hole gives you everything you want:

  j : Nat
  j1 : Nat
  rec : Box (Parser (MkParameters PAR (SizedType (List PAR)) () Void) Maybe ()) j1
  mn : Type -> Type
  p : Parameters
  constraint : Functor Maybe
  constraint1 : Instrumented (MkParameters PAR (SizedType (List PAR)) () Void) Maybe
  j2 : Nat
  constraint2 : Alternative Maybe
  constraint3 : Monad Maybe
  constraint4 : Instrumented (MkParameters PAR (SizedType (List PAR)) () Void) Maybe
  constraint5 : Inspect (SizedType (List PAR)) PAR
  constraint6 : Eq PAR
--------------------------------------
wat : PAR

but as soon as you plug it with a constructor, you get

        Type mismatch between
                PAR
        and
                Tok p
clayrat commented 6 years ago

I've pushed the current state of affairs to parameters branch if you'd like to take a look yourself.

gallais commented 6 years ago

Ok, I think I may understand what the issue is. You can get a similar error in the REPL if you call exact LPAR with :t:

*Examples/Parentheses> :t exact LPAR
Can't find implementation for Eq (Tok p)

Possible cause:
        (input):1:4-13:When checking an application of function TParsec.Combinators.exact:
                Type mismatch between
                        PAR (Type of LPAR)
                and
                        Tok p (Expected type)

I think the problem is that Idris tries to infer the type of exact LPAR and faces a unification constraint of the form PAR ~ Tok p. Instead of delaying the resolution of the constraint to a later point in time it fails immediately. But if it had waited it could have solved other unification problems which would have provided the extra information needed for Tok p to become Tok Params which can compute to PAR.

The solution I've pushed is to use a variable instead of a concrete value. In that case Idris can simply assign the variables the type Tok p which later gets refined to Tok Params thanks to other successful unifications. So by introducing a let-binding we can get Idris to assign the expression a type of the form PAR -> PAR -> ....

lRrR = \ p, q => cmap () ((exact p `andm` rec) `land` (exact q `andm` rec))

You can recreate the problem by replacing p in the body of lRrR with the PAR p. We once again force Idris to consider an problem of the form PAR ~ Tok p and it doesn't like it at all!

*Examples/Parentheses> :e
Type checking ./Examples/Parentheses.idr
./Examples/Parentheses.idr:41:52-57:
   |
41 |   let lRrR = \ p, q => cmap () ((exact (the PAR p) `andm` rec) `land` (exact q `andm` rec))
   |                                                    ~~~~~~
When checking right hand side of PAR' with expected type
        Parser (MkParameters PAR (SizedType (List PAR)) () Void) Maybe () j

When checking an application of function Induction.Nat.fix:
        Type mismatch between
                PAR
        and
                Tok p1

Holes: Examples.Parentheses.PAR'
clayrat commented 6 years ago

Nice, thanks! I should add this to my personal collection of Idris hacks, which already includes such gem as "replacing foo x y with let t = foo x y in t" - this one, sadly, doesn't work in this case :)

gallais commented 6 years ago

I'm intrigued. When is it useful to replace foo x y with let t = foo x y in t?

clayrat commented 6 years ago

For example, here: https://github.com/clayrat/idris-datadata/blob/master/src/DataData/Lec2.idr#L76 There were a few other instances, but this is the most recent one.

clayrat commented 6 years ago

I realize this particular case is very similar to what we have above (let-binding a function), but there definitely were instances where t was a plain value that refused to typecheck otherwise, I just can't find them now somehow :(

clayrat commented 6 years ago

Ok, seems the unification bug is happening only when you introduce custom tokens, and a mechanical workaround is to provide "specialized" versions of token combinators, like these:

exactTOK : TOK -> All (Parser' TOK)   
exactTOK = exact

maybeTOK : (TOK -> Maybe Char) -> All (Parser' Char)   
maybeTOK = maybeTok
clayrat commented 6 years ago

So looks like I'm done with the port (at least of the current state in agdarsec), do you have any suggestions?

gallais commented 6 years ago

I haven't tested it but I think that the trick now is to define a monad transformer that looks something like:

ParserT m = StateT (Position * List Annotation) (EitherT (Position * List Annotation) m)

and to make sure that the Alternative (ParserT m) instance implements empty by grabbing the current state and returning that as the failure.

To find the motivation to write a position-aware example I've started a new project (https://github.com/gallais/agdarky) yesterday evening. I now have the scopechecker & typechecker ready and the next step is probably the parser. So pretty soon agdarsec should be moving forward some more with ParserT.

gallais commented 6 years ago

I now have something I am pretty happy with in agdarky: https://github.com/gallais/agdarky/blob/master/src/stlc/Parse.agda#L73

It is a more informative version of the STLC example. It is also perhaps a bit confusing because I am encoding the parse result using my generic-syntax library (this allows me to get scopechecking for free and then typechecking, renaming, substitution, printing on the terms for free or relatively cheap).

Note that to make use of Instrumented's getPosition to annotate my parsed term with source position information I had to add new combinators to agdarsec: https://github.com/gallais/agdarsec/commit/a84181ee0c7df0fcd1bc9fdd25629cc0f0026787

clayrat commented 6 years ago

iI've tried applying these ideas to the STLC example in https://github.com/gallais/idris-tparsec/commit/623aaf9e4b3fb471d7c3330fd672363c325e587b, not sure if I got it right?

UPD: Seems like it's not, the returned position never goes beyond 0:1 :/

UPD2: Ah, I should've probably stored the positions in the parsed AST, not threaded them through. Still I wonder why it's not increasing. I suspect it's picking up a wrong implementation for the state monad...

gallais commented 6 years ago

Ah, I should've probably stored the positions in the parsed AST

Yeah, I think that would make the example more convincing.

Still I wonder why it's not increasing.

It's picking up the position when going through lam's mand getPosition (withSpaces var) and ignores the rest because of the recursive call in language which drops the Position with map snd . val.

So you're getting the position right after the '\' i.e. (0, 1).

clayrat commented 6 years ago
Examples/STLCins> parse' "\\ x . (\\ y . y : 'a -> 'a) x" (val stlc)
Right (Lam (MkPosition 0 1)
           "x"
           (Emb (MkPosition 0 6)
                (App (MkPosition 0 27)
                     (Cut (MkPosition 0 15)
                          (Lam (MkPosition 0 8) "y" (Emb (MkPosition 0 13) (Var "y")))
                          (ARR (K "a") (K "a")))
                     (Emb (MkPosition 0 27) (Var "x"))))) : Either Error Val

It seems to work now! Though all the errors are still reported at 0:1 for some reason. Also it's a bit weird that Cut is annotated at colon in col 15 (though this seems expected) and App at col 27?

gallais commented 6 years ago

Though all the errors are still reported at 0:1 for some reason.

Strange indeed.

Also it's a bit weird that Cut is annotated at colon in col 15 (though this seems expected) and App at col 27?

In the parser the position of a Cut is considered to be the colon : and the position of an App is considered to be the space between the function and its argument. The Cut is in the function part of the application so it makes sense that its position would come before that of the application.

gallais commented 6 years ago

I think I may understand what is happening.

When you write the simple grammar distinguishing two strings:

G := "ab" | "ba"

If the input string is "ac" then you enter the first branch, match 'a' and then fail on 'b'. So you roll-back to try the second branch and fail immediately on 'b'. Instead of reporting that error occurred in the first branch at the second character, you report that it has happened on the second branch at the first character.

I suppose one solution would be a mechanism to "commit" to one branch (and completely ignore the alternatives set up previously). In the STLC case the grammar is unambiguous so committing as soon as one has read a '\' or an opening parenthesis should work.

Alternatively we could tweak the disjunction so that it doesn't throw away the error in the first branch like it currently does:


  (ST a) <|> (ST b) = ST $ \pos => case a pos of 
                                     Left _ => b pos
                                     Right x => Right x

We could instead write something like this:

  (ST a) <|> (ST b) = ST $ \pos =>
    case a pos of 
      Right x => Right x
      Left e1 => case b pos of
        Right x => Right x
        Left e2 => Left $ heuristic e1 e2             

where heuristics picks its favourite error out of the two e.g. the "deepest" one.

This could be done with the current apparatus by comparing the positions and keeping the one the furthest away from the start. Or we could add a Nat to the state + error measuring how deep the context was when the failure was recorded (we'd increase that Nat each time we go under a \\ or an opening parenthesis for instance)

clayrat commented 6 years ago

It sounds like both mechanisms could be useful on their own - for unambiguous grammars the first one should probably be faster, while the second would help with ambiguous ones.

gallais commented 6 years ago

What I like about this is that it's all purely in the monad instance. So we don't actually have to touch the combinators to change the behaviour of the parser!

clayrat commented 6 years ago

Nice, I've ported the commit example. Next, I guess it'd be cool to cook up an example of a heuristic-driven reporting for an ambiguous case, and then move the relevant parts to the core. Then we can merge to master and finally close the issue :)

clayrat commented 6 years ago

I've added a trivial example for the heuristic error: https://github.com/gallais/idris-tparsec/blob/parameters/src/Examples/Ambig.idr. We could replace it with the Nat-indexed version for STLC, as you described above; or maybe there's some other non-trivial ambiguous grammar you can think of?

gallais commented 6 years ago

As far as I can tell we are only using recordToken from Instrumented. I think we should drop the other constraints as well as the Ann and Pos types in Parameters. I also think it would be worth dropping Instrumented altogether and putting recordToken into Parameters.

End users can always recover the same functionality based on the monad stack they are using. We should make that clear in the examples but we do not need to bake it into a specific constraint.

And that should yield the final design for this new version!

gallais commented 6 years ago

We're almost there: https://github.com/gallais/agdarsec/pull/11

The newer things are in Text.Parser.Monad where we have different ready-to-use stacks:

clayrat commented 6 years ago

So, I've ported your latest PR in https://github.com/gallais/idris-tparsec/commit/632fafd419ed02c47a902863874567d37bc3478a. Made some changes - since implementations have to be constructors, all transformers need wrapping in records. Also I've put most of the new stuff directly in TParsec.Types, moving Result out. Now I just need to port the instrumented version of STLC, and think up something about ambiguous example, do you have any advice on that one?

gallais commented 6 years ago

Fixed by bd740fef2a419d1d3a42710f86b4c4f1761a6ffa