Immediate overlapping tokenizations of DFA inputs when doing incremental GLR

cronburg / antlr-haskell

A language parsing quasiquoter for Haskell based heavily on ANTLR4.

Other

49 stars 7 forks source link

Immediate overlapping tokenizations of DFA inputs when doing incremental GLR #31

Open cronburg opened 5 years ago

cronburg commented 5 years ago

The incremental GLR parsing algorithm doesn't take into account every possible alternative tokenization when looking for one that will result in a successful parse. Namely if two lexical rules can both match on the same terminal string, and both of them are feasible (based on a lookahead of just the one terminal symbol) then the incremental GLR parser currently makes the first one listed in the G4 grammar take precedence, never trying the second one and reporting a parse failure if the first one became infeasible due to subsequent attempted parsing.

Incremental GLR extension: either have the tokenizer return a (lazy) list of terminal tokens that successfully match on the current input (in order of precedence), and/or have the GLR parser recall the tokenizer with one fewer DFA name when parseResults does not contain any successful parses.

cronburg commented 5 years ago

Coding difficulty: medium (simple logical change to glrParsInc', including possible change to its type signature with what a Tokenizer does)
Mental difficulty: medium-low (glrParseInc' is a self-contained, small piece of code)

cronburg commented 5 years ago

To get started on this issue, check out the unit test in test/lr/GLRInc.hs and run it with:

stack test antlr-haskell:lr

to see the error message:

ErrorNoAction: Current input = ' '
       Current state = <2>        
       Rest of input = ' - foo'

Note that if you swap the Prim and LowerID lexeme declarations in GLRInc.hs the test case should pass with the expected behavior (a successful parse), but that simple fix defeats the larger purpose of this issue.