ALL(*) improvements intended to deal with GLR performance issues

cronburg commented 5 years ago

[ ] Replace AST type with (basically identical) antlr-haskell AST type.
[ ] Call ALL(*) from the 'g4_codeGen' function when parsing G4 things (instead of GLR which grinds to a halt on large grammars).
[ ] Make ALL(*) give the tokenizer information about the current state of the parser (much like in incremental GLR).
[ ] Unify the tokenization methods of ALL(*) and GLR into a single module (probably under src/Text/ANTLR/Lex/Tokenizer.hs)

tmh2211 commented 5 years ago

Want to use your PG for a study project. It parses correctly, but when using glrParse it takes almost a minute to parse an input string with only one token. Is there a workaround?

Edit: Used glrParseFast, same result

cronburg commented 5 years ago

@tmh2211 - you have two options right now:

1) Use one of the other parser algorithms (slrParse, allstarParse, lr1Parse, predictiveParse, I forget if there are others...)

2) Write the grammar so as to avoid ambiguities (which GLR as an algorithm can handle, but handles very slowly in this implementation). For example the "chisel" example in test/chisel runs glrParse on an input with over 100 tokens and takes about 7 seconds to run:

$ time stack test :chisel
...
ResultAccept (AST NT_chiselProd [NT NT_prodSimple] [AST NT_prodSimple [NT NT_prodID,NT NT_formals,T T_2,NT NT_group] [AST NT_prodID [T T_UpperID] [Leaf (Token T_UpperID (V_UpperID "Foo") 3)],AST NT_formals [T T_LowerID] [Leaf (Token T_LowerID (V_LowerID "x") 1)],Leaf (Token T_2 V_2 2),AST NT_group [NT NT_groupExp1] [AST NT_groupExp1 [NT NT_arith,NT NT_prodApp] [AST NT_arith [T T_LowerID] [Leaf (Token T_LowerID (V_LowerID "x") 1)],AST NT_prodApp [NT NT_prodID] [AST NT_prodID [T T_UpperID] [Leaf (Token T_UpperID (V_UpperID "Bar") 3)]]]]]])
Parse Test (Small): [OK]
Parse GHC: [OK]
Prettify speed: [OK]
Run glrParseFast build: [OK]

         Test Cases  Total      
 Passed  7           7          
 Failed  0           0          
 Total   7           7          

antlr-haskell-0.1.0.0: Test suite chisel passed

real    0m4.600s
user    0m7.720s
sys 0m0.873s

There's definitely a performance bug related to GLR somewhere (I ran into exactly the issue your describing with only one or two tokens of input taking minutes to parse), but I don't know why yet. The way I've avoided this until now is to only write grammars such that valid inputs to the parser should always produce a parse forest of size one (i.e. only one valid parse, with no ambiguities). This workaround has its own performance issues, but that's less of a bug and more of a result of the algorithm doing backtracking as well as data structure choices e.g. for sets as well as the core algorithm code not being generated (see src/Text/ANTLR/LR.hs).

cronburg / antlr-haskell

ALL(*) improvements intended to deal with GLR performance issues #40