dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.
534 stars 66 forks source link

Which parsing algorithm is used? Is it possible to add more efficient ones? #38

Closed jkm closed 12 years ago

jkm commented 12 years ago

I have a grammar where some input is parsed very slowly. Consider a grammar like:

Expr < Plus / Minus / Term
Plus ...
Minus ...

Term < Mul / Div / Factor
Mul ...
Div ...

Factor < UnaryMinus / UnaryPlus / Function
UnaryMinus ...
UnaryPlus ...

Function < BinaryFunction / UnaryFunction / Primary
BinaryFunction < Identifier '(' Expr ',' Expr ')'
UnaryFunction < Identifier '(' Expr ')'

Assume you want to parse the expression min(0, max(2, 3)). Then each alternative is checked first. E.g. before "min" is parsed 3 * 3 * 3. I.e it is especially costly if the last alternative will always succeed. I assume this why pegs can take exponential time to parse. I think a packrat parser solves those issues by some memory trade-off. For some (probably most) grammars a LR parser may work even better in practice. Are there any plans to support more efficient parsing schemes? Are they possible to integrate with the current design?

PhilippeSigaud commented 12 years ago

On Mon, Aug 13, 2012 at 11:35 AM, jkm notifications@github.com wrote:

Assume you want to parse the expression min(0, max(2, 3)). Then each alternative is checked first. E.g. before "min" is parsed 3 * 3 * 3. I.e it is especially costly if the last alternative will always succeed. I assume this why pegs can take exponential time to parse.

Yes.

I think a packrat parser solves those issues by some memory trade-off. For some (probably most) grammars a LR parser may work even better in practice. Are there any plans to support more efficient parsing schemes? Are they possible to integrate with the current design?

Yes also. Pegged is just a parser generator and I knew just a little bit about PEGs until now. I'm trying to become more literate in general parsing technology (LL(1), LALR, GLR, LL(*), and so on).

My current goal is to have PEG generate a LL(1) or LALR(1) parser. GLR could come afterwards, and a fully-backtracking recursive-descent parser, for ambiguous grammars. I'll see. The basic idea is the same: enter a grammar in a natural way and call grammar!(someCompileTimeArgument)(grammarText), where someCompileTimeArgument will most probably serve to select the parser generator.

It'll take a 2 months, at least, to get LALR and LL(1) working. I have some very basic code, but nothing functional right now.

jkm commented 12 years ago

I see. Nice to hear that you are already aware of it. Even though it makes may current development cycle slow for certain input I can stay away from these inputs for the time being and work on other things first.

PhilippeSigaud commented 12 years ago

On Mon, Aug 13, 2012 at 4:02 PM, jkm notifications@github.com wrote:

I see. Nice to hear that you are already aware of it. Even though it makes may current development cycle slow for certain input I can stay away from these inputs and work on other things.

I can try to implement packrat parsing. That's what should require the lowest amount of modifications.

jkm commented 12 years ago

It's not that urgent and having a LALR or LL(1) parser is more important in the long run IMHO. Currently it's sufficient for me to know where the problem is. I just was surprised by the run-time behavior. From my point you can safely follow your schedule and goals.