erezsh / plyplus

a friendly yet powerful LR-parser written in Python
267 stars 56 forks source link

Is it possible to parse Python grammar using recursive descent? #46

Closed pfalcon closed 6 years ago

pfalcon commented 6 years ago

I'm sorry of this looks like a random offtopic question. Indeed, it mostly is. But this question drills my mind for a while, and google is not helpful with it at all - it returns hits for parsers written in Python, not parsers for Python grammar.

So, seeing it in README:

It supports post-tokenizing code, so it's capable of parsing python (it comes with a ready-to-use python parser).

, I wonder, would you have an idea if it's possible to parse Python with a recursive descent parser, without explicit AST construction?

Why I'm asking: in https://github.com/micropython/micropython/, AST construction accounts for a large part of a limit on maximum file size parsed. While using AST allows for cool things like compile-time optimizations, it would be nice to have a memory-efficient alternative parser too.

Thanks in advance for any hints!

erezsh commented 6 years ago

It's possible to parse Python with recursive descent, but I recommend that you take a better approach, and parse it with LALR(1). For reference, here is Lark's entire LALR(1) parser implementation (<100 lines), which I wrote: https://github.com/erezsh/lark/blob/master/lark/parsers/lalr_parser.py

It's a simple state-machine + stack, so it shouldn't be too hard to translate it into C. LALR(1) runs in O(n), and it's very suitable for C, which makes it literally the fastest algorithm you can write. It also uses very little memory, and it's probably the best in that as well.

Of course, a parse table has to be created, but that can be done in compile time, and the resulting table is relatively small. You can even use Lark to generate it.

You'll also have to implement a lexer that can count indentations, but that's relatively straightforward, especially if you use regexps.

pfalcon commented 6 years ago

OMG, thanks for quick response! I missed a notification and only now remembered that I've asked it (blush).

It's possible to parse Python with recursive descent

Great, thanks for reassuring me on that, maybe then one day I'll try to implement it...

but I recommend that you take a better approach, and parse it with LALR(1). It's a simple state-machine + stack

Right, and MicroPython already implements one of LR parsers using explicit stack automaton: https://github.com/micropython/micropython/blob/master/py/parse.c , https://github.com/micropython/micropython/blob/master/py/grammar.h#L27

The problem is that managing explicit stack still leads to overhead, the biggest of which being memory fragmentation overhead. Just consider that MicroPython is designed for systems with e.g. 16KB of total memory. Say, 4K of that is reserved for C stack, and mere 12KB for heap. If we start to allocate parser stack on the heap, heap becomes fragmented, and user app can't allocate large enough objects. Instead, all parsing state could be kept on C stack, and that's exactly what recursive decent parser does.

Summing up, there appears to be 3 choices:

  1. LR parser + AST - current MicroPython's approach.
  2. Keep LR (LALR) parser, but avoid AST and generate bytecode directly.
  3. Use LL/recursive decent parser and generate bytecode directly.

3rd approach seems to be the most frugal, and that's my idea - if going for implementing an alternative parser, go for a solution which promises the most savings. Maybe it needs more consideration though.

Thanks again for discussion!