erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License

python PEG grammar #177

Open adsharma opened 3 years ago

adsharma commented 3 years ago

Python has a PEG grammar here:

https://github.com/python/cpython/blob/master/Grammar/python.gram

That grammar uses a slightly different format. I'm looking to parse it using parsimonious. My script massages the grammar above to something close to what this module expects. But two issues remain:

- CPython uses ':' for rules and you seem to use '='
- CPython uses '|' for alternatives and you seem to use '/'

Has anyone looked into reconciling these two formats and using the package to parse Python code itself?

adsharma commented 3 years ago

Cleaned up grammar produced by my script:

https://paste.ubuntu.com/p/ftbMmhB5fV/

goodmami commented 3 years ago

Python's new PEG parser ("pegen") and its syntax are described here: https://www.python.org/dev/peps/pep-0617/#syntax

The syntax is based on that of the older LL(1) ("pgen") parser; it was retained and extended for pegen because, apparently, GvR likes it (source). So pegen's `:` is equivalent to parsimonious's `=`, and `|` is equivalent to `/`.
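As a rough illustration of that mapping, here is a minimal sketch of rewriting one simple pegen rule into the `=` and `/` form. The helper name `pegen_rule_to_parsimonious` is hypothetical, and the sketch assumes a single-line rule with no actions (`{ ... }`), no type annotations on the rule name, and no escaped quotes inside literals. Quoted literals are left alone so that a literal `':'` or `'|'` in the grammar is not rewritten by mistake.

```python
import re

# Matches a simple single- or double-quoted literal (assumption: no escaped quotes inside).
_LITERAL = re.compile(r"""('[^']*'|"[^"]*")""")


def pegen_rule_to_parsimonious(line):
    """Hypothetical helper: rewrite 'name: a | b' as 'name = a / b'."""
    # The rule name comes first, so the first ':' is the rule separator.
    name, sep, body = line.partition(":")
    if not sep:
        raise ValueError("expected a rule of the form 'name: alternatives'")

    # re.split with a capturing group keeps the literals at the odd indices.
    parts = _LITERAL.split(body)
    converted = [
        part if i % 2 == 1 else part.replace("|", "/")
        for i, part in enumerate(parts)
    ]
    return f"{name.strip()} = {''.join(converted).strip()}"


print(pegen_rule_to_parsimonious("slice: expr ':' expr | expr"))
# slice = expr ':' expr / expr
```

This only covers the two differences raised above; other pegen features would need their own handling before the result is usable as a grammar.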

More interesting is that pegen is not a scannerless PEG parser (e.g., note that NAME is not defined by the grammar). It must first tokenize the input, then it uses the PEG rules to parse the tokens. See https://docs.python.org/3/library/token.html for the valid tokens. If you want to parse Python character by character, you'll need to write rules for those tokens as well.
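To get a feel for the token stream a tokenize-first parser works from, the standard library's `tokenize` module can list the tokens for a small snippet. This is only a sketch: it assumes the output of `tokenize` is a reasonable stand-in for what pegen consumes, and the helper name `token_names` is made up for the example.

```python
import io
import tokenize


def token_names(source):
    """Return the token type names that the stdlib tokenizer produces for `source`."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


print(token_names("x = 1\n"))
# ['NAME', 'OP', 'NUMBER', 'NEWLINE', 'ENDMARKER']
```

A character-level grammar would need its own rules for each of these token kinds (names, numbers, operators, newlines, and so on), since the pegen grammar takes them as given.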