refine Python grammar - Githubissues

mw66 commented 2 years ago

A couple of fixes:

-- <- to < -- https://docs.python.org/3/reference/grammar.html in many places use:

  ','.expression+

I'm not sure if this is standard PEG syntax, anyway, translated into

  ((expression ("," expression)*)

Actually the grammar is still not working, the parsing output in the unittest is wrong:

Running python-test-library 
Python[0, 0][]
 +-Python.file[0, 0][]

However, I noticed Pegged is really slow, without 'lowmem' it run out of memory (after consume ~32GB); and with lowmem, it took ~5min to build. This make the trial-and-error process to debug very time consuming, esp. considering I'm a Pegged newbie.

I also tried to use the Python library parsimonious to help, and found it's really fast and helpful to identify errors; but unfortunately, due to this issue that parsimonious does not consume (skip) the white spaces by default:

https://github.com/erikrose/parsimonious/issues/193

it would make no sense if I try to add meaningless whitespaces rules _ to the current grammar, so I stopped here. I.e. parsimonious no longer complain about any errors, but I won't able to parse any real Python code to test.

@veelo can you help from here if you have time? maybe it's easier for your expert eyes to see the problem.

Also, I'm wondering if this compile-time parser generator worth it? or does Pegged have a library mode, i.e. generate parsing tree at runtime on the given grammar file. Will this be faster? For comparison, parsimonious parse (and report error) almost instantly in seconds, which makes the development process much faster.

Thanks.

mw66 commented 2 years ago

BTW, I found a working Python PEG parser here:

https://github.com/we-like-parsers/pegen/blob/main/data/python.gram

but it uses a different PEG syntax.

It's a working parser, and I have tried it to parse really complex py source code.

BTW, generate from Python PEG grammar to python_parser.py only take less than a second:

pegen/src$ time python -m pegen  ../data/python.gram -o ../data/python_parser.py
...

real    0m0.560s
user    0m0.531s
sys 0m0.017s

Once again, I'm amazed by how much resource (memory & cpu time) that Pegged requires.

veelo commented 2 years ago

For practical parsers, IMO you should always use https://github.com/PhilippeSigaud/Pegged/wiki/Grammars-as-D-Modules. Compile-time parsing is rarely useful. See https://github.com/PhilippeSigaud/Pegged/wiki/Using-a-Grammar#runtime-parsing for runtime parsing.

Also, I think your grammar lacks eoi, which it might need, see https://github.com/PhilippeSigaud/Pegged/wiki/Predefined-Parsers.

On speed, to things. Firstly, since the test and the grammar are in the same file, you are measuring both the time it takes to generate the parser and the time it takes to parse the input. Using asModule will help you separate the two. Secondly, Pegged is known for not being fast, sadly. I suspect that a big reason for that is that it is based on string-based switch statements. It would be very interesting to consider a redesign of Pegged to switch on enums, but I don't have that kind of time...

You might find some help in https://github.com/PhilippeSigaud/Pegged/wiki/Optimizations. You can also enable tracing to get an insight into the parsing process.

dlang-community / Pegged

refine Python grammar #310