TheLartians / PEGParser

💡 Build your own programming language! A C++17 PEG parser generator supporting parser combination, memoization, left-recursion and context-dependent grammars.
BSD 3-Clause "New" or "Revised" License
240 stars 21 forks source link

Performance tuning? #51

Closed ColinH closed 4 years ago

ColinH commented 4 years ago

In order to gauge the performance of the PEGParser library with a simple-but-real-world grammar I modified the included example/json_parser.cpp to

Using the three data files from the Native JSON Benchmark as inputs I then compared the performance of

  1. the resulting modified example/json_parser.cpp,
  2. the benchmark utility from taoJSON that uses the example JSON grammar from the PEGTL,
  3. the benchmark utility from taoJSON that uses the optimised-but-more-complicated grammar from that library and also builds the in-memory JSON representation.

Initial results show that program 1 is about 1000-5000x slower than 2, and 100-1000x slower than 3, which probably means that I'm doing something wrong. Towards tuning things, how can I disable memoization? The readme states that this is possible on a per-rule basis, but I couldn't find any documentation on how this might be achieved. And are there any other things I could change to increase performance?

Have you ever benchmarked your library against itself with memoization enabled and disabled? We, the authors of the PEGTL, have been asked about adding memoization on multiple occasions, which was also the main motivation for doing these benchmarks in the first place, however our feeling was that while in theory linear complexity with memoization is better than the quadratic complexity of the straight-forward recursive-descent approach, in practice the overhead of memoization combined with the fact that most real-world grammars by far don't hit the worst-case complexity let the simpler approach win.

Thank you for any insight.

TheLartians commented 4 years ago

Hey @ColinH, thanks for the issue! Tbh I'm not surprised about the low performance, especially for simple languages such as JSON. PEGParser was not designed for performance, but rather for ease-of-use and rapid prototyping of complex languages, and has the ability to build new parsers at run-time. Because of this it cannot possibly compete with optimised hand written parsers or compile time libraries such as PEGTL.

About memoization, it's a requirement for this project as it allows parsing naive grammars containing left recursion (without rewriting them) and otherwise the time-complexity would get out of hand fairly quickly. I might look into benchmarking and optimising this library in the future but that hasn't been a priority so far.

That said, I'm currently using PEGParser in production on a mobile App containing thousands of LOC in complex custom programming language (containing both left and right associative binary operators defined recursively) and it's not even close to being a performance bottleneck there. I've actually tried implementing a subset of the grammar in a third-party, non-memoizing parsing library and immediately had noticeable performance problems with complex input strings.

For PEGTL though, I agree that memoization should not be considered. In the readme, the project is stated as aiming for "efficiency and simplicity", one of which would likely need to be sacrificed for adding memoization. Most grammars can instead easily be rewritten to achieve great performance on recursive-descent parsers. (after being prototyped in this library, for example 😉)

ColinH commented 4 years ago

Hi and thanks for your response! Right, different design goals, different solutions and trade offs, and I'm happy that you confirm our intuition regarding memoization. Our "solution" to left-recursion is to ship a "grammar analysis" that detects (probably) all grammar rules that can be part of an infinite loop without progress, but it is up to the user to fix the issue by changing the grammar. If I ever need to create grammars at runtime I'll take a much closer look at your library as that is the one thing totally incompatible with our approach.

TheLartians commented 4 years ago

Glad to hear, the left-recursion detection sounds like a great solution to avoid runtime bugs! If I ever need high performance parsing I'll be sure to check out PEGTL as well. :)

ColinH commented 4 years ago

What is useful about the "grammar analysis" is that it detects a lot more than left-recursion, it also catches mistakes like e.g. foo ::= bar+ with bar ::= ' '*. Happy hacking :-)