Open MrTomKimber opened 4 years ago
@MrTomKimber I made a repository for benchmarking various Python libraries against each other: https://github.com/goodmami/python-parsing-benchmarks
I have a contender in the benchmarks (pe) so I'm not disinterested, but I welcome improvements to the other parsers so each can have its best showing. Parsimonious currently only has a JSON parser, but I hope to eventually have parsers for the other two tasks (basic arithmetic and INI files) as well as add more tasks. Contributions are welcome!
I'm still developing the documentation for pe, but it currently has a list of common patterns for basic things like strings and numbers. Is this what you had in mind? If so, perhaps Parsimonious can create a similar list. Another strategy is exemplified by Lark, which maintains a list of common patterns as a grammar file that is importable by other grammars.
pe, Lark, pyparsing, and other parsing libraries also all have examples/ subdirectories containing reference grammars. I find these very useful, as well.
@goodmami - yes, it's kind of close to what I'm looking for - common themes like basic arithmetic could be written in some standard form, such that if you wanted to implement that functionality, you could copy/paste that block of grammar (subject to watching for namespace clashes and other overlapping problems) and have some of the hard work done for you.
The difficulty I find myself wrestling with for the most part with parsimonious is rewriting grammar trees so that they avoid being left-recursive - there's probably a knack to that, but it's a knack I'd be happy to leave unlearned if there were a repository of common parsing patterns to call on - even if it's just to find working examples that people have wrestled with already.
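For what it's worth, the usual knack for the simplest case (a left-recursive binary operator) is to replace the recursion with repetition and fold the results from the left. Below is a minimal sketch in plain Python rather than Parsimonious, so the tokenizer and the function names (`tokenize`, `parse_expr`, `parse_term`, `evaluate`) are illustrative assumptions, not part of any library. The grammar in the comments is generic PEG-style notation.

```python
import re

# Left-recursive form (a naive top-down PEG parser recurses forever on it):
#   expr <- expr ("+" / "-") term / term
# Equivalent form using repetition instead of recursion:
#   expr <- term (("+" / "-") term)*
#   term <- NUMBER / "(" expr ")"

TOKEN = re.compile(r"\s*(\d+|[-+()])")


def tokenize(text):
    """Split the input into integer, operator, and parenthesis tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens, i=0):
    """expr <- term (("+" / "-") term)*, folded left to keep left-associativity."""
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in ("+", "-"):
        op = tokens[i]
        rhs, i = parse_term(tokens, i + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, i


def parse_term(tokens, i):
    """term <- NUMBER / "(" expr ")" """
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[i]
    if tok == "(":
        value, i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected ')'")
        return value, i + 1
    if tok.isdigit():
        return int(tok), i + 1
    raise ValueError(f"unexpected token {tok!r}")


def evaluate(text):
    """Parse and evaluate a whole expression, rejecting trailing tokens."""
    tokens = tokenize(text)
    value, i = parse_expr(tokens)
    if i != len(tokens):
        raise ValueError(f"unexpected token {tokens[i]!r}")
    return value
```

The left fold in `parse_expr` is what keeps `10 - 3 - 2` meaning `(10 - 3) - 2`; a right-recursive rewrite (`expr <- term "-" expr`) would also terminate but would group the operands the wrong way round.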
I'll take a closer look at your repo - it could well be the case that there are solutions in there which, even if they are built for different tools, could be transpiled into parsimonious form - and, if they too have this left-recursive constraint, they should provide insight into common solutions to what is likely to be a fairly small domain of common use-case patterns.
Further - common regex formats for strings, decimals etc. are a helpful resource - especially in terms of dealing with escape characters, masking quotation marks and other such vagaries. I tend to try to skirt some of this problem by applying a few rounds of pre-processing find-replace to explicitly tokenise otherwise hard-to-parse content - but that's probably cheating.
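As an illustration of the kind of pattern meant here, the following is a small sketch using only Python's standard `re` module. The names `DQ_STRING` and `DECIMAL` are made up for this example, and the patterns are one reasonable choice rather than a standard: the string pattern treats a backslash followed by any single character as an escape, and the decimal pattern accepts an optional sign, fraction, and exponent.

```python
import re

# Double-quoted string: a run of ordinary characters (anything except a
# quote or backslash) or escape pairs (a backslash plus any one character).
# Without re.DOTALL, an escaped newline is not accepted.
DQ_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

# Decimal number: optional sign, then digits with an optional fraction
# (or a bare fraction such as ".5"), then an optional exponent.
DECIMAL = re.compile(r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?')


def find_strings(text):
    """Return every double-quoted literal in text, quotes included."""
    return DQ_STRING.findall(text)
```

Because an escaped quote is consumed as a single escape pair, it never terminates the literal, which removes the need for a find-replace pre-processing pass to mask quotation marks.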
Thanks for the reply.
I've only written a few Parsimonious grammars so I'd be very happy to have someone more experienced contribute implementations for the other benchmark tasks. Even for the JSON implementation I already have, it is the second slowest after pyparsing (which I am not an expert user of, either), so I'd like to think there are ways to make it better.
Regarding left-recursive grammars, yes, it's not always immediately obvious when something is potentially left-recursive. Some parsers, such as Lark, are bottom-up parsers and have no trouble with left-recursion. There are also strategies the parser can use for handling left-recursion, such as using memoization to escape the loop (this strategy is employed by pegen).
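To make the memoization idea concrete, here is a simplified sketch of the "seed and grow" approach in plain Python. It is not pegen's actual code: the decorator name `memoize_left_rec`, the `(value, end_position)` result convention, and the toy rule are all assumptions made for this illustration, and whitespace is not handled. The memo is first seeded with a failure so that the recursive call at the same position returns immediately, and the rule is then re-run for as long as each pass consumes more input than the last.

```python
import re

NUMBER = re.compile(r"\d+")


def number(text, pos):
    """Match an integer at pos; return (value, end) or None."""
    match = NUMBER.match(text, pos)
    return (int(match.group()), match.end()) if match else None


def memoize_left_rec(rule):
    """Let a directly left-recursive rule terminate by growing a seed."""
    memo = {}

    def wrapper(text, pos):
        key = (text, pos)
        if key in memo:
            return memo[key]
        # Seed with failure: the recursive call at this position now fails
        # instead of looping, so the rule falls through to its base case.
        memo[key] = None
        best = None
        while True:
            result = rule(text, pos)
            # Stop once a pass fails or no longer consumes more input.
            if result is None or (best is not None and result[1] <= best[1]):
                break
            best = result
            memo[key] = best  # the next pass builds on this longer match
        return best

    return wrapper


@memoize_left_rec
def expr(text, pos):
    """expr <- expr "-" NUMBER / NUMBER  (left-recursive on purpose)"""
    left = expr(text, pos)  # served from the memo, so no infinite recursion
    if left is not None:
        value, end = left
        if text.startswith("-", end):
            rhs = number(text, end + 1)
            if rhs is not None:
                return value - rhs[0], rhs[1]
    return number(text, pos)
```

With this in place the grammar can stay in its natural left-recursive form and still produce left-associative results, at the cost of re-running the rule once per operator.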
If you're looking to transpile grammars from other frameworks, I'd start with pe, as it is also PEG-based. Maybe pyparsing is as well, but it has many special extensions and its Python-based syntax would be harder to convert, I think. Pe also has the left-recursive constraint, but it handles value transformations a bit differently because it doesn't build a tree.
I've written a couple of different grammars now, and notice that there's a degree of overlap and oft-repeated patterns - e.g. constructing lists from elements, or capturing infix mathematical or logical evaluations.
It would be great to be able to browse through some common libraries - getting ideas on how others have approached similar issues, or even better, download grammars built against common standards - e.g. ANSI SQL, C standards, Python, or other programming languages etc.
Is anyone aware of such a resource, and would it be possible to construct one as an adjunct to this library?