dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.
534 stars 66 forks source link

Ambiguous grammar handling #273

Open eraxillan opened 4 years ago

eraxillan commented 4 years ago

Hello and thanks for awesome library!

I'm trying to implement Qt's qmake project files parser from scratch.
But i've found two issues:

So the qmake language cannot be described using PEG.
However, i've managed to reinvent a wheel add "preprocessor" code eliminating those issues using naive parser.
It's just replaces "delimiter colon" with "@" char, and enquote regular expressions.
The resulting code already can be parsed using PEG/pegged.

Does pegged have some built-in stuff to help fighting with such kind of ambiguous grammars?
Or just i chose the wrong tool. Thanks!

P.S. Link to my project using pegged

WebFreak001 commented 4 years ago

use | instead of / to match longest match instead of first match (for ambiguous rules) or improve your grammar. Would need a more concrete example what you parse and what you expect (and how it can be different depending on other code) If it is dependent on semantic meaning you might be able to make it work with semantic actions.

for your regex you might want to try some rules like

Regex < RegexPart+
RegexPart < RegexPart2 RegexCountModifier?
RegexPart2 < RegexCaptureGroup / RegexBracketGroup / RegexChar / RegexMatchStart / RegexMatchEnd
RegexCountModifier < '?' / '*' / '+' / '{' UInt ',' UInt '}' / '{' UInt ',' '}' / '{' UInt '}'
RegexChar <~ backslash (
            / 'x' Hex Hex
            / 'u' Hex Hex Hex Hex
            / 'U' Hex Hex Hex Hex Hex Hex Hex Hex
            / .
           )
         / !RegexSpecialChar .
RegexMatchStart < '^'
RegexMatchEnd < '$'
RegexSpecialChar < '?' / '^' / '$' / '*' / '+' / '{' / '}' / '[' / ']' / '(' / ')'
RegexBracketGroup < '[' RegexInvertGroup? RegexIncludeClosingBracket? RegexBracketPart* ']'
RegexInvertGroup < '^'
RegexIncludeClosingBracket < ']'
RegexBracketPart < RegexBracketRange / RegexBracketChar
RegexBracketRange < RegexBracketChar '-' RegexBracketChar
RegexBracketChar < !']' .
RegexCaptureGroup < '(' RegexGroupModifier? RegexPart+ ')'
RegexGroupModifier < RegexPositiveLookahead / RegexNegativeLookahead / RegexNonCaptureGroup
RegexPositiveLookahead < '?='
RegexNegativeLookahead < '?!'
RegexNonCaptureGroup < '?:'
UInt <~ [0-9]+

I guess. (not well tested, best to make your own grammar to fully understand) I don't see how can it contain unbalanced parentheses?

Based my grammar mostly on a very lightweight PCRE syntax now, not sure what you have.

eraxillan commented 4 years ago

@WebFreak001 thanks for your answer! and sorry for such delay.

First of all, my grammar description.
1) Unbalanced parenthesis example:

m = $$replace(out, ".*\\$\\(EXPORT_([^)]+)\\).*", \\1)
#                                                         ^
#                                                       oh no! unbalanced
# this code replace the value of 'out' variable using given regex and set it to 'm'

So, you think i should use regex grammar for such kind of parameter?

2) Scope statement ambiguity example:

win32 : msvc                    :               CONFIG += win_stuff
#       ^                              ^
#  logical AND      end of logical statement
# this code add value 'win_stuff' to CONFIG variable only if bool condition 'win32 AND msvc' is true

Scope expression described on line 181:

 Scope             <- BooleanExpression ScopeMainBranch ScopeElseIfBranch* ScopeElseBranch?

Well, i'll think where i can use | operator here.

WebFreak001 commented 4 years ago

for 1 you should implement a proper string parser (as you have the value quoted, which would mean it couldn't be unbalanced parentheses if done properly), look at https://github.com/PhilippeSigaud/Pegged/blob/master/pegged/examples/strings.d

if your values are unquoted, look at my regex grammar I posted which will make it balanced too

for 2: if it is always in this order (everything before last colon is logical and) then it is easy to describe as rule using lookaheads like

Config     < Statement+
Statement  < Checks ":" Expression
                            # v  the & does the magic
Checks     < Check (":" Check &":")*
Check      < identifier
Expression < Variable Operator Value

if it's not always in this order, and dependent on the name of the identifier you write in there (being any arbitrary variable you defined earlier) you will either parse it as one list and resolve it later as part of your libraries API or try to implement it using semantic actions.