erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License
1.79k stars 126 forks source link

Or operator issue #227

Closed wbhart closed 1 year ago

wbhart commented 1 year ago

Thanks for this library, but I'm having a little problem with a simple grammar which seems to ignore the or operator (/).

Here is the little grammar I constructed:

parsehypothesis = Grammar( """ hypothesis = sum / union / elem sum = (var "+" )* var union = (var "\\\\cup" )* var elem = var "\\\\in" var var = ~"[A-Z][A-Z]*" = ~r"\s*" """)

It will accept inputs of the form:

A + B

but not of the form

A \cup B

or of the form

A \in B

I am sure I'm doing something very stupid, and I'm wondering if someone could point me in the right direction. If I switch the order of the possibilities in "hypothesis" it will always allow the first one, but not the other two.

lucaswiman commented 1 year ago

Once a node matches, PEGs don't do backtracking in the same way as CFGs. So A \cup B is matching sum on the A part, then saying the rule matched and consumed all the text, but there's still some text remaining. I'd suggest making the operator non-optional and considering a single variable as a different kind of expression:

parse_hypothesis = Grammar(
r"""
hypothesis = sum / union / elem / var
sum = (var _ "+" _)+ var
union = (var _ "\\cup" _)+ var
elem = var _ "\\in" _ var
var = ~"[A-Z][A-Z]*"
_ = ~r"\s*"
""")

(As an aside, the \escaping is much less counterintuitive when you pass grammars an r"""string""")

That parses all these examples:

>>> parse_hypothesis.parse("A+B+C")
s = 'A+B+C'
Node(<OneOf hypothesis = sum / union / elem / var>, s, 0, 5, children=[Node(<Sequence sum = (var _ '+' _)+ var>, s, 0, 5, children=[Node(<Quantifier (var _ '+' _)+>, s, 0, 4, children=[Node(<Sequence var _ '+' _>, s, 0, 2, children=[RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 0, 1), RegexNode(<Regex _ = ~'\\s*'u>, s, 1, 1), Node(<Literal '+'>, s, 1, 2), RegexNode(<Regex _ = ~'\\s*'u>, s, 2, 2)]), Node(<Sequence var _ '+' _>, s, 2, 4, children=[RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 2, 3), RegexNode(<Regex _ = ~'\\s*'u>, s, 3, 3), Node(<Literal '+'>, s, 3, 4), RegexNode(<Regex _ = ~'\\s*'u>, s, 4, 4)])]), RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 4, 5)])])
>>> parse_hypothesis.parse("AA \\cup BB")
s = 'AA \\cup BB'
Node(<OneOf hypothesis = sum / union / elem / var>, s, 0, 10, children=[Node(<Sequence union = (var _ '\\cup' _)+ var>, s, 0, 10, children=[Node(<Quantifier (var _ '\\cup' _)+>, s, 0, 8, children=[Node(<Sequence var _ '\\cup' _>, s, 0, 8, children=[RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 0, 2), RegexNode(<Regex _ = ~'\\s*'u>, s, 2, 3), Node(<Literal '\\cup'>, s, 3, 7), RegexNode(<Regex _ = ~'\\s*'u>, s, 7, 8)])]), RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 8, 10)])])
>>> parse_hypothesis.parse("A")
s = 'A'
Node(<OneOf hypothesis = sum / union / elem / var>, s, 0, 1, children=[RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 0, 1)])
>>> parse_hypothesis.parse("A \\in B")
s = 'A \\in B'
Node(<OneOf hypothesis = sum / union / elem / var>, s, 0, 7, children=[Node(<Sequence elem = var _ '\\in' _ var>, s, 0, 7, children=[RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 0, 1), RegexNode(<Regex _ = ~'\\s*'u>, s, 1, 2), Node(<Literal '\\in'>, s, 2, 5), RegexNode(<Regex _ = ~'\\s*'u>, s, 5, 6), RegexNode(<Regex var = ~'[A-Z][A-Z]*'u>, s, 6, 7)])])
wbhart commented 1 year ago

Ah thanks for the explanation. Somehow I'd forgotten this about PEGs. I'd convinced myself that the reason one needed memoisation to get linear time in a packrat parser was because of backtracking. So apparently I somehow misunderstood something since I last looked at these things, which was years ago.

wbhart commented 1 year ago

Oh, actually I see the issue clearly now. One of the rules was fully matched because an expression is sufficient for that, so it never tried the other rules. That totally makes sense and doesn't contradict what I thought I remembered about backtracking.

Thanks also for the tip about the r""" string and the suggestion for how to modify the grammar to do what I want. That's indeed what I need to do in general.