erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License

Problems with alternating #83

Closed cknv closed 7 years ago

cknv commented 8 years ago

I am having a problem with an alternation in a grammar I wrote.

I managed to reduce the case to a simple snippet (where even regex is probably overkill), with a couple of trivial examples:

from parsimonious.grammar import Grammar

G = Grammar(r"""
element = word / wildcard

wildcard = ~"[*\w]+"
word = ~"\w+"
space = ~"\s+"
""")

cases = [
    'toast',
    '*toast',
    't*oast',
    'to*st',
]

for case in cases:
    G.parse(case)

The first case works fine, and so does the second (which puzzles me a bit), but the third and fourth do not: they blow up with an IncompleteParseError, with the remainders being *oast and *st. However, if I change the rule to element = wildcard / word, everything parses just fine. So it seems that while word fails (as expected), the parser does not try wildcard before giving up and raising an error instead.

While I have been able to work around it by deferring the decision of what each element is until after parsing, it would be nice if I could tighten up my grammar a little.

Of course I cannot quite rule out that I am just doing something stupid, so if it is just me, I hope someone else can learn from my mistake :)

jwhitlock commented 8 years ago

Add a "one or more" rule to parse the full strings:

G = Grammar(r"""
elements = element+
element = word / wildcard
...
cknv commented 8 years ago

While that does indeed make the reduced example parse, it does not really solve the case behind this issue. Perhaps I reduced the example too much :)

My actual grammar is more like this (although still simplified):

G = Grammar(r"""
clauses = elements / element
elements = element (space element)+
element = word / wildcard
...

The thing is, it is not quite enough for me to just get it parsing (sorry for not mentioning this at first); I also need the whole wildcard parsed as one node. Currently I also rely on the expr_name of the node in my post-processing steps, which I must admit I implemented before I had really read about node visiting (something I completely missed at first), but maybe that requirement can be dropped - I am not entirely sure.

cknv commented 7 years ago

Just to clarify the problem I tried to describe here, now that I have learned something more about it:

The problem stems from partial matching. Say I have the text parsimon*. That will obviously match the "wildcard" regex, but since that regex also consumes all regular words, I would like to try the narrower "word" regex first, and then, if that does not work, fall back and try the next option.

What actually happens is that the narrower word regex consumes part of the text, at which point parsimonious raises an IncompleteParseError, since the regex consumed only part of the text and what follows it is supposed to be some whitespace.

In the end I was able to write a regex that looks ahead, to make sure it consumes the whole element or nothing at all. But that regex sticks out like a sore thumb, as the other regexes I use are rather simple and neat.
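A lookahead of that sort might look like the following (a hypothetical reconstruction, not the actual regex from the grammar): the pattern matches a run of word characters only when it is followed by whitespace or the end of the input, so it consumes the whole element or nothing at all.

```python
import re

# Hypothetical "word" pattern with a lookahead: (?=\s|$) requires the
# matched run to be followed by whitespace or end of input, so a
# partial match like the "t" in "t*oast" is rejected outright.
WORD = re.compile(r"\w+(?=\s|$)")

print(WORD.match("toast"))    # matches the whole word
print(WORD.match("t*oast"))   # None: "t" alone is not a full element
```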

I am no expert on writing parsers, so I do not know whether my idea for fixing this is hard to implement or maybe even plain stupid, but instead of raising on an incomplete parse, perhaps the parser could try the next option until one of them works - or fail, of course, if none of them do.

In any case, I hope I better explained what caused me to write this in the first place.

erikrose commented 7 years ago

First, a tip: put an "r" in your regexes (~r'...'), and you'll avoid surprises where shortcuts like \w don't mean what you think they mean. You might be okay at the moment.

So, it sounds like you want to parse a series of space-delimited words. What you need, then, is a production at some level that represents the whole word, like token below. You can do this by including the space. Here's a sketch; I haven't tested it, but it should give you the idea.

tokens = token*
token = (word / wildcard) end_of_token
end_of_token = ~r"\s*|$"

wildcard = ~r"[*\w]+"
word = ~r"\w+"

Please reopen this if you still have trouble!