erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License
1.79k stars 126 forks source link

Matching being too greedy #185

Open rowlesmr opened 2 years ago

rowlesmr commented 2 years ago

I've just found this library, and I'm trying to implement parsimonious to at least tokenise my input files. There is a grammar I'd like to implement, but the matching seems to be a little too greedy.

One example of this nature is below.

Strings contained in quotes are able to contain the quote character itself. To be a valid string-termination quote mark, the quote mark must be followed by whitespace.

from parsimonious.grammar import Grammar
quote_grammar = Grammar(
    r"""
    SingleQuotedString = single_quote Char* single_quote WhiteSpace
    WhiteSpace = ( SP / HT / EOL )+
    single_quote = "'"  

    Char = ~"[a-zA-Z0-9' \t]"

    EOL = ~'[\n\r\f]+'
    SP = ' '
    HT = '\t'
    """
)

print(quote_grammar.parse("'don't' "))

This fails with

parsimonious.exceptions.ParseError: Rule 'single_quote' didn't match at '' (line 1, column 9).

Implementing the first rule as a straight regex does work

quote_grammar = Grammar(
    r"""
    SingleQuotedString = ~"['][a-zA-Z0-9' \t]*['][ \t\n\r\f]+"
    """

 -->
<RegexNode called "SingleQuotedString" matching "'don't' ">

Where can I start looking on how to fix this behaviour (if indeed it is a bug...)

comalice commented 2 years ago

Char* in SingleQuotedString = single_quote Char* single_quote WhiteSpace appears to be the thing that is "too greedy" because your second single_quote has nothing to match on at the end of your string ("column 9", the last index of the string). According to some resources I read

As a regular expression \[.*\] matches the longest possible text between '[' and ']'. As a PEG it never matches anything, because a PEG is deterministic: .* consumes the rest of the input, so \] never matches. As a PEG this needs to be written as: \[ ( !\] . )* \] (or \[ @ \]). [0]

Note that the regular expression does not behave as intended either: in the example * should not be greedy, so \[.*?\] should be used instead.

It looks like you should make use of lookaheads to get the behaviour you need; something like

SingleQuotedString = single_quote (!end_single_quote Char)* end_single_quote
end_single_quote = single_quote Whitespace

[0] https://nim-lang.org/docs/pegs.html

rowlesmr commented 2 years ago

I see. Make the two token search for "' " into a single token search by making it it's own match.

Ta a lot.

I'll try it out.

lucaswiman commented 2 years ago

Note that this is regular, so you could do this with just a re.

import re
pattern = re.compile(
    r"""
        (  # start capture group
            (?<!\w)'  # Starting ', which should not be preceded by a word character.
                      # Uses negative lookbehind.
            (?:[^']|'(?=\w))*  # A non-' character, or a ' which is followed by a word character.
                               # Uses positive lookahead.
            '(?!\w)  # Closing ', not followed by a word character.
                     # Uses negative lookahead.
        )  # end capture group
    """,
    flags=re.VERBOSE
)
assert pattern.findall("'do' 'don't'") == ["'do'", "'don't'"]
assert pattern.findall("'don't'") == ["'don't'"]
assert pattern.findall("'don't") == []

(Note also that this general approach may break on some slang words which end in ', like goin'.)