Open rowlesmr opened 2 years ago
Char*
in SingleQuotedString = single_quote Char* single_quote WhiteSpace
appears to be the thing that is "too greedy" because your second single_quote
has nothing to match on at the end of your string ("column 9", the last index of the string). According to some resources I read
As a regular expression
\[.*\]
matches the longest possible text between '[' and ']'. As a PEG it never matches anything, because a PEG is deterministic:.*
consumes the rest of the input, so\]
never matches. As a PEG this needs to be written as:\[ ( !\] . )* \]
(or\[ @ \])
. [0]Note that the regular expression does not behave as intended either: in the example
*
should not be greedy, so\[.*?\]
should be used instead.
It looks like you should make use of lookaheads to get the behaviour you need; something like
SingleQuotedString = single_quote (!end_single_quote Char)* end_single_quote
end_single_quote = single_quote Whitespace
I see. Make the two token search for "' " into a single token search by making it it's own match.
Ta a lot.
I'll try it out.
Note that this is regular, so you could do this with just a re
.
import re
pattern = re.compile(
r"""
( # start capture group
(?<!\w)' # Starting ', which should not be preceded by a word character.
# Uses negative lookbehind.
(?:[^']|'(?=\w))* # A non-' character, or a ' which is followed by a word character.
# Uses positive lookahead.
'(?!\w) # Closing ', not followed by a word character.
# Uses negative lookahead.
) # end capture group
""",
flags=re.VERBOSE
)
assert pattern.findall("'do' 'don't'") == ["'do'", "'don't'"]
assert pattern.findall("'don't'") == ["'don't'"]
assert pattern.findall("'don't") == []
(Note also that this general approach may break on some slang words which end in ', like goin'.)
I've just found this library, and I'm trying to implement parsimonious to at least tokenise my input files. There is a grammar I'd like to implement, but the matching seems to be a little too greedy.
One example of this nature is below.
Strings contained in quotes are able to contain the quote character itself. To be a valid string-termination quote mark, the quote mark must be followed by whitespace.
This fails with
Implementing the first rule as a straight regex does work
Where can I start looking on how to fix this behaviour (if indeed it is a bug...)