erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License

Problems using backslash character as literal? #179

Closed jenstroeger closed 2 years ago

jenstroeger commented 3 years ago

It looks like declaring a backslash (\) as a literal causes problems. For example, this simple grammar:

>>> parsimonious.Grammar("""
... cmd = esc ~"[a-zA-Z]"+
... esc = "!"
... """)
Grammar('b'cmd = esc ~"[a-zA-Z]"u+\\nesc = "!"'')

works when I want to use ! as a “command escape” character. However…

>>> parsimonious.Grammar("""
... cmd = esc ~"[a-zA-Z]"+
... esc = "\\"
... """)
Traceback (most recent call last):
  ...
  File ".../lib/python3.7/site-packages/parsimonious/expressions.py", line 122, in parse
    raise IncompleteParseError(text, node.end, self)
parsimonious.exceptions.IncompleteParseError: Rule 'rules' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with 'esc = "\"
' (line 3, column 1).

If I try to use a regex:

>>> parsimonious.Grammar("""
... cmd = esc ~"[a-zA-Z]"+
... esc = ~r"[\\]"
... """)
Traceback (most recent call last):
  ...
  File "/.../lib/python3.7/sre_parse.py", line 526, in _parse
    source.tell() - here)
re.error: unterminated character set at position 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
  File "/.../lib/python3.7/sre_parse.py", line 526, in _parse
    source.tell() - here)
parsimonious.exceptions.VisitationError: error: unterminated character set at position 0

Parse tree:
<Node called "regex" matching "~r"[\]"
">  <-- *** We were here. ***
    <Node matching "~">
    <Node called "spaceless_literal" matching "r"[\]"">
        <RegexNode matching "r"[\]"">
    <RegexNode matching "">
    <Node called "_" matching "
    ">
        <Node called "meaninglessness" matching "
        ">
            <RegexNode matching "
            ">
>>> 

Is this related to https://github.com/erikrose/parsimonious/pull/135?

lucaswiman commented 2 years ago

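I'd recommend always using r"""strings""" to define your grammar. It's also a good idea to use r"strings" for the regex literals inside the grammar. Under the hood, a few levels of backslash escaping get unpacked: first by Python when it evaluates the string literal holding the grammar, then by the grammar's own literal syntax, and, for regexes, once more by the regex engine. So to include a literal backslash, you need to escape it several times. In your first example, the non-raw """string""" collapses "\\" to "\" before the grammar is compiled, which is why the error message shows esc = "\". With r-strings, however, both of your grammars compile: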

>>> g1 = Grammar(r"""
... cmd = esc ~"[a-zA-Z]"+
... esc = ~r"[\\]"
... """)
>>>
>>> g2 = Grammar(r"""
... cmd = esc ~"[a-zA-Z]"+
... esc = "\\"
... """)
>>> g1.parse('\\abcd')
s = '\\abcd'
Node(<Sequence cmd = esc ~'[a-zA-Z]'u+>, s, 0, 5, children=[RegexNode(<Regex esc = ~'[\\\\]'u>, s, 0, 1), Node(<OneOrMore ~'[a-zA-Z]'u+>, s, 1, 5, children=[RegexNode(<Regex ~'[a-zA-Z]'u>, s, 1, 2), RegexNode(<Regex ~'[a-zA-Z]'u>, s, 2, 3), RegexNode(<Regex ~'[a-zA-Z]'u>, s, 3, 4), RegexNode(<Regex ~'[a-zA-Z]'u>, s, 4, 5)])])
>>> g2.parse('\\abcd')
s = '\\abcd'
Node(<Sequence cmd = esc ~'[a-zA-Z]'u+>, s, 0, 5, children=[Node(<Literal esc = '\\'>, s, 0, 1), Node(<OneOrMore ~'[a-zA-Z]'u+>, s, 1, 5, children=[RegexNode(<Regex ~'[a-zA-Z]'u>, s, 1, 2), RegexNode(<Regex ~'[a-zA-Z]'u>, s, 2, 3), RegexNode(<Regex ~'[a-zA-Z]'u>, s, 3, 4), RegexNode(<Regex ~'[a-zA-Z]'u>, s, 4, 5)])])

The docs should maybe be updated to reflect this advice.
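
To make the escaping levels concrete, here is a minimal standard-library sketch that does not call parsimonious at all. It assumes the grammar's quoted literals follow Python's string-escape rules, which the output above suggests but which is an assumption here; `ast.literal_eval` is only a stand-in for that step, and the helper names `evaluates` and `compiles` are hypothetical.

```python
import ast
import re

# Level 1: Python's own string-literal escaping.
plain = 'esc = "\\"'   # non-raw: "\\" collapses to a single backslash
raw = r'esc = "\\"'    # raw: both backslashes reach the grammar text

print(plain)  # esc = "\"
print(raw)    # esc = "\\"

# The quoted literal as it appears in the grammar text after level 1.
plain_literal = plain.split("= ", 1)[1]
raw_literal = raw.split("= ", 1)[1]


def evaluates(literal_source):
    """Level 2 (assumed): unescape a quoted literal using Python's rules.

    Returns the resulting string, or None if the literal is malformed.
    """
    try:
        return ast.literal_eval(literal_source)
    except SyntaxError:
        return None


def compiles(pattern):
    """Level 3: report whether the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The non-raw literal is "\" followed by nothing: an unterminated string.
print(evaluates(plain_literal))
# The raw literal unescapes to exactly one backslash character.
print(evaluates(raw_literal) == "\\")

# Same story for the regex: [\] escapes the closing bracket,
# while [\\] is a character class containing one backslash.
print(compiles("[\\]"))
print(compiles(r"[\\]"))
```

The non-raw variants fail at the second or third level because the first level has already consumed one layer of backslashes, which matches the two tracebacks reported at the top of this issue.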