erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License

Control characters cause parsing errors #163

Closed: davaya closed this issue 4 years ago

davaya commented 4 years ago

I have a grammar that parses a file containing multiple "records" separated by a sentinel character (pipe | in the example code).

One comment on Issue #156 suggests using a Greek delta character as a sentinel to minimize the chance of collisions with parsed text, but ASCII and Unicode define a Record Separator (RS) character for exactly that purpose, which avoids collisions with ordinary printable text (https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text).
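As a minimal sketch of the ASCII-delimited idea, using only the standard library: records terminated with the Record Separator, chr(30), can be split back apart even when they contain pipes or other punctuation. The sample records and variable names here are made up for illustration.

```python
# Illustrative sketch only: the records below are made-up sample data.
RS = chr(30)  # ASCII/Unicode Record Separator, U+001E

records = ["a1:b1", "a2:b|2", "a3:message$ with! punctuation"]

# Terminate every record with RS, the way the pipe terminates records above.
blob = "".join(rec + RS for rec in records)

# Split on RS; the final terminator leaves one empty string, which is dropped.
parsed = [rec for rec in blob.split(RS) if rec]

print(parsed == records)  # True
```

The pipe inside the second record survives because only RS is treated as a delimiter.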

But when I try to use RS instead of the pipe as the separator, parsimonious fails. With the pipe separator, the example code gives the expected set of <Node called "x" matching ... messages. With the RS separator, it raises parsimonious.exceptions.ParseError: Rule 'rs' didn't match at '' (line 10, column 1).

from parsimonious.grammar import Grammar
text1 = """

a1:b1|a2:b2|
a3:b3|
a4:message$ with! punctuation[)]/.^
a5:more)

(*&^%$#@@D!|

"""
gram1 = """
    lines = line+
    line  = ~"[^|]*" rs
    rs    = "|" _
    _     = ~"[ \\n\\r\\t]*"
    """
text2 = text1.replace('|', r'\x1e')
gram2 = gram1.replace('|', r'\x1e')
print('===1:\n', Grammar(gram1).parse(text1))
print('===2:\n', Grammar(gram2).parse(text2))

Is there any reason why one should work and not the other?

davaya commented 4 years ago

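Closed due to operator error: r'\x1e' is not the same as chr(30), and the latter works correctly. The raw string r'\x1e' is four characters (backslash, x, 1, e), while chr(30) is the single RS character, so text1.replace('|', r'\x1e') never puts an actual RS into the text.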

repr(chr(30)) = '\'\\x1e\''

repr(r'\x1e') = '\'\\\\x1e\''
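A small sketch of the difference, using only the standard library. It uses Python's re module as a stand-in for the grammar's regex term, which is an assumption about how the pattern gets evaluated. It shows why the failure is reported at the very end of the input: the text contains no real RS, so a pattern like [^\x1e]* consumes everything and nothing is left for the separator rule to match.

```python
import re

raw = r'\x1e'   # raw string: backslash, 'x', '1', 'e'
rs = chr(30)    # one character, identical to the non-raw literal '\x1e'

print(len(raw), len(rs))  # 4 1
print(rs == '\x1e')       # True

text = "a1:b1|a2:b2|\n"
text_raw = text.replace('|', raw)  # inserts the four literal characters
text_rs = text.replace('|', rs)    # inserts a real Record Separator

# The regex engine itself interprets \x1e in the pattern as the RS character.
not_rs = re.compile(r'[^\x1e]*')

# No real RS in text_raw, so the match runs to the end of the string.
print(not_rs.match(text_raw).end() == len(text_raw))  # True

# With a real RS, the match stops at the first record boundary.
print(not_rs.match(text_rs).group())  # a1:b1
```

In the original script, replacing r'\x1e' with chr(30) (or the non-raw '\x1e') in the text2 line is the corresponding change.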