bskinn / pent

pent Extracts Numerical Text -- Mini-language driven parser for structured numerical data in text
MIT License

Fix lookbehind/lookahead in optional-line token #89

Closed bskinn closed 5 years ago

bskinn commented 5 years ago

While #51 and #52 are okay for now, eventually this should really be fixed.

Consider the following:

>>> prs = pent.Parser(head=("@.foo", "?"), body="#!+.i")
>>> prs.pattern()
'(?P<head>(^|(?<=\\n))[ \\t]*(?<![a-zA-Z0-9deDE+.-])foo(?![a-zA-Z0-9deDE+.-])[ \\t]*($|(?=\\n))\\n?(
^|(?<=\\n))([ \\t]*[ \\t]*)?($|(?=\\n)))\\n?\\n?(?P<body>(^|(?<=\\n))[ \\t]*(?<![a-zA-Z0-9deDE+.-])[
+-]?\\d+([ \\t]+[+-]?\\d+)*(?![a-zA-Z0-9deDE+.-])[ \\t]*($|(?=\\n))(\\n?(^|(?<=\\n))[ \\t]*(?<![a-zA
-Z0-9deDE+.-])[+-]?\\d+([ \\t]+[+-]?\\d+)*(?![a-zA-Z0-9deDE+.-])[ \\t]*($|(?=\\n)))*)'

I think the problematic part is here, near the end of the regex generated from head:

($|(?=\\n))\\n?(^|(?<=\\n))([ \\t]*[ \\t]*)?($|(?=\\n)))\\n?\\n?

Broken apart:

($|(?=\\n))          # Lookahead from the 'foo' line for EOL/EOF
\\n?                 # Newline following the 'foo' line
(^|(?<=\\n))         # Lookbehind for BOL/BOF
([ \\t]*[ \\t]*)?    # Optional whitespace preceding/trailing the null content of the "?" pattern
($|(?=\\n)))         # Lookahead from the optional line for EOL/EOF (HERE'S THE PROBLEM); unmatched paren is from (?P<head> ...
\\n?\\n?             # Optional newline after optional line. There SHOULDN'T(?) be two newlines here???

When the optional line is absent from the text, matching \\n? after the 'foo' line advances the regex engine to the beginning of the first body line. The lookbehind matches just fine there, as does the entirely optional content of the optional line. However, because the engine now sits at the start of the first body line, there is neither a newline nor EOF for the lookahead to match. This lookahead should thus be optional, since the whole line is optional.
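The failure can be reproduced with a reduced pattern using only Python's re module. This is a minimal sketch, not pent's actual generated pattern: the names BOL, EOL, head_line, optional_line and body_line are hypothetical, and the number-boundary assertions are left out for brevity.

```python
import re

# Hypothetical stand-ins for the anchors seen in the generated pattern
BOL = r"(^|(?<=\n))"  # start of text, or just after a newline
EOL = r"($|(?=\n))"   # end of text, or just before a newline

head_line = BOL + r"foo" + EOL
# Content is optional, but both anchors are still mandatory
optional_line = BOL + r"([ \t]*)?" + EOL
body_line = BOL + r"\d+([ \t]+\d+)*" + EOL

pattern = head_line + r"\n?" + optional_line + r"\n?" + body_line

# Blank line present between head and body
with_blank = re.search(pattern, "foo\n\n1 2 3")
# Blank line absent: body follows head directly
without_blank = re.search(pattern, "foo\n1 2 3")

print(with_blank is not None)
print(without_blank is not None)
```

Under these assumptions the first search matches and the second returns None: once \n? has consumed the newline after 'foo', the mandatory EOL assertion of the optional line is tested at the '1', where there is neither a newline nor the end of the text.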

ALTERNATIVELY, it may be that the entire optional line, including its initial lookbehind, trailing lookahead, and trailing \\n?, should be enclosed in a single (...)? construction. A quick test suggests that this may also work. The choice boils down to which is easier to implement, and to whether there's some sort of interaction with the optional newlines between the various line-content matches.
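A rough sketch of the single (...)? construction, again with a reduced pattern rather than pent's real output. BOL, EOL, head_line, optional_line and body_line are hypothetical names, and the boundary assertions around the numbers are omitted.

```python
import re

# Hypothetical stand-ins for the anchors seen in the generated pattern
BOL = r"(^|(?<=\n))"  # start of text, or just after a newline
EOL = r"($|(?=\n))"   # end of text, or just before a newline

head_line = BOL + r"foo" + EOL
body_line = BOL + r"\d+([ \t]+\d+)*" + EOL

# Lookbehind, content, lookahead and trailing newline all sit inside
# one optional group, so the whole line can be skipped as a unit
optional_line = r"(" + BOL + r"[ \t]*" + EOL + r"\n?)?"

pattern = head_line + r"\n?" + optional_line + body_line

texts = (
    "foo\n1 2 3",        # optional line absent
    "foo\n\n1 2 3",      # empty optional line
    "foo\n \t\n1 2 3",   # whitespace-only optional line
    "foo\nbar\n1 2 3",   # non-blank line in between: should not match
)
results = {text: re.search(pattern, text) is not None for text in texts}

for text, matched in results.items():
    print(repr(text), matched)
```

In this reduced form the first three texts match and the last does not. Whether the same holds once the other optional newlines in the full pattern are involved is not something this sketch tests.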

The doubled optional newline at the end of the pattern snip is a separate issue, reported as #88.

bskinn commented 5 years ago

The initial lookbehind needs to be fixed, too: if the optional line is up against EOF, there is no newline for the regex engine to advance past, so the lookbehind has no preceding newline to find and fails.
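One reading of the EOF case, sketched with a reduced pattern and Python's re module only. The names BOL, EOL, head_line, lookahead_only and whole_line are hypothetical and are not taken from pent's source.

```python
import re

# Hypothetical stand-ins for the anchors seen in the generated pattern
BOL = r"(^|(?<=\n))"  # start of text, or just after a newline
EOL = r"($|(?=\n))"   # end of text, or just before a newline

head_line = BOL + r"foo" + EOL

# Only the trailing lookahead is optional; the lookbehind stays mandatory
lookahead_only = head_line + r"\n?" + BOL + r"([ \t]*)?" + EOL + r"?"

# Whole optional line in one optional group, so both assertions can be skipped
whole_line = head_line + r"\n?" + r"(" + BOL + r"[ \t]*" + EOL + r"\n?)?"

newline_at_eof = "foo\n"
no_newline_at_eof = "foo"

lookahead_only_with_newline = re.search(lookahead_only, newline_at_eof)
lookahead_only_at_eof = re.search(lookahead_only, no_newline_at_eof)
whole_line_at_eof = re.search(whole_line, no_newline_at_eof)

print(lookahead_only_with_newline is not None)
print(lookahead_only_at_eof is not None)
print(whole_line_at_eof is not None)
```

With a trailing newline the lookahead-only variant matches. Without one, \n? consumes nothing, the engine is still directly after 'foo', and the mandatory lookbehind fails. Wrapping the whole line in one optional group sidesteps this in the reduced pattern.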