bskinn / pent

pent Extracts Numerical Text -- Mini-language driven parser for structured numerical data in text
MIT License

Abandoning Optional & ZeroOrMore tokens for initial development #33

Closed by bskinn 6 years ago

bskinn commented 6 years ago

Moots #15, #24, #27, #31.

The problem with these is that the overall regex has to be constructed differently depending on whether or not there is actually content to capture. And, by definition, it's impossible to know at regex-construction time whether a given piece of text to be parsed contains the optional item.

For example, consider the following token line:

~ #..i #?.g #..I ~

If the optional 'general' number is present, then full wordification of all three number values should be included. If it is absent, though, and there is only a single space between the two integers, then the second mandatory whitespace causes the pattern match to fail. Making the whitespace that follows an optional token optional as well ([ \t]*) fixes this, but it caused problems with other test cases (I can't remember which, now).
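As a rough sketch of the failure described above, here is a plain `re` reproduction. The patterns are hand-written stand-ins for illustration only; they are not the regex that pent generates, and the names `INT`, `GEN`, `WS`, `OPT_WS`, `strict` and `relaxed` are hypothetical. The token line is read as "integer, optional general number, integer", and the surrounding `~` tokens are left out for brevity. The last case shows one conceivable way the relaxed whitespace could misbehave; the issue does not say which problems were actually seen.

```python
import re

# Hypothetical hand-written fragments; NOT pent's generated patterns.
INT = r"[-+]?\d+"
GEN = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eEdD][-+]?\d+)?"
WS = r"[ \t]+"      # mandatory whitespace after a token
OPT_WS = r"[ \t]*"  # whitespace made optional after the optional token

# Every token, including the optional one, is followed by mandatory whitespace.
strict = re.compile(rf"^({INT}){WS}({GEN})?{WS}({INT})$")

# Same pattern, but the whitespace after the optional token is optional.
relaxed = re.compile(rf"^({INT}){WS}({GEN})?{OPT_WS}({INT})$")

# Optional value present: the strict pattern captures all three numbers.
print(strict.match("1 2.5 3").groups())

# Optional value absent, single space: the second mandatory
# whitespace has nothing left to consume, so the match fails.
print(strict.match("1 3"))

# Relaxing the trailing whitespace lets the same line match.
print(relaxed.match("1 3").groups())

# One possible side effect: with no whitespace required between the
# optional value and the final integer, "23" can be split into "2" and "3".
print(relaxed.match("1 23").groups())
```

The split in the last case happens because the regex engine prefers to backtrack inside the optional group before giving up on it entirely, so an adjacent run of digits can be divided between the optional capture and the following token.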

There may be an answer here, but in the interest of not stalling development over potentially rare use cases, I'm abandoning Optional and ZeroOrMore quantities for now.