bskinn / pent

pent Extracts Numerical Text -- Mini-language driven parser for structured numerical data in text
MIT License

Refactor output regex for whitespace to a variable #26

Open bskinn opened 6 years ago

bskinn commented 6 years ago

It's bad practice to leave the whitespace regex as a magic string.

~Can't imagine it ever needing to be anything other than `[ \t]`, but... there's always an exception.~

~Might be eventually worth allowing user customization of what's whitespace and what's not. Seems unlikely to be useful though.~

GAMESS interatomic distances have asterisks scattered between the numeric values. Adding `*` to the whitespace definition should allow ignoring them during parsing while still safely using a `#!++f` token for the data.
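A rough sketch of the idea, not pent's actual internals: `WHITESPACE_CHARS`, `whitespace_class`, and `extract_floats_run` are hypothetical names, the float regex is a simplified stand-in for whatever the `#!++f` token really compiles to, and the sample line is made up rather than taken from real GAMESS output.

```python
import re

# Hypothetical constant replacing the hard-coded "[ \t]" magic string.
WHITESPACE_CHARS = " \t"


def whitespace_class(extra=""):
    """Build a regex character class from the base whitespace plus extras."""
    return "[" + re.escape(WHITESPACE_CHARS + extra) + "]"


# Simplified stand-in for a positive-float pattern (assumption, not pent's regex).
FLOAT = r"\d+\.\d+"


def extract_floats_run(line, extra=""):
    """Return the floats on a line made only of floats and 'whitespace'.

    Returns None if the line contains anything else.
    """
    ws = whitespace_class(extra)
    pattern = rf"^{ws}*{FLOAT}(?:{ws}+{FLOAT})*{ws}*$"
    if re.match(pattern, line) is None:
        return None
    return re.findall(FLOAT, line)


line = "  0.9572 *  1.5139 *  0.9572"  # made-up sample

default_result = extract_floats_run(line)            # None: "*" is not whitespace
starred_result = extract_floats_run(line, extra="*")  # asterisks are skipped
```

With the default whitespace the line fails to match; once `*` is added, the same line yields the three values.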

Need to think carefully about whether a change to the regex whitespace should also propagate through to the string splitting. It may be useful to explicitly allow separate whitespace specs for the regex and for the splitting, though.
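To illustrate why the two need to be considered together, here is a small sketch. It assumes the matched text is split into values in a separate step after the regex match; the captured string and the `split_values` helper are hypothetical.

```python
import re

# Text as it might be captured once "*" counts as whitespace in the regex.
captured = "0.9572 *  1.5139 *  0.9572"

# Plain str.split() only knows about real whitespace, so "*" survives as a token.
naive = captured.split()
# ['0.9572', '*', '1.5139', '*', '0.9572']


def split_values(text, split_chars=" \t"):
    """Split on runs of the given characters (hypothetical helper)."""
    char_class = "[" + re.escape(split_chars) + "]+"
    return [tok for tok in re.split(char_class, text) if tok]


# Passing the same characters used by the regex keeps matching and splitting in sync.
synced = split_values(captured, split_chars=" \t*")
```

Keeping `split_chars` as a separate argument is one way to allow different specs for the regex and for the splitting, at the cost of letting the two drift apart.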

(It may also be necessary to customize what's considered whitespace on a pattern-by-pattern basis. E.g., in this GAMESS example, it may be desirable to treat the asterisks as whitespace only within the internal data block.)

Probably should have both `add_to_whitespace` (single-char append, with regularization) and `set_whitespace` (completely replaces the string; probably without regularization, or with optional regularization), to allow, e.g., multi-char sequences like ellipses: `(\.\.\.|[ \t])`.
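One possible shape for that pair of methods, as a sketch only. The `WhitespaceSpec` class, its `pattern` property, and the `regularize` keyword are hypothetical, and "regularization" is assumed here to mean regex-escaping the input. Making the spec an object would also let each pattern carry its own instance.

```python
import re


class WhitespaceSpec:
    """Hypothetical holder for the regex matching one unit of whitespace."""

    def __init__(self):
        self._chars = " \t"    # base whitespace characters
        self._override = None  # raw pattern, set only by set_whitespace

    def add_to_whitespace(self, char):
        """Append a single character; it is escaped when the pattern is built."""
        if len(char) != 1:
            raise ValueError("add_to_whitespace expects a single character")
        if self._override is not None:
            raise RuntimeError("whitespace was replaced via set_whitespace")
        if char not in self._chars:
            self._chars += char

    def set_whitespace(self, pattern, *, regularize=False):
        """Replace the whitespace pattern wholesale, optionally escaping it."""
        self._override = re.escape(pattern) if regularize else pattern

    @property
    def pattern(self):
        if self._override is not None:
            # Non-capturing group so alternations stay self-contained.
            return f"(?:{self._override})"
        return "[" + re.escape(self._chars) + "]"


stars = WhitespaceSpec()
stars.add_to_whitespace("*")

dots = WhitespaceSpec()
dots.set_whitespace(r"\.\.\.|[ \t]")  # ellipsis or ordinary whitespace
parts = re.split(dots.pattern + "+", "1.0 ... 2.5")
```

Here `add_to_whitespace` refuses to run after `set_whitespace`, since there is no safe way to append a character to an arbitrary user-supplied regex; that restriction is a design guess, not a requirement from the issue.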