akoumjian / datefinder

Find dates inside text using Python and get back datetime objects
http://datefinder.readthedocs.org/en/latest/
MIT License
635 stars, 167 forks

Replace regex range splitting with Python logic #102

Closed. AndreyCorelli closed 5 years ago

AndreyCorelli commented 5 years ago

Dear Alec,

Thank you for sharing your code; it does what it is designed to do. However, while using DateFinder I sometimes ran into catastrophic regex backtracking, especially when parsing "tables" in plain text like this: codepile sample.
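For readers unfamiliar with catastrophic backtracking, here is a generic illustration. It is only a sketch: the pattern `(a+)+$` and the helper `time_match` are assumptions made up for this demo and are not taken from datefinder. Nested quantifiers give the engine exponentially many ways to split the input, and a failing match forces it to try them all.

```python
import re
import time

# Illustrative pattern only: nested quantifiers, NOT datefinder's RANGE_REGEX.
NESTED = re.compile(r"(a+)+$")

def time_match(n):
    """Return the seconds spent trying to match n 'a' characters followed by '!'."""
    text = "a" * n + "!"  # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the engine exhausts every split before giving up
    return elapsed

# Each extra character roughly doubles the work, so keep n small.
for n in (10, 14, 18):
    print(n, f"{time_match(n):.4f}s")
```

Exact timings depend on the machine; the point is how quickly they grow as the input lengthens.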

The root of the problem was RANGE_REGEX. I have replaced that logic with a simple split of the source text on the "to" / "through" keywords. I've also simplified the main regex (DATE_REGEX) a bit.
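The idea of splitting on range keywords could look roughly like the sketch below. It is not the code from this PR: the names `RANGE_KEYWORDS` and `split_range` are hypothetical, and whether the actual change uses `str.split` or a small regex is an assumption here. The key property is a single flat alternation with no nested quantifiers, so matching stays linear in the length of the text.

```python
import re

# Hypothetical keyword list; the description above mentions "to" and "through".
RANGE_KEYWORDS = ("to", "through")

# One flat alternation with word boundaries, so "October" is not split on "to".
_SPLIT_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(k) for k in RANGE_KEYWORDS),
    re.IGNORECASE,
)

def split_range(text):
    """Split text on range keywords and return the non-empty, stripped parts."""
    parts = _SPLIT_RE.split(text)
    return [p.strip() for p in parts if p.strip()]

print(split_range("January 1, 2018 to March 3, 2018"))
```

Each returned part can then be handed to the date parser on its own, so the range pattern never has to span both dates at once.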

The plain text referenced above took about 48 s to parse. After this code update, the same text takes only 0.007 s.

Could you consider applying my changes or, at least, changing the logic of splitting date ranges (RANGE_REGEX)?

Thank you in advance!

akoumjian commented 5 years ago

I like your changes, but please remove the optional type checking, since we're not currently using it in this package.