akoumjian / datefinder

Find dates inside text using Python and get back datetime objects
http://datefinder.readthedocs.org/en/latest/
MIT License

Simple range-finding patch based on nested regex #58

Closed. mjbommar closed this 2 years ago

mjbommar commented 7 years ago

@akoumjian, we know that you've worked on issue #18 in the past and haven't found a solution you're happy with. We have a use case similar to #18, #29, etc., and have been using the following modified approach to handle it. It is simple and English-specific, but it has worked for us on thousands of real documents.

There are two considerations here:

You can see the desired behavior exhibited here:

>>> import datefinder
>>> datefinder.find_dates("I left on January 1, 2017 and was gone from January 2, 2017 to January 31, 2017.")
<generator object find_dates at 0x7f105867b4b0>
>>> list(datefinder.find_dates("I left on January 1, 2017 and was gone from January 2, 2017 to January 31, 2017."))
[datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 2, 0, 0), datetime.datetime(2017, 1, 31, 0, 0)]

One option is to allow the user to enable/disable the range search. Another is to keep track of dates found in the range regex and exclude them from the non-range regex. A third would be to modify the single regex to optionally match one or more dates separated with these range delimiters.
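To make the first two options concrete, here is a minimal sketch. It is not datefinder's code and not the patch itself: the patterns `DATE_RE` and `RANGE_RE`, the function `find_dates_and_ranges`, and the `ranges` flag are all hypothetical names, and the regexes only cover English "Month D, YYYY" dates. The range regex is built by nesting the single-date pattern on both sides of a delimiter, and any single-date match that falls inside a range match is skipped.

```python
import re
from datetime import datetime

# Hypothetical, English-only patterns for illustration; these are NOT datefinder's regexes.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
DATE = rf"{MONTHS} \d{{1,2}}, \d{{4}}"
DATE_RE = re.compile(DATE)
# Nested regex: the single-date pattern on each side of a range delimiter.
RANGE_RE = re.compile(rf"(?:from\s+)?({DATE})\s+(?:to|through|until)\s+({DATE})")


def _parse(date_text):
    """Parse 'January 2, 2017' into a datetime (assumes an English locale)."""
    return datetime.strptime(date_text, "%B %d, %Y")


def find_dates_and_ranges(text, ranges=True):
    """Return single dates as datetimes and ranges as (start, end) tuples, in text order.

    `ranges` is a hypothetical switch for the enable/disable option.
    """
    found = []    # (position in text, value) pairs
    covered = []  # character spans already claimed by a range match

    if ranges:
        for m in RANGE_RE.finditer(text):
            found.append((m.start(), (_parse(m.group(1)), _parse(m.group(2)))))
            covered.append(m.span())

    for m in DATE_RE.finditer(text):
        # Skip dates that were already reported as part of a range.
        if any(start <= m.start() and m.end() <= end for start, end in covered):
            continue
        found.append((m.start(), _parse(m.group(0))))

    found.sort(key=lambda item: item[0])
    return [value for _, value in found]
```

With the example sentence above, this sketch returns January 1, 2017 as a single date followed by one (January 2, 2017, January 31, 2017) tuple. Passing `ranges=False` falls back to the flat list of three dates.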

For a variety of reasons, this is the one we settled on, and we thought that we'd share it in case it helps you or others.

akoumjian commented 2 years ago

There is a range implementation in now. I'm still not very happy with it, and I'll be looking at this implementation in comparison.