@akoumjian, we know that you've worked on issue #18 in the past and haven't found a solution you're happy with. We have a similar use case to #18, #29, etc., and have been using the following modified approach to resolve it. It is simple and English-specific, but it has worked for us on thousands of real documents.
There are two considerations here:
- situations where both "ranges" and non-range dates exist, e.g., "I left on January 1, 2017 and was gone from January 2, 2017 to January 31, 2017."
- performance of two regex passes instead of one
You can see the desired behavior exhibited here:
>>> import datefinder
>>> datefinder.find_dates("I left on January 1, 2017 and was gone from January 2, 2017 to January 31, 2017.")
<generator object find_dates at 0x7f105867b4b0>
>>> list(datefinder.find_dates("I left on January 1, 2017 and was gone from January 2, 2017 to January 31, 2017."))
[datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 2, 0, 0), datetime.datetime(2017, 1, 31, 0, 0)]
One option is to allow the user to enable/disable the range search. Another is to keep track of dates found in the range regex and exclude them from the non-range regex. A third would be to modify the single regex to optionally match one or more dates separated with these range delimiters.
For a variety of reasons, this is the one we settled on, and we thought that we'd share it in case it helps you or others.
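For anyone weighing the alternatives, here is a rough sketch of the second option mentioned above (remember what the range regex matched and exclude those spans from the non-range pass). It is not datefinder's code: `DATE`, `RANGE_RE`, `parse`, and `find_dates_and_ranges` are hypothetical names, the pattern only handles English dates of the form "January 2, 2017", and the delimiters ("to", "through", "-") are assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical, English-only pattern for dates such as "January 2, 2017".
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE = rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}"
DATE_RE = re.compile(DATE)
# A range is two dates joined by an assumed delimiter ("to", "through", "-").
RANGE_RE = re.compile(rf"({DATE})\s+(?:to|through|-)\s+({DATE})")


def parse(text):
    """Parse a date string matched by DATE (assumes an English locale)."""
    return datetime.strptime(text, "%B %d, %Y")


def find_dates_and_ranges(text):
    """Return (single_dates, ranges) using two regex passes."""
    ranges, covered = [], []
    # Pass 1: collect ranges and remember the character spans they occupy.
    for m in RANGE_RE.finditer(text):
        ranges.append((parse(m.group(1)), parse(m.group(2))))
        covered.append(m.span())
    # Pass 2: collect single dates, skipping any that sit inside a range span.
    singles = [
        parse(m.group(0))
        for m in DATE_RE.finditer(text)
        if not any(start <= m.start() and m.end() <= end for start, end in covered)
    ]
    return singles, ranges
```

On the example sentence, this keeps "January 1, 2017" as a standalone date and reports the other two dates once, as a single range, at the cost of the second pass noted in the considerations above.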