akoumjian / datefinder

Find dates inside text using Python and get back datetime objects
http://datefinder.readthedocs.org/en/latest/
MIT License

Day name false positives #99

Open garethsparks opened 5 years ago

garethsparks commented 5 years ago

False positives are expected, but:

list(datefinder.find_dates("information fripple"))
[datetime.datetime(2019, 1, 18, 0, 0)]

Obviously "information fripple" is nonsense, but it contains the string "on fri", so... that's Friday's date. That's going to give you a whole ton of false positives if you're using this on natural language text.

I tried clearing out EXTRA_TOKENS_PATTERN and REPLACEMENTS constants but couldn't get this behavior to stop without removing day of week names entirely. I'm not sure where that "on" token is coming from, but it needs something along the lines of a \b before it in the regex.
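To illustrate the `\b` suggestion, here is a minimal sketch using plain `re`. Both patterns are hypothetical simplifications written for this example and are not datefinder's actual regex; they only show how an unanchored "on" + day-name pattern can match inside "information fripple" and how word boundaries would prevent it.

```python
import re

# Hypothetical, simplified patterns: NOT datefinder's real regex.
# Unanchored: an optional "on " followed by a day abbreviation, anywhere in the text.
UNANCHORED = re.compile(r"(?:on\s+)?(?:friday|fri)", re.IGNORECASE)

# Anchored: the same tokens, but required to start and end on a word boundary.
ANCHORED = re.compile(r"\b(?:on\s+)?(?:friday|fri)\b", re.IGNORECASE)


def find_day_tokens(pattern, text):
    """Return every substring of `text` that `pattern` matches."""
    return [m.group(0) for m in pattern.finditer(text)]


# The unanchored pattern picks up "on fri" from "informati(on fri)pple".
print(find_day_tokens(UNANCHORED, "information fripple"))

# With word boundaries, nothing inside the nonsense phrase matches,
# while a genuine "on fri" still does.
print(find_day_tokens(ANCHORED, "information fripple"))
print(find_day_tokens(ANCHORED, "meet on fri afternoon"))
```

Whether the fix belongs in the day-name pattern, the "on" token, or both depends on how the library assembles its regex, which this sketch does not model.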

akoumjian commented 5 years ago

Let me take a look at the regex patterns to see if we can require a minimum of whitespace or delimiter characters there. That is a recently modified feature, so it is definitely over the top for false positives.

pgrenon commented 5 years ago

This is still weird when strict is False, as it returns the last day of the current month.

list(datefinder.find_dates("information fripple", strict=True))
[]
list(datefinder.find_dates("information fripple", strict=False))
[datetime.datetime(2019, 5, 31, 0, 0)]

Another case of a false positive, as seen in, for example, 'Page 1 of 1':

list(datefinder.find_dates("1 of", strict=True))
[]
list(datefinder.find_dates("1 of", strict=False))
[datetime.datetime(2019, 5, 1, 0, 0)]

Can a list of stop words or expressions be used to prevent parsing in certain cases?
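Until something like that exists in the library, one possible workaround is to mask known stop expressions before the text reaches the parser. The sketch below is only an assumption about how that could look: `STOP_PATTERNS` and `mask_stop_expressions` are hypothetical names made up for this example, not part of datefinder. Matches are overwritten with spaces rather than deleted, so character offsets in the rest of the text stay the same.

```python
import re

# Hypothetical stop expressions; extend the list for your own documents.
STOP_PATTERNS = [
    re.compile(r"\bpage\s+\d+\s+of\s+\d+\b", re.IGNORECASE),  # e.g. "Page 1 of 1"
]


def mask_stop_expressions(text, patterns=STOP_PATTERNS):
    """Replace each stop-expression match with spaces of the same length."""
    for pattern in patterns:
        text = pattern.sub(lambda m: " " * len(m.group(0)), text)
    return text


masked = mask_stop_expressions("Page 1 of 1 printed 2019-05-01")
print(repr(masked))
```

The masked string could then be handed to `datefinder.find_dates` in place of the original. I have not verified that this removes every false positive reported above; it only keeps the listed phrases away from the parser.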