akoumjian / datefinder

Find dates inside text using Python and get back datetime objects
http://datefinder.readthedocs.org/en/latest/
MIT License

Date parse issue #166

Closed: archerne closed this issue 2 years ago

archerne commented 2 years ago

Any idea why "2020 FEB 10 PM 4:52" parses as 2020-02-09 04:52:00?

akoumjian commented 2 years ago

Your 9 comes from the date on which you ran the code. The parser didn't recognize the 10 as a day of the month, so it fell back on the base_date that datefinder hands to dateutil as its default (today when none is given).
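To illustrate the fallback described above: dateutil's parser takes a default datetime and replaces only the fields it recognized in the text, so any field it missed keeps the default's value. The sketch below mimics that overlay with the standard library only. The helper name fill_from_default and the dictionary of recognized fields are hypothetical, chosen for illustration; this is not datefinder's or dateutil's actual code.

```python
from datetime import datetime


def fill_from_default(fields, default):
    """Overlay the fields a parser recognized onto a default datetime.

    Hypothetical helper: any field missing from `fields` keeps the
    value it has in `default`.
    """
    return default.replace(**fields)


# Assumption for illustration: the parser recognized the year, the month
# and the 4:52 time, but never identified a day of the month.
recognized = {"year": 2020, "month": 2, "hour": 4, "minute": 52}

# Stand-in for "today at midnight" on the day the code was run.
run_date = datetime(2020, 2, 9)

result = fill_from_default(recognized, run_date)
print(result)  # 2020-02-09 04:52:00
```

Running the same overlay on a different day changes only the day field, which is why the result depends on when the code is run.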

The position of the PM is throwing it off. It is highly unusual to place PM both before the hours/minutes and right after a date. Here is what happens if you run it without the time at the end:

In [15]: text = "2020 FEB 10 PM"

In [16]: print(next(datefinder.find_dates(text, source=True)))
(datetime.datetime(2020, 2, 29, 22, 0), '2020 FEB 10 PM')

You can see that it saw 10 PM as it was making its way through the text and rightly cast it as a time. It didn't see anything that looked like a day of the month, so it defaulted to today's (the 29th in this run).

Then when the regex finds 4:52, it says "wait, this is a time!" and uses that time instead, either because it is more detailed or because it is found later and overwrites the earlier one. So your 10 gets dropped altogether.
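One possible workaround, not proposed in the thread, is to move the AM/PM marker behind the time before parsing, so that "10" is no longer adjacent to "PM". The sketch below uses only the standard library. The function name move_meridiem_after_time and the regular expression are assumptions made for illustration, and the strptime format string assumes the input always has this exact layout and an English locale.

```python
import re
from datetime import datetime

# Matches an AM/PM marker that comes *before* an H:MM time, e.g. "PM 4:52".
MERIDIEM_FIRST = re.compile(r"\b(AM|PM)\s+(\d{1,2}:\d{2})\b", re.IGNORECASE)


def move_meridiem_after_time(text):
    """Rewrite 'PM 4:52' as '4:52 PM'; any other text is left unchanged."""
    return MERIDIEM_FIRST.sub(r"\2 \1", text)


normalized = move_meridiem_after_time("2020 FEB 10 PM 4:52")
print(normalized)  # 2020 FEB 10 4:52 PM

# With the marker after the time, a fixed format string reads the 10 as
# the day of the month (assumes an English locale for %b and %p).
parsed = datetime.strptime(normalized, "%Y %b %d %I:%M %p")
print(parsed)  # 2020-02-10 16:52:00
```

The normalized string could also be passed to datefinder.find_dates instead of strptime; whether datefinder then picks up the 10 as the day has not been verified here.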