akoumjian / datefinder

Find dates inside text using Python and get back datetime objects
http://datefinder.readthedocs.org/en/latest/
MIT License
635 stars, 167 forks

Incorrect datetime when adding 4 digits integer to the input text #146

Open sanchezg opened 3 years ago

sanchezg commented 3 years ago

Hello everyone, I got this incorrect result using v0.7.1:

In [8]: datefinder.find_dates('Your item arrived at APO, AE 09123 on January 25, 2021 at 7:02 pm.').__next__()
Out[8]: datetime.datetime(9123, 1, 25, 19, 2)

As you can see, the year is parsed incorrectly. The expected output is: datetime.datetime(2021, 1, 25, 19, 2). Can anybody point me to the regex or pattern that I should look into to fix this problem?
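
Until the pattern is fixed, a caller-side sanity check can at least flag results like the one above. This is only a minimal sketch: has_plausible_year and its 1900-2100 window are hypothetical and not part of datefinder, and the two datetimes are copied from the report rather than produced by calling the library.

```python
from datetime import datetime


def has_plausible_year(dt, min_year=1900, max_year=2100):
    """Return True when dt.year lies inside the assumed window (hypothetical helper)."""
    return min_year <= dt.year <= max_year


# Values copied from the report: what was returned vs. what was expected.
reported = datetime(9123, 1, 25, 19, 2)
expected = datetime(2021, 1, 25, 19, 2)

# Collect every result whose year falls outside the window.
flagged = [dt for dt in (reported, expected) if not has_plausible_year(dt)]
print(flagged)
```

This only detects the bad result; it does not recover the correct date.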

sanchezg commented 3 years ago

I just figured out that the parse_date_string method receives:

ipdb> match_str
'09123 on January 25, 2021 at 7:02 pm'
ipdb> captures
{'undelimited_stamps': [], 'years': ['2021'], 'months': ['January'], 'days': [], 'hours': ['7'], 'minutes': ['02'], 'seconds': [], 'microseconds': [], 'offset': [], 'time': ['7:02 pm'], 'time_periods': ['pm'], 'timezones': [], 'numbers': [], 'digits': ['09123', '25'], 'digits_suffixes': [], 'delimiters': [' ', ' ', ' ', ' ', ', ', ' ', ' ', '. '], 'positionnal_tokens': [], 'extra_tokens': ['on', 'at']}

And it is parser.parse that takes 09123 as a valid year. Maybe the tokenizer isn't working well? Or maybe we should replace tokens (while they are valid) before trying to convert the string using the parser module?
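
One way to picture the "replace tokens before parsing" idea is to drop over-long digit runs from match_str whenever captures already holds an explicit year. This is a rough sketch, not datefinder's code: drop_long_digit_runs and the max_len=4 cutoff are hypothetical, and only the match_str and captures values come from the debugger session above.

```python
import re


def drop_long_digit_runs(match_str, captures, max_len=4):
    """Remove digit tokens longer than max_len, but only when an explicit
    year was already captured (hypothetical helper, not datefinder API)."""
    if not captures.get("years"):
        # No explicit year found, so leave the string untouched.
        return match_str
    cleaned = match_str
    for token in captures.get("digits", []):
        if len(token) > max_len:
            # Word boundaries keep the removal to the standalone token.
            pattern = r"\b" + re.escape(token) + r"\b"
            cleaned = re.sub(pattern, " ", cleaned, count=1)
    # Collapse the whitespace left behind by the removal.
    return " ".join(cleaned.split())


# Values taken from the ipdb output above (only the relevant keys).
match_str = "09123 on January 25, 2021 at 7:02 pm"
captures = {"years": ["2021"], "digits": ["09123", "25"]}

print(drop_long_digit_runs(match_str, captures))
# prints: on January 25, 2021 at 7:02 pm
```

The cleaned string could then be passed to the parser instead of the raw match. The cutoff is a guess: a ZIP code such as 09123 has five digits, while a day or year has at most four.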