akoumjian / datefinder

Find dates inside text using Python and get back datetime objects
http://datefinder.readthedocs.org/en/latest/
MIT License
634 stars, 166 forks

Incorrect dates getting extracted for OCR errored cases #173

Closed vibhas-singh closed 2 years ago

vibhas-singh commented 2 years ago

The parser outputs incorrect dates when the input contains a slight OCR error (such as the digit 1 being replaced by the letter l).

import datefinder

date_string = "January 11, 2028"  # correct input: digit 1
list(datefinder.find_dates(date_string))
# [datetime.datetime(2028, 1, 11, 0, 0)]

date_string = "January l1, 2028"  # OCR error: first digit 1 replaced by the letter l
list(datefinder.find_dates(date_string))
# [datetime.datetime(2028, 1, 4, 0, 0)]

Ideally, it shouldn't be extracting any date in the second case - but it is silently giving the wrong date. Is there any way to handle such cases explicitly?

khanfarhan10 commented 2 years ago

Practically speaking, this is a feature rather than a bug!

vibhas-singh commented 2 years ago

@khanfarhan10 How I would like it to work: throw an error (or output no dates at all) if it cannot parse the dates correctly, so the user can handle that accordingly. Silently giving incorrect dates makes it very difficult to use it as-is on a variety of data without risking the performance.

Also, I cannot work out where it is getting the day as 4 from.

akoumjian commented 2 years ago

@vibhas-singh

"Silently giving incorrect dates makes it very difficult to use it as-is on a variety of data without risking the performance."

This library is designed to extract as many possible dates as it can from freeform text. By default, this includes incomplete dates (e.g. Jan 1987) and dates in odd formats (e.g. 2020 Feb., 7th). When something doesn't look like a date, we simply move on. There is no sense in throwing errors on random numbers or words (e.g. 5555-33-01); that would make the library unusable for many applications. If you have your date string isolated from the rest of your text, you can use tools like dateparser to throw an error if it doesn't parse correctly.
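
For an isolated date string, a strict parse can be sketched with the standard library alone. This swaps in `datetime.strptime` for the dateparser-style tool mentioned above; the format `%B %d, %Y` and the helper name `parse_isolated_date` are assumptions chosen to fit the example from this issue, not part of datefinder:

```python
from datetime import datetime


def parse_isolated_date(date_string, fmt="%B %d, %Y"):
    """Parse a date string that must match `fmt` exactly.

    Hypothetical helper: raises ValueError instead of guessing when the
    string does not match, e.g. because of an OCR error.
    """
    return datetime.strptime(date_string.strip(), fmt)


print(parse_isolated_date("January 11, 2028"))

try:
    parse_isolated_date("January l1, 2028")  # letter l instead of digit 1
except ValueError as exc:
    print("rejected:", exc)
```

The trade-off is that the expected format has to be known in advance, which is exactly what freeform extraction avoids.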

The reason you are getting a day of month of 4 has to do with matching an incomplete date. Inside "January l1, 2028" the library found January and 2028 and did its best with them. Since a datetime has to have a day of the month and not just a month and year, the underlying dateparser library fills in the gaps from a base date. You can override the base date here: https://github.com/akoumjian/datefinder/blob/master/datefinder/__init__.py#L320
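
The gap filling described above can be illustrated with a small standard-library sketch. It is not datefinder's actual code: `fill_from_base` is a hypothetical helper, and the base date of 4 May 2022 is an assumption (if the base date defaults to the day the code is run, a run on the 4th of any month would presumably explain the day of 4). The `base_date` keyword in the second function is the override referred to above.

```python
from datetime import datetime


def fill_from_base(base, year=None, month=None, day=None):
    """Hypothetical illustration: take missing fields from a base date."""
    return base.replace(
        year=base.year if year is None else year,
        month=base.month if month is None else month,
        day=base.day if day is None else day,
    )


# Only the month (January) and year (2028) were recognised; the day is missing.
assumed_base = datetime(2022, 5, 4)  # assumed base date, for illustration only
print(fill_from_base(assumed_base, year=2028, month=1))  # 2028-01-04 00:00:00


def find_dates_with_fixed_base(text):
    """Run datefinder with an explicit base date (requires datefinder)."""
    import datefinder  # third-party; imported lazily

    return list(datefinder.find_dates(text, base_date=datetime(2000, 1, 1)))
```

Pinning the base date makes filled-in fields reproducible from run to run, but it does not make them correct.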

You have a couple of different options, as I see it. If you use strict=True with datefinder, you won't get matches for dates that don't consist of at least a year, month, and day of month. In your original example, it would simply skip over it. Since you are trying to extract dates from bad-quality data (the date you want is literally not in your text because of the OCR error), you could alternatively use source=True when you call datefinder and pass every result through a custom heuristic function or a manual review process. This way you can look at the originally matched text "January l1, 2028" and decide what to do with it.
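
One possible shape for such a heuristic is sketched below. The helper names and the rule itself (the matched text must contain a standalone 1-2 digit day and a standalone 4-digit year) are assumptions made for illustration; only `find_dates` and its `source` and `strict` keywords come from datefinder.

```python
import re

# Standalone 1-2 digit number, optionally with an ordinal suffix (1st, 22nd).
_DAY_TOKEN = re.compile(r"(?<![A-Za-z0-9])\d{1,2}(?:st|nd|rd|th)?(?![A-Za-z0-9])")
# Standalone 4-digit number, treated as a year.
_YEAR_TOKEN = re.compile(r"(?<![A-Za-z0-9])\d{4}(?![A-Za-z0-9])")


def has_explicit_day_and_year(matched_text):
    """Hypothetical check: does the matched text spell out a day and a year?"""
    return bool(_DAY_TOKEN.search(matched_text) and _YEAR_TOKEN.search(matched_text))


def split_matches(matches):
    """Split (datetime, matched_text) pairs into accepted and needs-review."""
    accepted, review = [], []
    for parsed, matched_text in matches:
        target = accepted if has_explicit_day_and_year(matched_text) else review
        target.append((parsed, matched_text))
    return accepted, review


def find_checked_dates(text, strict=False):
    """Run datefinder with source=True and triage the results."""
    import datefinder  # third-party; imported lazily

    return split_matches(datefinder.find_dates(text, source=True, strict=strict))
```

With this rule, a matched text of "January l1, 2028" lands in the review list because "l1" is not a standalone number. Dates with two-digit years would also be sent to review, so the rule would need adjusting for data that uses them.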

akoumjian commented 2 years ago

I would also consider investigating libraries that use statistical models to help reduce OCR errors like the one above.