akoumjian / datefinder

Find dates inside text using Python and get back datetime objects
http://datefinder.readthedocs.org/en/latest/
MIT License
635 stars, 167 forks

Your code mistakes ages for a year value #164

Closed: gkuling closed this issue 2 years ago

gkuling commented 2 years ago

Hello, I am using your package in my project to analyze hospital records and I have noticed an interesting bug I thought I'd share. It identifies ages in the text and converts them into a year value. For example:

test = 'Clinical history: 52-year-old man has...'

will be identified as

datefinder.find_dates(test, base_date=dt(2022, 7, 15))
> datetime.datetime(1952, 7, 15, 0, 0)

Also, the most recent install with pip doesn't have the 'first' parameter option in the DateFinder init function.

Great package btw, thank you.
Signed, Grey

akoumjian commented 2 years ago

I suggest using the strict=True parameter to make it pickier. Some people are looking for almost any date-related value, and unfortunately that often produces a lot of false positives. With strict=True, only dates that appear to have a year, a month, and a day of the month are returned.

PyPI has been updated as well!

akoumjian commented 2 years ago

I'll also note that an alternative approach is to skip strict=True and use source=True instead. This returns the original text that matched alongside each date, so you can run your own heuristics to decide whether to accept or reject it.

In [2]: text = 'Clinical history: 52-year-old man has...'

In [3]: print(list(datefinder.find_dates(text, source=True)))
[(datetime.datetime(2052, 8, 3, 0, 0), '52')]

You could do analysis on the source string of "52" and decide it's not sufficient.
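One way to sketch such a heuristic is below. It is only an illustration, not part of datefinder: the names is_plausible_date_source, filter_matches, and find_plausible_dates are hypothetical, and the rule itself (reject any source string that is just a bare number of up to four digits) is an assumption you would tune for your own data. The only datefinder call used is find_dates with source=True, as shown above.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Assumed rule: a source string consisting only of 1-4 digits (e.g. "52")
# is too weak to count as a date on its own.
_BARE_NUMBER = re.compile(r"^\d{1,4}$")


def is_plausible_date_source(source: str) -> bool:
    """Hypothetical heuristic: reject matches whose source text is a bare number."""
    return not _BARE_NUMBER.match(source.strip())


def filter_matches(
    matches: Iterable[Tuple[datetime, str]]
) -> List[Tuple[datetime, str]]:
    """Keep only the (datetime, source) pairs whose source passes the heuristic."""
    return [(date, source) for date, source in matches
            if is_plausible_date_source(source)]


def find_plausible_dates(text: str) -> List[Tuple[datetime, str]]:
    """Run datefinder with source=True and apply the heuristic to the results."""
    import datefinder  # imported here so the helpers above work without it
    return filter_matches(datefinder.find_dates(text, source=True))
```

With this rule, a match whose source is "52" is dropped, while a source string such as "July 15, 2022" is kept. Note that a bare four-digit year such as "1999" is also dropped, which may or may not be what you want.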

In [4]: print(list(datefinder.find_dates(text, source=True, strict=True)))
[]

You can see the strict flag does not consider it a match. Keep in mind, though, that strict mode will also skip incomplete date strings that you might consider valid.
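If strict mode turns out to be too restrictive for a given data set, another option (not something datefinder provides, just a preprocessing idea) is to blank out obvious age phrases before the text reaches find_dates. The sketch below is hypothetical: the function name mask_ages and the regular expression are assumptions, and the pattern only covers spellings like "52-year-old" or "52 years old". Matches are replaced with spaces of the same length so that character offsets in the text stay the same.

```python
import re

# Assumed pattern for age phrases: 1-3 digits, then "year(s)" and "old",
# joined by hyphens or whitespace, e.g. "52-year-old" or "52 years old".
_AGE_PATTERN = re.compile(r"\b\d{1,3}[-\s]years?[-\s]old\b", re.IGNORECASE)


def mask_ages(text: str) -> str:
    """Replace age phrases with spaces, keeping the text length unchanged."""
    return _AGE_PATTERN.sub(lambda m: " " * len(m.group(0)), text)


if __name__ == "__main__":
    record = "Clinical history: 52-year-old man has..."
    # The masked text could then be passed to datefinder.find_dates(...)
    print(repr(mask_ages(record)))
```

Whether this is good enough depends on how ages are written in your records; other forms such as "52 y/o" would need their own patterns.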