Extract Only Year from text

swathimithran commented 7 years ago

Thanks for this great project. Currently I am able to extract dates, but when the text contains only a year, for example "In year 2011 the incident happened.", the program returns "2011-01-01 00:00:00+00".

But we need to retrieve it as "2011-01-01 12:14:12+00". Can you please let me know what I should change in the library to achieve this?

The basic aim is to differentiate an original "1st Jan 2011" from a bare "2011".

Thanks

DanielJDufour commented 7 years ago

Great question. Give me 24 hours and I'll have a solution for you :)

swathimithran commented 7 years ago

Thanks man!!!!!! waiting for your reply :)

DanielJDufour commented 7 years ago

Hey, I thought about it a lot and this is what I came up with. You can set return_precision to True and the functions will return a tuple of (date, precision). Precision can be "year", "month", or "day". So precision is "day" for "1st Jan 2011" and "year" for "2011". Consult the Readme for a full example. Let me know if this doesn't work for you and we can work on another solution! Thanks for your interest!
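
As a rough illustration of how the (date, precision) tuples described above could be consumed, here is a minimal sketch. It does not call the library: the `results` list is hand-built to mimic the described return value, the UTC timezone is assumed from the "+00" offset quoted earlier in the thread, and `format_by_precision` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Hand-built stand-ins for the (date, precision) tuples described above.
# They are NOT produced by the library here.
results = [
    (datetime(2011, 1, 1, tzinfo=timezone.utc), "day"),   # from "1st Jan 2011"
    (datetime(2011, 1, 1, tzinfo=timezone.utc), "year"),  # from "2011"
]

def format_by_precision(date, precision):
    """Render only the parts of the date that the text actually specified."""
    formats = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}
    return date.strftime(formats[precision])

labels = [format_by_precision(date, precision) for date, precision in results]
```

The two datetimes are identical, so the precision string is the only thing that tells "1st Jan 2011" apart from "2011".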

swathimithran commented 7 years ago

Hi Daniel, thanks for the immediate update. I tried it and it's working perfectly for our requirement. One more issue I am facing is the normalisation of two-digit years in the format "98".

For example, my text is:

He was selected by the Sacramento Kings in the 2nd round (48th overall) of the 2004 NBA_Draft. A 6'4' guard from Morehead State University, Minard was signed by the Kings in July 2004, but they waived him in November the same year, and so far he has never appeared in an NBA game.

Output:

[(datetime.datetime(1948, 1, 1, 0, 0, tzinfo=), 'year'), (datetime.datetime(2004, 1, 1, 0, 0, tzinfo=), 'year'), (datetime.datetime(2004, 7, 1, 0, 0, tzinfo=), 'month')]

So here the year 1948 should not have been extracted.

I think we can solve this issue if we implement logic to normalise only those two-digit numbers which are preceded by "-" and not followed by "th".

Please let me know if you have any other solution to resolve this.

Thanks & Regards, M Swathi Mithran

DanielJDufour commented 7 years ago

@swathimithran, thanks for the example. Your help is sincerely appreciated! As a quick fix, I made it so it won't capture ordinal numbers that end in th. You can view the change here: https://github.com/DanielJDufour/date-extractor/commit/68deab4d342cec33e690e19cb0562e9ae0b52b44
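
The idea of skipping ordinals can be sketched with a negative lookahead. This is not the code from the commit linked above. It is a standalone illustration, and covering "st", "nd", and "rd" in addition to "th" is an assumption added here. `TWO_DIGIT_YEAR` and `candidate_two_digit_years` are hypothetical names.

```python
import re

# Two digits that are not part of a longer number and are not
# followed by an ordinal suffix (assumption: st/nd/rd/th).
TWO_DIGIT_YEAR = re.compile(r"(?<!\d)\d{2}(?!\d)(?!st|nd|rd|th)", re.IGNORECASE)

def candidate_two_digit_years(text):
    """Return two-digit substrings that could plausibly be years."""
    return TWO_DIGIT_YEAR.findall(text)

from_prose = candidate_two_digit_years(
    "in the 2nd round (48th overall) of the 2004 NBA_Draft"
)
from_filename = candidate_two_digit_years("taxes_16.docx")
```

With this rule "48th" is no longer a candidate, while the "16" in a filename still is.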

However, we will need more discussion on what rule should be used for what precedes the 2 digits. Here are a few examples of 2-digit years:

12/23/09
15-11-21
9/1/99 22:00
paper_170120 (in a filename)
taxes_16.docx

Basically, I'm afraid that if we make the rule too strict, people won't be able to parse dates out of filenames.

Here are a few possible solutions. Which would you like?

Option 1

Add a parameter source_type, which could be filename, filepath, text, html, or javascript. This way you could restrict the rules to certain types of sources. Here's an example of what this could look like:

from date_extractor import extract_date
string = "I went to my first basketball game in 1990.  My favorite player had number 34."
date = extract_date(string, source_type="text")

string = "my_resume_091216.pdf"
date = extract_date(string, source_type="filename")
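
One way such a parameter might work internally is a table of per-source patterns. The sketch below is purely hypothetical: `YEAR_PATTERNS` and `find_year_candidates` are invented for illustration, only year candidates are handled, and compact forms such as 091216 would need an additional pattern.

```python
import re

# Hypothetical per-source rules: prose must contain a four-digit year,
# while filenames may also use a two-digit year delimited by non-digits.
YEAR_PATTERNS = {
    "text": re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)"),
    "filename": re.compile(r"(?<!\d)(?:(?:19|20)\d{2}|\d{2})(?!\d)"),
}

def find_year_candidates(string, source_type="text"):
    """Return year-like substrings using the rules for the given source type."""
    return YEAR_PATTERNS[source_type].findall(string)

prose = "I went to my first basketball game in 1990.  My favorite player had number 34."
in_text = find_year_candidates(prose, source_type="text")
in_filename = find_year_candidates("taxes_16.docx", source_type="filename")
```

Under the stricter "text" rules the jersey number 34 is ignored, while the looser "filename" rules still pick up the 16 in taxes_16.docx.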

Option 2

The second option could be allowing users to override existing patterns.

import date_extractor
date_extractor.patterns['y'] = r"\d{4}"  # raw string so \d is not treated as a string escape
string = "The author asserts that the earliest encounter never happened (43)"
date_extractor.extract_dates(string)
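
The effect of such an override can be shown with a standalone stand-in. The `patterns` dictionary and `find_years` function below are hypothetical and do not reflect the library's internals. The only assumption is that the year pattern is looked up at call time, so later changes take effect.

```python
import re

# Stand-in for a user-editable pattern table (hypothetical).
# The default accepts either four-digit or two-digit years.
patterns = {"y": r"\d{4}|\d{2}"}

def find_years(text):
    """Find year candidates using whatever pattern is currently registered."""
    # Compiled on every call so that overrides made later are honoured.
    regex = re.compile(r"(?<!\d)(?:" + patterns["y"] + r")(?!\d)")
    return regex.findall(text)

text = "The author asserts that the earliest encounter never happened (43)"
before = find_years(text)   # default pattern picks up the citation number
patterns["y"] = r"\d{4}"    # override: four-digit years only
after = find_years(text)
```

After the override, the citation number "(43)" is no longer reported as a year.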

Option 3

The third option could be returning a confidence level, low, medium, or high. You could then filter depending on your need.

from date_extractor import extract_dates
string = "He was selected by the Sacramento Kings in the 2nd round (48th overall) of the 2004 NBA_Draft. A 6'4' guard from Morehead State University, Minard was signed by the Kings in July 2004, but they waived him in November the same year, and so far he has never appeared in an NBA game."
dates = extract_dates(string, return_confidence=True)
# dates = [(datetime.datetime(1948, 1, 1, 0, 0, tzinfo=), 'low'), (datetime.datetime(2004, 1, 1, 0, 0, tzinfo=), 'medium'), (datetime.datetime(2004, 7, 1, 0, 0, tzinfo=), 'high')]
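
Filtering on the caller's side could then look like the sketch below. The `dates` list is hand-built to mirror the proposed output above (with UTC assumed for the timezone), and `CONFIDENCE_RANK` and `filter_by_confidence` are hypothetical helpers, not part of the library.

```python
from datetime import datetime, timezone

# Order the proposed confidence labels so they can be compared.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

def filter_by_confidence(dates, minimum="medium"):
    """Keep only (date, confidence) tuples at or above the minimum level."""
    threshold = CONFIDENCE_RANK[minimum]
    return [(d, c) for d, c in dates if CONFIDENCE_RANK[c] >= threshold]

# Hand-built stand-in for the proposed return value.
dates = [
    (datetime(1948, 1, 1, tzinfo=timezone.utc), "low"),
    (datetime(2004, 1, 1, tzinfo=timezone.utc), "medium"),
    (datetime(2004, 7, 1, tzinfo=timezone.utc), "high"),
]
kept = filter_by_confidence(dates)
```

With the default threshold of "medium", the spurious 1948 entry is dropped and both 2004 dates are kept.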

Option 4

I'm open to suggestions as long as it doesn't prevent users from extracting years out of filenames.

Which do you prefer? What do you think?
