bear / parsedatetime

Parse human-readable date/time strings
Apache License 2.0
694 stars 107 forks

Some not recognized date formats #190

Open danqing opened 8 years ago

danqing commented 8 years ago

This should return 25th of last month, not 1st:

In [8]: CAL.nlp('25 of last month')
Out[8]: ((datetime.datetime(2016, 8, 1, 9, 0), 1, 6, 16, 'last month'),)

This should return Sep 7:

In [9]: CAL.nlp('wednesday sep 7')
Out[9]: ((datetime.datetime(2017, 9, 13, 1, 4, 51), 1, 0, 15, 'wednesday sep 7'),)

(arguably it's a bit tricky here - it needs to know that wednesday and sep 7 are the same day)

This should return 9/6:

In [10]: CAL.nlp('6th')

(The month should be implicit here.)

idpaterson commented 8 years ago

Thanks for the examples, the wednesday sep 7 case is particularly bizarre. It appears to be first resolving Sep 7 which, since today is Sep 8 2016, gives Sep 7 2017, then finding the next Wednesday relative to that date.

In [11]: CAL.nlp('wednesday sep 8')
Out[11]: ((datetime.datetime(2016, 9, 14, 9, 18, 17), 1, 0, 15, 'wednesday sep 8'),)
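
The two-step resolution suspected above can be reproduced with the standard library alone. This is only a sketch of the hypothesized behavior, not parsedatetime's actual code: `next_month_day` and `next_weekday` are hypothetical helpers, and treating "next Wednesday" as strictly after the resolved date is an assumption.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() numbering: Monday is 0


def next_month_day(today, month, day):
    """Resolve a month/day to its next occurrence on or after `today`."""
    candidate = date(today.year, month, day)
    if candidate < today:
        candidate = date(today.year + 1, month, day)
    return candidate


def next_weekday(start, weekday):
    """Return the first date strictly after `start` that falls on `weekday`."""
    days_ahead = (weekday - start.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7
    return start + timedelta(days=days_ahead)


today = date(2016, 9, 8)  # the day these comments were written

# 'wednesday sep 7': Sep 7 has already passed, so it rolls over to 2017 first.
print(next_weekday(next_month_day(today, 9, 7), WEDNESDAY))  # 2017-09-13

# 'wednesday sep 8': Sep 8 is today, so the year stays 2016.
print(next_weekday(next_month_day(today, 9, 8), WEDNESDAY))  # 2016-09-14
```

Both results match the dates in the Out[9] and Out[11] lines, which is consistent with the explanation but does not prove it.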

I question whether 6th should be interpreted as a date. In order to handle that I think we would have to consider an alternate mode that can be used when operating on data that is guaranteed to contain a date versus freeform data that may or may not. The former case may enable additional constructs that can be used to represent a date but may more commonly just be plain numbers or words. There may have been a discussion about this in the past regarding non-separated numeric dates (e.g. parsing 1204 as December 4) but I can't find it.

bear commented 8 years ago

I've always thought (and maybe never wrote it down for everyone to also know... :/ ) that we have a minimum resolution setting: for a day to be matched as a date, the code must also find at least one year or month alongside it:

"6th" == no
"6th 2016" == yes
"6th Sep" == yes
"Sep" == yes

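The minimum-resolution rule above could be approximated as a pre-filter in caller code. This is a rough sketch, not anything parsedatetime provides: `meets_minimum_resolution` is a hypothetical helper, the month list is English-only, and recognising only four-digit years starting with 19 or 20 is an assumption.

```python
import re

# English month names and common abbreviations (an assumption; a real
# implementation would take these from the locale).
MONTH_RE = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b",
    re.IGNORECASE,
)
# Four-digit years in the 1900s or 2000s.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def meets_minimum_resolution(phrase):
    """Return True if the phrase names a month or a year."""
    return bool(MONTH_RE.search(phrase) or YEAR_RE.search(phrase))


for phrase in ("6th", "6th 2016", "6th Sep", "Sep"):
    print(phrase, meets_minimum_resolution(phrase))
```

Note that "may" still slips through as a month here, which is the same kind of ambiguity discussed for "sat" and "sun" later in the thread.
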
idpaterson commented 8 years ago

That sounds like a good contender for how to approach this, @bear. You could almost do what you described right now with pdtContext, except that the accuracy flags do not convey any details about the ambiguity of the actual text that matched.

So unfortunately, while you could exclude terms where accuracy == pdtContext.ACU_DAY, that would drop not only 6th (assuming pdt parsed that as a date) but also unambiguous terms like today and Saturday. That's a pretty good solution for people parsing data that contains only absolute dates, but it won't work for every use case.
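
To make the trade-off concrete, here is a toy model of that filter. The `Accuracy` flags below are stand-ins defined for this sketch, not parsedatetime's real constants or values, and the accuracy assigned to each phrase is hand-labelled from the description above rather than produced by the parser.

```python
from enum import IntFlag


class Accuracy(IntFlag):
    """Stand-in flags for this sketch; not parsedatetime's actual values."""
    YEAR = 1
    MONTH = 2
    DAY = 4


# Hand-labelled examples (an assumption, not real parser output).
matches = [
    ("6th", Accuracy.DAY),
    ("today", Accuracy.DAY),
    ("Saturday", Accuracy.DAY),
    ("Sep 7", Accuracy.MONTH | Accuracy.DAY),
    ("6th 2016", Accuracy.YEAR | Accuracy.DAY),
]


def drop_day_only(items):
    """Keep only matches whose accuracy is more than a bare day."""
    return [text for text, accuracy in items if accuracy != Accuracy.DAY]


kept = drop_day_only(matches)
print(kept)  # ['Sep 7', '6th 2016']
```

The filter removes "6th" as intended, but "today" and "Saturday" go with it, because the flag records how precise the result is and not how ambiguous the matched text was.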

A measure of ambiguity could be used to ignore standalone numeric dates as well as questionable abbreviations like sat, which may just be the past tense of sit, and sun, which may be a reference to our nearest star. It gets quite tricky to think of a proper way to do that.

I found our earlier conversation that touched on these ideas for reference. It's worth starting a new thread to discuss this specific concern to see if anyone is actually interested in a solution.

danqing commented 8 years ago

It is indeed tricky - Wednesday Sept 7 should be 9/7 because it is indeed a Wednesday, but Wednesday Sept 8 is "wrong" so can be anything.
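
One way to picture the desired behavior is to treat the weekday as a constraint on the month/day instead of a second date to resolve. This is only a sketch: `resolve_with_weekday` is a hypothetical helper, and limiting the search to the current, next, and previous year is an assumption.

```python
from datetime import date

WEDNESDAY = 2  # date.weekday() numbering: Monday is 0


def resolve_with_weekday(today, weekday, month, day):
    """Return the month/day in a nearby year that falls on `weekday`, else None."""
    for year in (today.year, today.year + 1, today.year - 1):
        try:
            candidate = date(year, month, day)
        except ValueError:
            continue  # e.g. Feb 29 in a non-leap year
        if candidate.weekday() == weekday:
            return candidate
    return None  # weekday and date disagree; the caller decides what to do


today = date(2016, 9, 8)
print(resolve_with_weekday(today, WEDNESDAY, 9, 7))  # 2016-09-07
print(resolve_with_weekday(today, WEDNESDAY, 9, 8))  # None
```

Returning None for the inconsistent case leaves the "can be anything" decision to the caller.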

There's another thing - some datetimes are intrinsically vague, such as "Wednesday". It can be this Wednesday, last Wednesday, etc. Is there a way to signal it? Similarly, if I say 3/25, it can be this year or next year.

idpaterson commented 8 years ago

There are a few flags to control how those relative dates are handled. It is up to the developer to choose the correct parse style for the input data. For example, in a calendar app, input dates will be in the future 99% of the time, so you would assume Tuesday means next Tuesday. A diary app might primarily deal with retrospection, where Tuesday is more likely to mean Tuesday of this week.
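
The idea of a parse style can be illustrated without the library. In this sketch the `style` parameter and `resolve_weekday` are hypothetical and do not mirror parsedatetime's actual flags; the convention assumed here is +1 for the next occurrence, 0 for the current Monday-to-Sunday week, and -1 for the most recent past occurrence.

```python
from datetime import date, timedelta

TUESDAY, FRIDAY = 1, 4  # date.weekday() numbering: Monday is 0


def resolve_weekday(today, weekday, style):
    """Resolve a bare weekday name relative to `today` under a given style."""
    offset = weekday - today.weekday()  # position within the current week
    if style > 0 and offset <= 0:
        offset += 7  # calendar app: always look forward
    elif style < 0 and offset >= 0:
        offset -= 7  # diary app: always look back
    return today + timedelta(days=offset)


today = date(2016, 9, 8)  # a Thursday
print(resolve_weekday(today, TUESDAY, 1))   # 2016-09-13
print(resolve_weekday(today, TUESDAY, 0))   # 2016-09-06
print(resolve_weekday(today, FRIDAY, -1))   # 2016-09-02
```

Whether "today" itself counts as a match for the forward and backward styles is a design choice; this sketch excludes it in both.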

danqing commented 8 years ago

Yeah - although I think having that piece of information (whether the date is ambiguous and in what way) may be helpful too. Sometimes maybe dynamic selection is needed, and if the app knows the logic better it can make the selection then.

For example it may be one canonical date + a cadence (Sept. 14, yearly; Sept. 14, weekly, etc.) and the caller can shift as appropriate.
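
A rough sketch of what such a result could look like, purely to illustrate the proposal. `RecurringDate` and its `cadence` strings are hypothetical and not part of parsedatetime; only the forward shift is shown.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RecurringDate:
    """One canonical date plus the cadence at which it could repeat."""
    anchor: date
    cadence: str  # "weekly" or "yearly" (hypothetical labels)

    def on_or_after(self, ref):
        """Shift the anchor to its first occurrence on or after `ref`."""
        if self.cadence == "weekly":
            days = (self.anchor.weekday() - ref.weekday()) % 7
            return ref + timedelta(days=days)
        if self.cadence == "yearly":
            # Note: an anchor of Feb 29 would need extra handling here.
            candidate = self.anchor.replace(year=ref.year)
            if candidate < ref:
                candidate = candidate.replace(year=ref.year + 1)
            return candidate
        raise ValueError("unknown cadence: " + self.cadence)


ref = date(2016, 9, 20)
print(RecurringDate(date(2016, 9, 14), "weekly").on_or_after(ref))  # 2016-09-21
print(RecurringDate(date(2016, 9, 14), "yearly").on_or_after(ref))  # 2017-09-14
```

The parser would only report the anchor and cadence; the application picks the direction and reference date.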

danqing commented 8 years ago

As for 6th, I wonder if some level of confidence can be added. In a big NLP project, I'd imagine pdt working alongside other parsers (for addresses, names, etc.) and it's up to the main logic to choose the right one when there are conflicts. The nlp function already reports which part of the sentence matched, which is super useful - a simple check of 6th can determine whether it's a number or a date, depending on the scenario.

idpaterson commented 8 years ago

I am a big fan of the confidence/ambiguity idea but based on how much is already slated for version 3 it won't be added for quite some time unless someone else wants to take on a good challenge. In the meantime you could create a very rudimentary formula based on the number of characters in the string to derive your own confidence (it will certainly not be implemented this way in parsedatetime but it's a start). For example:

import re

number_pattern = r'\d'
separator_pattern = r'[:/-]'
word_pattern = r'\w'
flags = re.UNICODE

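def confidence(phrase):
    # Count the digits in the phrase.
    numbers = len(re.findall(number_pattern, phrase, flags))
    # \w also matches digits, so subtract them to count the remaining word characters.
    word_characters = len(re.findall(word_pattern, phrase, flags)) - numbers
    separators = len(re.findall(separator_pattern, phrase, flags))
    # Separators weigh the most, then word characters, then bare digits.
    return numbers + word_characters * 2 + separators * 5

>>> confidence('600')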
3
>>> confidence('6th')
5
>>> confidence('16th')
6
>>> confidence('6/30')
8
>>> confidence('6:15')
8
>>> confidence('june')
8
>>> confidence('sat')
6
>>> confidence('saturday')
16
>>> confidence('05/04/2016')
18

You could take anything with confidence >= 8 and have decent results. Not perfect.

The actual implementation will be at the locale level where it is known and can be explicitly identified that, for example, "sat" is more ambiguous than "thu" because "sat" is not only the abbreviation for Saturday but also the past tense of a common verb.
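
Until something like that exists, a caller could keep a small per-locale table of its own. The sketch below is an assumption about what such a table might look like, not parsedatetime's design; `AMBIGUOUS_ABBREVIATIONS` and `is_ambiguous` are hypothetical names, and the entries are the two examples from this thread.

```python
# Per-locale abbreviations that double as ordinary words (hypothetical table).
AMBIGUOUS_ABBREVIATIONS = {
    "en": {
        "sat": "past tense of 'sit'",
        "sun": "the star",
    },
}


def is_ambiguous(token, locale="en"):
    """Return True if the token is a known ambiguous abbreviation for the locale."""
    return token.lower() in AMBIGUOUS_ABBREVIATIONS.get(locale, {})


for token in ("sat", "Sun", "thu", "saturday"):
    print(token, is_ambiguous(token))
```

A check like this could be combined with the character-based score above, for example by rejecting a low-scoring match only when the token is also in the table.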

FraBle commented 8 years ago

Another missing pattern is "day after" and "day before", e.g. "day after tomorrow".

danqing commented 8 years ago

@idpaterson great formula - I can see it working well. However I think the problem here is the low confidence ones are not handled by pdt currently?