bear / parsedatetime

Parse human-readable date/time strings
Apache License 2.0

Mistakenly identify 3.2 in "test version 3.2 by tomorrow" as a date using nlp #150

Open phoebebright opened 8 years ago

phoebebright commented 8 years ago

I am using the Australian locale and, to be sure, defining the separators explicitly:

    c = pdt.Constants('en_AU')
    c.dateSep = ['-', '/']

But the parser still insists that 3.2 is a date. Is there any way to prevent this?

phoebebright commented 8 years ago

The relevant code is in __init__.py, line 2581, where . and - are added as separators. I have removed these for my purposes, as I expect the separators I have defined to be respected.

    dateSeps = ''.join(re.escape(s)
                       for s in self.locale.dateSep + ['-', '.'])
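
To illustrate why the appended separators cause the false positive, here is a minimal sketch. It assumes a simplified date expression; `build_loose_date_pattern` and its regex are hypothetical stand-ins, not the library's actual pattern. Only the way the separators are joined follows the snippet above.

```python
import re

def build_loose_date_pattern(locale_seps, extra_seps=("-", ".")):
    """Hypothetical, simplified stand-in for the library's date expression."""
    # Escape each separator and merge them into one character class,
    # mirroring the join quoted above.
    seps = "".join(re.escape(s) for s in list(locale_seps) + list(extra_seps))
    # Two or three numeric components joined by any of the separators.
    return re.compile(
        r"\b\d{1,4}[%s]\d{1,2}(?:[%s]\d{1,4})?\b" % (seps, seps)
    )

text = "test version 3.2 by tomorrow"

# With '.' appended unconditionally, the decimal number looks like a date.
with_extras = build_loose_date_pattern(["-", "/"]).search(text)

# With only the user-defined separators, nothing in the text matches.
without_extras = build_loose_date_pattern(["-", "/"], extra_seps=()).search(text)
```

In this sketch `with_extras` matches the substring 3.2, while `without_extras` is None, which is the behaviour I expected from defining the separators myself.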

idpaterson commented 8 years ago

I added the - to all locales a while back to support the standard yyyy-mm-dd format, which should be supported in all locales. However, that alternate separator should only be allowed for patterns that include all three components. I will work on fixing that pattern to avoid false positives.

Does anyone know why the . character is added here? It was there before I added the -, so I preserved it, but I cannot recall whether we discussed the origin of that separator. I suspect that it would have similar behavior, matching only yyyy.mm.dd and not x.x as in OP's example, but that does not look like a standard format.
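
As a rough sketch of the restriction described above, the - separator can be limited to full three-component yyyy-mm-dd dates while the locale separators still allow shorter forms. The function name, the regexes, and the single '/' locale separator are assumptions for illustration only, not the eventual fix in the library.

```python
import re

def build_strict_date_pattern(locale_seps):
    """Hypothetical pattern: '-' is accepted only in a complete yyyy-mm-dd date."""
    seps = "".join(re.escape(s) for s in locale_seps)
    # Locale separators may join two or three components (e.g. 3/2 or 3/2/2016).
    locale_date = r"\b\d{1,2}[%s]\d{1,2}(?:[%s]\d{2,4})?\b" % (seps, seps)
    # The alternate '-' separator requires all three components.
    iso_date = r"\b\d{4}-\d{2}-\d{2}\b"
    return re.compile("%s|%s" % (iso_date, locale_date))

# Assumed locale separator list for the example; a real locale may differ.
pattern = build_strict_date_pattern(["/"])

iso_match = pattern.search("release on 2016-03-02")
short_match = pattern.search("due 3/2")
decimal_match = pattern.search("test version 3.2 by tomorrow")
range_match = pattern.search("pages 3-2")
```

Under these assumptions the full date and the 3/2 form still match, while 3.2 and 3-2 do not, because neither . nor - is accepted between only two components.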

idpaterson commented 8 years ago

There are actually test cases for the Australian locale suggesting that it should match . as a date separator, even though that separator is never explicitly defined for the locale. Some countries do use that separator, but I would be very hesitant to include it by default given the high likelihood of false positive matches on decimal numbers.

I am going to proceed by fixing the overzealous expression and removing those Australian test cases. The . is actually included in the default base locale, dateSep = ['/', '.'], so even after this change most people will still have issues with decimals parsing as dates.