Broken support for short days / months with periods in them

bpotard commented 7 years ago

Hello,

It seems that short months / short days are not handled correctly generally speaking - pretty much none of the abbreviations in locales from ICU are being recognised because of this.

For example:

>>> ces = parsedatetime.Constants('es_ES')
>>> print ces.locale.shortMonths
[u'ene.', u'feb.', u'mar.', u'abr.', u'may.', u'jun.', u'jul.', u'ago.', u'sept.', u'oct.', u'nov.', u'dic.']
>>> cal = parsedatetime.Calendar(ces)
>>> cal.parseDT('3 ene. 2014')
(datetime.datetime(2017, 7, 26, 20, 14), 2)
>>> cal.parseDT('3 enero 2014')
(datetime.datetime(2014, 1, 3, 17, 14, 2), 1)

This can be traced to the way the matching regex are built: there is a word boundary imposed at the end of each matching group - which will normally match a period, but not the ". " transition , so I believe the final period in all abbreviations should really be removed from the regexes.

BTW, as a side effect you can create some evil crashes:

>>> cal.parseDT('3 ene.2014')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1771, in parseDT
    version=version)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1845, in parse
    retS, retTime, matched = parseMeth(s, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1483, in _partialParseDateStr
    sourceTime = self._evalDateStr(parseStr, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1113, in _evalDateStr
    return self.parseDateText(s, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 513, in parseDateText
    mth = m.group('mthname')
AttributeError: 'NoneType' object has no attribute 'group'

I came up with the following patch:

diff --git a/parsedatetime/__init__.py b/parsedatetime/__init__.py
index 990b9ca..0e3b41e 100644
--- a/parsedatetime/__init__.py
+++ b/parsedatetime/__init__.py
@@ -2393,6 +2393,9 @@ class Constants(object):

         if self.locale is not None:

+            def _sanitize_key(k):
+                return re.sub(".$","", k)
+
             def _getLocaleDataAdjusted(localeData):
                 """
                 If localeData is defined as ["mon|mnd", 'tu|tues'...] then this
@@ -2401,9 +2404,9 @@ class Constants(object):
                 adjusted = []
                 for d in localeData:
                     if '|' in d:
-                        adjusted += d.split("|")
+                        adjusted += map(_sanitize_key, d.split("|"))
                     else:
-                        adjusted.append(d)
+                        adjusted.append(_sanitize_key(d))
                 return adjusted

             def re_join(g):
@@ -2446,9 +2449,9 @@ class Constants(object):
                 for key in localeData:
                     if '|' in key:
                         for k in key.split('|'):
-                            offsetDict[k] = o
+                            offsetDict[_sanitize_key(k)] = o
                     else:
-                        offsetDict[key] = o
+                        offsetDict[_sanitize_key(key)] = o
                     o += 1

             _buildOffsets(self.locale.WeekdayOffsets,

After applying this patch you can happily do the following:

>>> cal.parseDT('3 ene. 2014')
(datetime.datetime(2014, 1, 3, 17, 49, 21), 1)
>>> cal.parseDT('3 ene 2014')
(datetime.datetime(2014, 1, 3, 17, 50, 9), 1)
>>> cal.parseDT('3 ene.2014')
(datetime.datetime(2018, 1, 3, 17, 50, 41), 1)

I believe this issue should be addressed as this would make ICU supported locales work much more reliably.

idpaterson commented 7 years ago

This fix looks good to me, it could also be addressed in the icu locale.

TL;DR for anyone seeing this issue: ICU uses punctuated month abbreviations (e.g. aug.) which wreak havoc on date parsing. The problem does not affect locales built in to parsedatetime because the abbreviations are not punctuated (e.g. aug).

bpotard commented 7 years ago

Thanks for the quick response!

You are right, I did not have a look at how the icu locale were built, and it makes more sense to sanitize the abbreviated days / months there rather than in __init__.py

Also, I have just noticed that my sanitization function was completely broken anyway, as it was actually removing the last character from everything not just the period, and did not handle keys with "|" separators. Long story short, it was breaking up pretty much everything...

Here is a new attempt:

diff --git a/parsedatetime/pdt_locales/icu.py b/parsedatetime/pdt_locales/icu.py
index 8bee64b..4479f6b 100644
--- a/parsedatetime/pdt_locales/icu.py
+++ b/parsedatetime/pdt_locales/icu.py
@@ -35,6 +35,11 @@ def merge_weekdays(base_wd, icu_wd):

 def get_icu(locale):
+
+    def _sanitize_key(k):
+        import re
+        return re.sub("\\.(\\||$)", "\\1", k)
+
     from . import base
     result = dict([(key, getattr(base, key))
                    for key in dir(base) if not key.startswith('_')])
@@ -58,16 +63,16 @@ def get_icu(locale):

     # grab ICU list of weekdays, skipping first entry which
     # is always blank
-    wd = [w.lower() for w in symbols.getWeekdays()[1:]]
-    swd = [sw.lower() for sw in symbols.getShortWeekdays()[1:]]
+    wd = [_sanitize_key(w.lower()) for w in symbols.getWeekdays()[1:]]
+    swd = [_sanitize_key(sw.lower()) for sw in symbols.getShortWeekdays()[1:]]

     # store them in our list with Monday first (ICU puts Sunday first)
     result['Weekdays'] = merge_weekdays(result['Weekdays'],
                                         wd[1:] + wd[0:1])
     result['shortWeekdays'] = merge_weekdays(result['shortWeekdays'],
                                              swd[1:] + swd[0:1])
-    result['Months'] = [m.lower() for m in symbols.getMonths()]
-    result['shortMonths'] = [sm.lower() for sm in symbols.getShortMonths()]
+    result['Months'] = [_sanitize_key(m.lower()) for m in symbols.getMonths()]
+    result['shortMonths'] = [_sanitize_key(sm.lower()) for sm in symbols.getShortMonths()]
     keys = ['full', 'long', 'medium', 'short']

     createDateInstance = pyicu.DateFormat.createDateInstance

idpaterson commented 7 years ago

Thanks for updating, I think there are a few more things to look at here. I reviewed some of the locales defined in ICU and it seems that most abbreviations are found in day names, month names, eras (e.g. A.D.), and meridian (e.g. P.M.).

A similar problem could exist for meridian, which for the es locale is defined as a. m. and p. m.. I won't be able to review this for a few days, would you please test whether the meridian needs to be fixed as well? At a glance it doesn't look like the meridian from icu makes it into any regular expressions.

bear commented 5 years ago

I merged the PR as the meridian does not get bundled into any regular expressions

bear / parsedatetime

Broken support for short days / months with periods in them #221