bear / parsedatetime

Parse human-readable date/time strings
Apache License 2.0

Unicode characters can act as word boundaries without the re.UNICODE flag #197

Open idpaterson opened 8 years ago

idpaterson commented 8 years ago

I received a bug report today from a user whose German phrase, which included the word für, matched the regex \br\b. The ü was treated as a non-word character, and so created a word boundary, because I was not using the re.UNICODE flag.

A quick test showed that the same problem affects parsedatetime. For example, the made-up phrase fünacht is parsed as a date based on the word nacht, even though nacht is only a suffix of a longer word.

>>> cal.parse(u'fünacht', version=2)
(time.struct_time(tm_year=2016, tm_mon=9, tm_mday=13, tm_hour=21, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=257, tm_isdst=1), pdtContext(accuracy=pdtContext.ACU_HALFDAY))
>>> re.search(r'\bnacht', u'fünacht')
<_sre.SRE_Match object at 0x10165b168>
>>> re.search(r'\bnacht', u'fünacht', re.UNICODE)
>>> 
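
The session above is from Python 2, where \b uses ASCII rules unless re.UNICODE is passed. As a rough sketch for anyone reproducing this on Python 3 (an assumption on my part, since the report itself was made on Python 2): str patterns are Unicode-aware by default there, so the old behaviour has to be requested explicitly with re.ASCII.

```python
# -*- coding: utf-8 -*-
import re

# Python 3 only: re.ASCII restricts \w (and therefore \b) to [a-zA-Z0-9_],
# which mimics the Python 2 default described in this issue.
ascii_fuer = re.search(r'\br\b', u'für', re.ASCII)
ascii_nacht = re.search(r'\bnacht', u'fünacht', re.ASCII)

# With Unicode rules, ü is a word character, so no boundary exists
# between ü and the following letter.
unicode_fuer = re.search(r'\br\b', u'für', re.UNICODE)
unicode_nacht = re.search(r'\bnacht', u'fünacht', re.UNICODE)

print(bool(ascii_fuer), bool(ascii_nacht))      # True True
print(bool(unicode_fuer), bool(unicode_nacht))  # False False
```

On Python 3 the re.UNICODE flag is redundant for str patterns, but passing it keeps the same source working on both major versions.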

This has implications for any language that uses accented characters or a non-Latin character set, especially if the locale includes any phrases that are common suffixes of other words.
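
To illustrate both cases, here is a minimal sketch that scans text for locale phrases with and without Unicode-aware boundaries. The PHRASES list and find_phrases helper are hypothetical and are not parsedatetime's actual locale data or API. It is written for Python 3, using re.ASCII to stand in for the Python 2 default.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical phrase list for illustration only.
PHRASES = [u'nacht', u'ночь']

def find_phrases(text, flags):
    """Return the phrases that match as whole words under the given flags."""
    hits = []
    for phrase in PHRASES:
        pattern = r'\b' + re.escape(phrase) + r'\b'
        if re.search(pattern, text, flags):
            hits.append(phrase)
    return hits

# Accented character: ASCII rules produce a false positive on the suffix.
suffix_ascii = find_phrases(u'fünacht', re.ASCII)
suffix_unicode = find_phrases(u'fünacht', re.UNICODE)

# Non-Latin script: under ASCII rules no Cyrillic letter is a word
# character, so \b can never match next to one and the phrase is missed.
cyrillic_ascii = find_phrases(u'ночь', re.ASCII)
cyrillic_unicode = find_phrases(u'ночь', re.UNICODE)

print(suffix_ascii, suffix_unicode)
print(cyrillic_ascii, cyrillic_unicode)
```

So the missing flag can cut both ways: spurious matches for accented Latin text, and missed matches for scripts with no ASCII letters at all.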