Fixes false meridians (four-letter a*m* or p*m* words like "3 puma")

idpaterson commented 8 years ago

The meridian regex in the base locale used unescaped periods to match a.m. and p.m. which caused any four-character phrase matching that pattern to be interpreted as a meridian. I added tests to make sure that 3 axpx and 3 pxmx are not interpreted as times.

Note that most non-word characters like 3 p-x. or 3 p m will still match because the meridian regex also matches the single characters p and a followed by any word boundary. With nlp you can easily see that both of those cases match only the 3 p portion.

I bumped into this problem while working on the nlp whitespace handling. It is a nice bite-sized fix, should be fine for inclusion in 2.2.

This change is

bear commented 8 years ago

Reviewed 2 of 2 files at r1. Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.

Comments from Reviewable

idpaterson commented 8 years ago

I did not notice any characters in other locale regexes that would cause this kind of problem, but it is admittedly a bit difficult to distinguish in some cases what is used as a regex versus a regular string. It would be helpful for 3.0 to document how the strings are used. For example, Modifiers is used both as a regex and as a string so if you specified 'foo|bar': -1 it would match either foo or bar, but then when used as a string lookup, neither foo or bar are keys in the Modifiers dict. It may also be important to avoid adding groups. In the French locale (#185) I noticed a lot of (é|e) which should either be (?:é|e), [ée], or split into entirely separate strings with no regex alternation if the value is used for plain string matching.

In most cases it's not terribly difficult to figure these out but better documentation would reduce the barrier to adding locales.

codecov-io commented 8 years ago

Current coverage is 77.13% (diff: 100%)

Merging #192 into master will not change coverage

@@             master       #192   diff @@
==========================================
  Files            14         14          
  Lines          1570       1570          
  Methods           0          0          
  Messages          0          0          
  Branches        288        288          
==========================================
  Hits           1211       1211          
  Misses          266        266          
  Partials         93         93

Powered by Codecov. Last update e964632...80a2ac3

bear / parsedatetime