Closed idpaterson closed 8 years ago
Reviewed 2 of 2 files at r1. Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.
Comments from Reviewable
I did not notice any characters in other locale regexes that would cause this kind of problem, but it is admittedly a bit difficult to distinguish in some cases what is used as a regex versus a regular string. It would be helpful for 3.0 to document how the strings are used. For example, Modifiers
is used both as a regex and as a string so if you specified 'foo|bar': -1
it would match either foo or bar, but then when used as a string lookup, neither foo or bar are keys in the Modifiers
dict. It may also be important to avoid adding groups. In the French locale (#185) I noticed a lot of (é|e)
which should either be (?:é|e)
, [ée]
, or split into entirely separate strings with no regex alternation if the value is used for plain string matching.
In most cases it's not terribly difficult to figure these out but better documentation would reduce the barrier to adding locales.
@@ master #192 diff @@
==========================================
Files 14 14
Lines 1570 1570
Methods 0 0
Messages 0 0
Branches 288 288
==========================================
Hits 1211 1211
Misses 266 266
Partials 93 93
Powered by Codecov. Last update e964632...80a2ac3
The meridian regex in the base locale used unescaped periods to match a.m. and p.m. which caused any four-character phrase matching that pattern to be interpreted as a meridian. I added tests to make sure that 3 axpx and 3 pxmx are not interpreted as times.
Note that most non-word characters like 3 p-x. or 3 p m will still match because the meridian regex also matches the single characters
p
anda
followed by any word boundary. Withnlp
you can easily see that both of those cases match only the 3 p portion.I bumped into this problem while working on the nlp whitespace handling. It is a nice bite-sized fix, should be fine for inclusion in 2.2.
This change is