Closed idpaterson closed 8 years ago
@@            v3.0    #195    diff @@
==========================================
  Files         14      14
  Lines       1570    1565      -5
  Methods        0       0
  Messages       0       0
  Branches     288     288
==========================================
+ Hits        1211    1221     +10
+ Misses       266     252     -14
+ Partials      93      92      -1
Powered by Codecov. Last update 39405f7...02f35aa
Reviewed 3 of 3 files at r1. Review status: all files reviewed at latest revision, all discussions resolved.
Merged to v3.0 in order to fix the nlp tests that I'm porting over.
As discussed in #181 it should be possible to remove the pre-processing in nlp and parse that strips periods followed by spaces and quotes followed or preceded by spaces, since the regular expressions now use word boundaries. There are several things that I want to discuss in this pull request.
I am not yet fully convinced that removing the pre-processing is necessary. It caused minor problems with meridian and month abbreviations and took a bit more regex cleanup to fix than I anticipated. There is a discussion below on testing; I think that more comprehensive testing will be necessary before merging this to v3.
The basic problems involved the phrase not quite matching what you would expect:
After these changes the results are improved:
Meridian
In order to support the A.M. format with periods I had to move the word boundary from the parsing regex into the locale, which is very messy:
'meridian': r'a\.m\.|p\.m\.|(?:am|pm|a|p)\b',
Otherwise, a\.m\.\b won't match, because a period followed by a space or by the end of the string does not form a word boundary. Some of our discussions about how locales are specified might lead to a better solution.
Test format
I started a discussion in #196 about improvements to testing. This pull request included some very minor improvements to tests when I ported over tests from parse to nlp.
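For anyone who wants to reproduce the word-boundary issue from the Meridian section, here is a minimal sketch using only Python's re module. MERIDIAN is copied from the locale entry quoted above. OLD_STYLE and find_meridian are hypothetical names made up for this illustration; they assume the parsing regex used to append \b after the whole alternation, which may not be exactly how the project's code was structured.

```python
import re

# Copied from the locale entry quoted in the Meridian section.
MERIDIAN = r'a\.m\.|p\.m\.|(?:am|pm|a|p)\b'

# Hypothetical stand-in for the earlier approach: the trailing word
# boundary is applied to every alternative, including the dotted forms.
OLD_STYLE = r'(?:a\.m\.|p\.m\.|am|pm|a|p)\b'


def find_meridian(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(0) if match else None


if __name__ == '__main__':
    # The dotted form followed by \b finds nothing: '.' and ' ' are both
    # non-word characters, so there is no boundary between them.
    print(re.search(r'a\.m\.\b', '5 a.m. today'))

    # With the boundary on every alternative, the dotted form is skipped
    # and only the bare letter before the first period is picked up.
    print(find_meridian(OLD_STYLE, '5 p.m. today'))

    # With the boundary limited to the undotted forms, both styles work.
    for phrase in ('5 p.m. today', '5 A.M.', '5pm', '5 am tomorrow'):
        print(phrase, '->', find_meridian(MERIDIAN, phrase))
```

Note that the pattern has no leading boundary, so this sketch only uses phrases where the meridian follows a number or a space.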