Avoid discarding periods and quotes in nlp and parse

idpaterson commented 8 years ago

As discussed in #181 it should be possible to remove the pre-processing in nlp and parse to remove periods followed by spaces and quotes followed or preceded by spaces since the regular expressions now use word boundaries. There are several things that I want to discuss in this pull request.

I am not yet fully convinced that removing the pre-processing is necessary. It caused minor problems with meridian and month abbreviations and took a bit more regex cleanup to fix than I anticipated. There is a discussion below on testing, I think that more comprehensive testing will be necessary before merging this to v3.

The basic problems involved the phrase not quite matching what you would expect:

>>> cal.nlp('5 A.M.')
((datetime.datetime(2016, 9, 10, 5, 0), 2, 0, 3, '5 A'),)
>>> cal.nlp('Aug.')
((datetime.datetime(2017, 8, 1, 9, 41, 35), 1, 0, 3, 'Aug'),)
>>> cal.nlp('"august"')
((datetime.datetime(2017, 8, 1, 9, 44, 13), 1, 0, 7, '"august'),)

After these changes the results are improved:

>>> cal.nlp('5 A.M.')
((datetime.datetime(2016, 9, 10, 5, 0), 2, 0, 6, '5 A.M.'),)
>>> cal.nlp('Aug.')
((datetime.datetime(2017, 8, 1, 9, 41, 35), 1, 0, 4, 'Aug.'),)
>>> cal.nlp('"august"')
((datetime.datetime(2017, 8, 1, 9, 44, 13), 1, 1, 7, 'august'),)

Meridian

In order to support A.M. format with periods I had to move the word boundary from the parsing regex into the locale which is very messy: 'meridian': r'a\.m\.|p\.m\.|(?:am|pm|a|p)\b',. Otherwise, a\.m\.\b won't match because there is no word boundary on that side of the period. Some of our discussions about how locales are specified might lead to a better solution.

Test format

I started a discussion in #196 about improvements to testing. This pull request included some very minor improvements to tests when I ported over tests from parse to nlp.

This change is

codecov-io commented 8 years ago

Current coverage is 78.01% (diff: 100%)

Merging #195 into v3.0 will increase coverage by 0.88%

@@               v3.0       #195   diff @@
==========================================
  Files            14         14          
  Lines          1570       1565     -5   
  Methods           0          0          
  Messages          0          0          
  Branches        288        288          
==========================================
+ Hits           1211       1221    +10   
+ Misses          266        252    -14   
+ Partials         93         92     -1

Powered by Codecov. Last update 39405f7...02f35aa

bear commented 8 years ago

Reviewed 3 of 3 files at r1. Review status: all files reviewed at latest revision, all discussions resolved.

Comments from Reviewable

idpaterson commented 8 years ago

Merged to v3.0 in order to fix the nlp tests that I'm porting over.

bear / parsedatetime