bear / parsedatetime

Parse human-readable date/time strings
Apache License 2.0
695 stars 106 forks source link

Revised testing architecture with yaml data #199

Open idpaterson opened 8 years ago

idpaterson commented 8 years ago

As originally discussed in #196, this pull request begins the conversion from unittest2 to py.test. py.test was already used to run the tests, but as an alternative to unittest2 it provides an opportunity for intriguing code-level instrumentation to make tests easier to write and more comprehensive.

The following tests have been ported so far:

Additional tests needed for the test utilities:

This pull request can be merged once tests have been rewritten to cover all of the current test cases. At that time there will still be a lot of work to do to improve and add tests especially in non-English locales; merging to v3 will allow pull requests for community contributions.

There will be no changes to the functionality of parsedatetime in this pull request other than instrumentation necessary for testing. For example, a "wildcard" flag was added to pdtContext to support tests that do not specify an explicit context.

Additional testing support

Testing against "today" and edge case dates

Most tests in pdt 2 were based on the current date as sourceTime rather than a predetermined time. That can be good and bad – I think it made the tests more confusing to reason about but it also helped to catch edge cases related to leap years, end of year, end of month, etc. I'm looking at you, #155! 99% of the time, "today" is just a normal time on a normal day and there is no guarantee that CI would happen to run on Feb 29. If it does and an error comes up, it's already too late to push out a fix.

For tests where the sourceTime is truly arbitrary (absolute dates and anything that can be expressed with deltas come to mind) it would be interesting to parametrize the function with a few edge case sourceTimes. Run it against Feb 29, Dec 31, Jan 1, today, etc so that those cases are caught intentionally rather than accidentally.

Testing NLP may require a list of targets

Some changes will need to be made to allow multiple target dates in test data.

Notes

This is a collaborative pull request

Any parsedatetime maintainers can commit on this PR and everyone is welcome to the discussion. I am also going to continue working on the test cases.

The old tests have been deleted from this branch

I deleted the old tests to avoid further housekeeping later; anyone who would graciously contribute to this pull request will need to use an alternate copy of the old tests for comparison.

The new test names and file structure does not necessarily match the old tests

Some previously separate test classes will be combined. For example, the cases for TestMultiple and TestDelta are both in deltas.yaml because the "multiples" were just multi-unit deltas. I used comments to separate them into sections within the data file.

Documentation

I will start a new issue for this but since the first bit of code is now public I wanted to express my preference to move away from epydoc. I have always found the HTML documentation very difficult to use... the content is well-written but the format is cumbersome and outdated. Mike, are you open to hosting documentation on ReadTheDocs or as a GitHub page via the gh_pages branch so that updates can go out automatically?

The test modules in this PR are documented for sphinx with Google style docstrings. The ReadTheDocs theme for sphinx makes for (in my opinion) a better presentation and a simpler, less syntactically-verbose code style. Some of the documentation that I have currently written in the Python docstrings would be better managed in separate rst files, but for now they should be helpful to anyone contributing to the tests. The implementation of that will arrive in a different pull request.


This change is Reviewable

codecov-io commented 8 years ago

Codecov Report

Merging #199 into v3.0 will decrease coverage by -50.26%. The diff coverage is 70%.

@@             Coverage Diff             @@
##             v3.0     #199       +/-   ##
===========================================
- Coverage   78.04%   27.79%   -50.26%     
===========================================
  Files          14       14               
  Lines        1567     1576        +9     
  Branches      288      291        +3     
===========================================
- Hits         1223      438      -785     
- Misses        252     1120      +868     
+ Partials       92       18       -74
Impacted Files Coverage Δ
parsedatetime/context.py 56% <70%> (-25.82%) :x:
parsedatetime/pdt_locales/icu.py 10.58% <0%> (-72.95%) :x:
parsedatetime/init.py 18.26% <0%> (-56.7%) :x:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5b4fd8e...4283265. Read the comment docs.

bear commented 8 years ago

Do you want me to review the changes as you make them or wait until you have the majority of them made?

idpaterson commented 8 years ago

That's up to you, I'm going to work on the cases that require additional support in the pytest setup first then port the remaining normal cases.

idpaterson commented 8 years ago

How does this look for an nlp test?

long_phrases:
    sourceTime: 2013-08-01 21:25:00
    cases:
        - target: !nlpTarget
            - phrase: "At 8PM on August 5th"
              target: 2013-08-05 20:00:00
              context: !pdtContext month | day | hour
            - phrase: "next Friday at 9PM"
              target: 2013-08-09 21:00:00
              context: !pdtContext day | hour
            - phrase: "in 5 minutes"
              target: 2013-08-01 21:30:00
              context: !pdtContext minute
            - phrase: "next week"
              target: 2013-08-08 09:00:00
              context: !pdtContext week
          phrases:
            - >
                I'm so excited!! At 8PM on August 5th i'm going to fly to 
                Florida. Then next Friday at 9PM i'm going to Dog n Bone! 
                And in 5 minutes I'm going to eat some food! Talk to you 
                next week.
@pdtFixture('multiple_dates.yml')
def test_long_phrases(cal, phrase, sourceTime, nlpTarget):
    assert cal.nlp(phrase, sourceTime) == nlpTarget

The target normalization adds the startIndex and endIndex for each phrase automatically based on substrings, so unless startIndex and endIndex are specified in the data file we just have to avoid having a string appear twice in the text.

Sample failures

Wrong context

cal = <parsedatetime.Calendar object at 0x109962790>
phrase = "I'm so excited!! At 8PM on August 5th i'm going to fly to  Florida. Then next Friday at 9PM i'm going to Dog n Bone!  And in 5 minutes I'm going to eat some food! Talk to you  next week."
sourceTime = datetime.datetime(2013, 8, 1, 21, 25)
nlpTarget = ((datetime.datetime(2013, 8, 5, 20, 0), pdtContext(accuracy=pdtContext.ACU_MONTH | pdtContext.ACU_DAY), 17, 37, 'At 8P...in 5 minutes'), (datetime.datetime(2013, 8, 8, 9, 0), pdtContext(accuracy=pdtContext.ACU_WEEK), 176, 185, 'next week'))

    @pdtFixture('multiple_dates.yml')
    def test_long_phrases(cal, phrase, sourceTime, nlpTarget):
>       assert cal.nlp(phrase, sourceTime) == nlpTarget
E       assert ((datetime.da... 'next week')) == ((datetime.dat... 'next week'))
E         At index 0 diff: (datetime.datetime(2013, 8, 5, 20, 0), pdtContext(accuracy=pdtContext.ACU_MONTH | pdtContext.ACU_DAY | pdtContext.ACU_HOUR), 17, 37, 'At 8PM on August 5th') != (datetime.datetime(2013, 8, 5, 20, 0), pdtContext(accuracy=pdtContext.ACU_MONTH | pdtContext.ACU_DAY), 17, 37, 'At 8PM on August 5th')
E         Full diff:
E         ((datetime.datetime(2013, 8, 5, 20, 0),
E         -   pdtContext(accuracy=pdtContext.ACU_MONTH | pdtContext.ACU_DAY | pdtContext.ACU_HOUR),
E         ?                                                                ----------------------
E         +   pdtContext(accuracy=pdtContext.ACU_MONTH | pdtContext.ACU_DAY),
E         17,
E         37,
E         'At 8PM on August 5th'),
E         Detailed information truncated (15 more lines), use "-vv" to show

Missing date in test target

cal = <parsedatetime.Calendar object at 0x1109b3790>
phrase = "I'm so excited!! At 8PM on August 5th i'm going to fly to  Florida. Then next Friday at 9PM i'm going to Dog n Bone!  And in 5 minutes I'm going to eat some food! Talk to you  next week."
sourceTime = datetime.datetime(2013, 8, 1, 21, 25)
nlpTarget = ((datetime.datetime(2013, 8, 5, 20, 0), pdtContext(accuracy=pdtContext.ACU_MONTH | pdtContext.ACU_DAY | pdtContext.ACU...riday at 9PM'), (datetime.datetime(2013, 8, 8, 9, 0), pdtContext(accuracy=pdtContext.ACU_WEEK), 176, 185, 'next week'))

    @pdtFixture('multiple_dates.yml')
    def test_long_phrases(cal, phrase, sourceTime, nlpTarget):
>       assert cal.nlp(phrase, sourceTime) == nlpTarget
E       assert ((datetime.da... 'next week')) == ((datetime.dat... 'next week'))
E         At index 2 diff: (datetime.datetime(2013, 8, 1, 21, 30), pdtContext(accuracy=pdtContext.ACU_MIN), 122, 134, 'in 5 minutes') != (datetime.datetime(2013, 8, 8, 9, 0), pdtContext(accuracy=pdtContext.ACU_WEEK), 176, 185, 'next week')
E         Left contains more items, first extra item: (datetime.datetime(2013, 8, 8, 9, 0), pdtContext(accuracy=pdtContext.ACU_WEEK), 176, 185, 'next week')
E         Full diff:
E         ((datetime.datetime(2013, 8, 5, 20, 0),
E         pdtContext(accuracy=pdtContext.ACU_MONTH | pdtContext.ACU_DAY | pdtContext.ACU_HOUR),
E         17,
E         37,
E         'At 8PM on August 5th'),
E         (datetime.datetime(2013, 8, 9, 21, 0),
E         pdtContext(accuracy=pdtContext.ACU_DAY | pdtContext.ACU_HOUR),
E         73,
E         91,
E         'next Friday at 9PM'),
E         -  (datetime.datetime(2013, 8, 1, 21, 30),
E         -   pdtContext(accuracy=pdtContext.ACU_MIN),
E         -   122,
E         -   134,
E         -   'in 5 minutes'),
E         (datetime.datetime(2013, 8, 8, 9, 0),
E         pdtContext(accuracy=pdtContext.ACU_WEEK),
E         176,
E         185,
E         'next week'))
bear commented 8 years ago

I like it - it reads as easily as the others and doesn't require any repeated items. The "avoid having a string appear twice" constraint seems reasonable for a test suite

idpaterson commented 8 years ago

Yeah and it's only a constraint of convenience. There will need to be a test that includes multiples of the same string because that is an important thing to test, but that test will just need to specify startIndex and endIndex in the YAML.

idpaterson commented 8 years ago

nlp tests are now implemented. Some of the existing nlp tests were specific to nlp (e.g. multiple phrases in a string) and some were generic date and times.

A new tests.data.nlpTarget class wraps data from the test and supports equality testing with nlp responses. It provides a consistent representation for tests with both normal and nlp targets (i.e. target: 2016-01-01 00:00:00 vs target: !nlpTarget ...) and allows some flexibility for the python to modify the test data. For example, the following test updates the sourcePhrase after wrapping the original phrase in quotes:

@pytest.mark.parametrize('prefix,suffix', (('"', '"'), ("'", "'"), ('(', ')')))
@pdtFixture('simple_datetimes.yml', ['times', 'invalid_times', 'dates',
                                     'invalid_dates'])
def test_simple_datetimes_wrapped(cal, phrase, sourceTime, nlpTarget, prefix,
                                  suffix):
    sourcePhrase = u'%s%s%s' % (prefix, phrase, suffix)
    nlpTarget.sourcePhrase = sourcePhrase
    assert cal.nlp(sourcePhrase, sourceTime) == nlpTarget

This allows the nlpTarget to calculate the proper start and end index for the phrase.

The test data can also specify startIndex but the documentation warns that this should only be done when required due to the string appearing more than once:

- target: !nlpTarget
    - phrase: "today"
      context: !pdtContext day
      startIndex: 5
      target: 2013-08-01 09:00:00
    - phrase: "today"
      context: !pdtContext day
      startIndex: 26
      target: 2013-08-01 09:00:00
  phrases: 
    - "Yep, today was as good as today could be"

If everything specified a startIndex it would be more of a pain to write a phrase modifying test like test_simple_datetimes_wrapped above since the startIndex would not be calculated and therefore would not match when the phrase is modified.

I ran into one case that failed (not in the original nlp tests, I pulled in some parse test cases). Since this pull request is strictly not going to fix anything in code I noted the failure with a FIXME:

def test_deltas(cal, phrase, sourceTime, nlpTarget):
    # FIXME: these tests fail
    if phrase in ('1855336.424 minutes ago',):
        return
    assert cal.nlp(phrase, sourceTime) == nlpTarget

All of these test utilities will need unit tests as well so I added them to the list at the top of this PR.

I toyed around with a !replace constructor that would call datetime.replace() on the source time but couldn't find a compelling enough reason to keep it, preferring absolute datetimes for clarity. It made the test data too abstract and I wasn't able to use it together with a delta. For example, "4pm 2 days from now" is a combination of a replacement (current time to 16:00:00) and a delta (plus 2 days).

idpaterson commented 8 years ago

I need to apologize for the lack of updates recently. A small paid project has taken away the time that I was using to work on parsedatetime. The following is not yet complete enough to commit.

Most recently I have reorganized the test classes to more easily add support for deriving times relative to source times and testing anything that does not require an explicit start date against multiple edge case dates. In the last comment I mentioned an attempt to represent a replacement of date components, I ended up with a succinct syntax for that:

day_suffixes:
  cases:
    - target: !replace 2008-08-22 xx:xx:xx
      phrases:
        - "August 22nd, 2008"

This test group has no sourceTime which means that it will be parametrized against each of the edge case dates (Feb 29 leap year, Feb 28 non leap year, end of year pre-1970). I like with this syntax how the replacement makes it clear that the time component will match that of the source time. I was trying to use obvious source times in the original tests like 01:02:03 to show that (because you could easily miss it if all the times are something like 12:00:00), but this is clearer.

There are some slightly weird consequences of testing against multiple dates when you're dealing with partial dates, for example:

dates:
  sourceTime: !replace xxxx-01-xx xx:xx:xx
  cases:
    - target: !replace xxxx-08-25 xx:xx:xx
      context: !pdtContext month | day
      phrases:
        - "8/25"

In this case if the edge case source times were to fall after August 25 it would map to the next year, so the replacement is used to pin the source time to January. The alternate behavior could then be tested with a source time in October where the date will map to the following year:

dates_next_year:
  sourceTime: !replace xxxx-10-xx xx:xx:xx
  cases:
    - target: !datedelta
        sourceTime: !replace xxxx-08-25 xx:xx:xx
        years: 1
      context: !pdtContext month | day
      phrases:
        - "8/25"

Still plugging along albeit quite slowly now.

bear commented 8 years ago

@idpaterson firstly no one working on an OSS project ever needs to apologize for doing paid work first - that's all part of what we do :)

Second, even if it wasn't paid, never worry about taking time to do something - we all work on this code because we love it and it solves an itch, but real life sometimes intrudes.

thanks for the update and for keeping the project moving forward!

idpaterson commented 7 years ago

TestSimpleOffsets was the first suite where I encountered tests that set flags on Constants. I added the ability to specify those in YAML as we discussed earlier. This just uses setattr to set any attribute to whatever value is specified.

In this example, the second test group uses a different option. I think the markup for the targets makes the difference in behavior between the two groups pretty clear.

offset_from_day_of_week:
    sourceTime: *tuesdays
    cases:
        - target: !datedelta
            days: 2
          context: !pdtContext day
          phrases:
            - "Thursday"
        - target: !datedelta
            # TODO: Handling this as Thursday at 9 plus 1 hour seems
            # unexpected.
            sourceTime: !replace xxxx-xx-xx 09:00:00
            days: 2
            hours: 1
          context: !pdtContext day | hour
          phrases:
            - "one hour from Thursday"
        - target: !datedelta
            # TODO: Handling this as Thursday at 9 minus 1 hour seems
            # unexpected.
            sourceTime: !replace xxxx-xx-xx 09:00:00
            days: 2
            hours: -1
          context: !pdtContext day | hour
          phrases:
            - "one hour before Thursday"

offset_from_day_of_week_matching_source_time:
    sourceTime: *tuesdays
    options:
        StartTimeFromSourceTime: True
    cases:
        - target: !datedelta
            # TODO: Handling this as Thursday at the current time plus 1 hour
            # seems unexpected.
            days: 2
            hours: 1
          context: !pdtContext day | hour
          phrases:
            - "one hour from Thursday"
        - target: !datedelta
            # TODO: Handling this is Thursday at the current time minus 1 hour
            # seems unexpected.
            days: 2
            hours: -1
          context: !pdtContext day | hour
          phrases:
            - "one hour before Thursday"

I'm also leaving comments in the tests for behavior that seems unexpected and that may require discussion.

idpaterson commented 7 years ago

TestRanges introduced a new format I had not yet considered, a target containing multiple dates. It was trivial to call resolveTarget recursively on lists to handle this case:

dates:
    cases:
        - target:
            - !replace 2006-08-29 xx:xx:xx
            - !replace 2006-09-02 xx:xx:xx
          context: 1
          phrases:
            - "August 29, 2006 - September 2, 2006"
            - "August 29 - September 2, 2006"
            - "08/29/06 - 09/02/06"

Since this is not a very common thing I decided not to add startTarget and endTarget fixtures. Instead, the tests just destructure the target list:

@pdtFixture('ranges.yml')
def test_times(calendar, phrase, sourceTime, target, context):
    (startTarget, endTarget) = target
    assert (calendar.evalRanges(phrase, sourceTime) ==
            (startTarget.timetuple(), endTarget.timetuple(), context))

This is very similar to how the !nlpTarget markup looks, so technically I could add a rangeTarget class but I don't think it is necessary since the values that would be used here do not require any special logic.

idpaterson commented 7 years ago

Another bug was exposed by the new tests today where the old tests just happened to be written in a way that worked. For these particular phrases which are not valid dates, parse returns the current datetime regardless of the sourceTime provided.

calendar = <parsedatetime.Calendar object at 0x1078b6150>, phrase = '18/35', sourceTime = datetime.datetime(1945, 12, 31, 3, 4, 5), context = pdtContext()

    @pdtFixture('errors.yml')
    def test_out_of_range(calendar, phrase, sourceTime, context):
>       assert calendar.parse(phrase, sourceTime) == (sourceTime.timetuple(),
                                                      context)
E       assert (time.struct_... pdtContext()) == (time.struct_t... pdtContext())
E         At index 0 diff: time.struct_time(tm_year=2017, tm_mon=2, tm_mday=27, tm_hour=8, tm_min=42, tm_sec=47, tm_wday=0, tm_yday=58, tm_isdst=0) != time.struct_time(tm_year=1945, tm_mon=12, tm_mday=31, tm_hour=3, tm_min=4, tm_sec=5, tm_wday=0, tm_yday=365, tm_isdst=-1)
E         Full diff:
E         - (time.struct_time(tm_year=2017, tm_mon=2, tm_mday=27, tm_hour=8, tm_min=42, tm_sec=47, tm_wday=0, tm_yday=58, tm_isdst=0),
E         ?                           -- ^                    ^^          ^          -         ^^                      -           ^
E         + (time.struct_time(tm_year=1945, tm_mon=12, tm_mday=31, tm_hour=3, tm_min=4, tm_sec=5, tm_wday=0, tm_yday=365, tm_isdst=-1),
E         ?                            ^^^         +           ^^          ^                   ^                     ++            ^^
E         pdtContext())
E         Detailed information truncated (-2 more lines), use "-vv" to show

tests/test_errors.py:6: AssertionError
idpaterson commented 7 years ago

@bear, I noticed that the CI and coverage have not been running on recent commits for this pull request. Would you please check whether those are still operational, maybe I hit a limit on commits for this PR?

idpaterson commented 7 years ago

All of the original base/English language test cases are now implemented in the new format. There is still some work to do before this can be merged, so I would like to aim for the following plan:

  1. Double check that all old test cases are adequately covered
  2. Add comments in test cases to aid translation and highlight nuances (e.g. identify that a word was chosen because it contains the abbreviation of a month to avoid someone translating the word directly rather than selecting a different word containing a month in the target language)
  3. Add more test cases where appropriate since it is now easier to add more phrases and to see what is being tested
  4. Discuss and finalize the test data structure and write meta tests for the test utilities
  5. Merge this PR into v3.0
  6. Pull request for the new sphinx documentation introduced in the new test instrumentation and update existing docstrings
  7. Create issues for bugs that surfaced while working on this, most of which are marked with TODO or FIXME in the test data (there are about 7 or 8 things to fix).
  8. Create issues to ask for community involvement in updating each locale's test cases, link to docs for translation instructions

I don't foresee any new features making it into 3.0 unless we get pull requests for them since there are already a number of bugs to fix and based on past experience trying to improve the German locale I suspect that more issues will pop up while working through the locales.

idpaterson commented 7 years ago

The test library now has 100% test coverage though I could not at least at this time justify creating separate unit tests for every aspect of the test lib. Before the additional tests, the parsedatetime tests provided 81% coverage of data.py and 92% coverage of fixtures.py. I have not yet worked on 100% test coverage of parsedatetime itself.

I think the last main step before merging this to v3.0 is to add comments in the test case data to aid translation.