bear / parsedatetime

Parse human-readable date/time strings

Revised testing architecture #196

Open idpaterson opened 8 years ago

idpaterson commented 8 years ago

The current approach to unit tests is quite verbose but makes it very easy to show what failed because each assertion contains the literal value that it is testing. I tried something different in this pull request.

Original

Take the simple date time tests as an example. The test has 82 lines of code covering 35 conditions:

    def testTimes(self):
        start = datetime.datetime(
            self.yr, self.mth, self.dy, self.hr, self.mn, self.sec).timetuple()
        target = datetime.datetime(
            self.yr, self.mth, self.dy, 23, 0, 0).timetuple()

        self.assertExpectedResult(
            self.cal.parse('11:00:00 PM', start), (target, 2))
        self.assertExpectedResult(
            self.cal.parse('11:00 PM', start), (target, 2))
        self.assertExpectedResult(
            self.cal.parse('11 PM', start), (target, 2))
        [...]
        self.assertExpectedResult(
            self.cal.parse('"11 p.m."', start), (target, 2))

        target = datetime.datetime(
            self.yr, self.mth, self.dy, 11, 0, 0).timetuple()

        self.assertExpectedResult(
            self.cal.parse('11:00:00 AM', start), (target, 2))
        self.assertExpectedResult(
            self.cal.parse('11:00 AM', start), (target, 2))
        [...]
        self.assertExpectedResult(
            self.cal.parse('(11 a.m.)', start), (target, 2))

        target = datetime.datetime(
            self.yr, self.mth, self.dy, 7, 30, 0).timetuple()

        self.assertExpectedResult(self.cal.parse('730', start), (target, 2))
        self.assertExpectedResult(self.cal.parse('0730', start), (target, 2))
        self.assertExpectedResult(self.cal.parse('0730am', start), (target, 2))

        target = datetime.datetime(
            self.yr, self.mth, self.dy, 17, 30, 0).timetuple()

        self.assertExpectedResult(self.cal.parse('1730', start), (target, 2))
        self.assertExpectedResult(self.cal.parse('173000', start), (target, 2))

Small improvement

In #195 I wanted to test all of these conditions in nlp as well, but converting all of those tests to expect the result from nlp (which includes the phrase that matched) just wasn't going to happen. Following the existing pattern with such repetitive expected values for nlp was a whole new level of unmaintainability.

This approach allows prefixes and other programmatic modifications to all of the test targets for more comprehensive testing. For example, by running everything through the "wrap in quotes" test I found that "August 25" was not parsed.

There are now 62 lines of code for 66 tests:

    def testTimes(self):
        start = datetime.datetime(
            self.yr, self.mth, self.dy, self.hr, self.mn, self.sec).timetuple()
        targets = {
            datetime.datetime(self.yr, self.mth, self.dy, 23, 0, 0): (
                '11:00:00 PM',
                '11:00 PM',
                '11 PM',
                [...]
                '11 p.m.',
            ),
            datetime.datetime(self.yr, self.mth, self.dy, 11, 0, 0): (
                '11:00:00 AM',
                '11:00 AM',
                [...]
                '11 a.m.',
            ),
            datetime.datetime(self.yr, self.mth, self.dy, 7, 30, 0): (
                '730',
                '0730',
                '0730am',
            ),
            datetime.datetime(self.yr, self.mth, self.dy, 17, 30, 0): (
                '1730',
                '173000',
            )
        }

        for dt, phrases in targets.iteritems():
            # Time (2) phrase starting at index 0
            target = (dt, 2, 0)
            for phrase in phrases:
                self.assertExpectedResult(
                    self.cal.nlp(phrase, start),
                    (target + (len(phrase), phrase),)
                )

            # Wrap in quotes
            target = (dt, 2, 1)
            for phrase in phrases:
                self.assertExpectedResult(
                    self.cal.nlp('"%s"' % phrase, start),
                    (target + (len(phrase) + 1, phrase),)
                )

This isn't the best structure for tests but at least you can still easily see what failed:

tests/utils.py:22: in decoratedComparator
    self.fail(failureMessage % (errMsg, result, check))
E   AssertionError: Result does not match target value
E   
E       Result:
E       None
E   
E       Expected:
E       ((datetime.datetime(2016, 8, 25, 8, 49, 53), 1, 1, 10, u'August 25'),)

Share data between tests

If we were to store this test data outside of the test case itself, we could split the basic and wrapped-in-quotes checks into two different test methods: pull in a fixture, apply modifications, pass it through nlp, and test that you still get the expected output.

Share data between parse and nlp

With a test fixture format like this we could also share test cases between nlp and parse to ensure that they behave consistently.

pytest has excellent support for creating and loading fixtures. It looks like pytest could offer much cleaner and more reusable tests than unittest, since data could be loaded with decorators rather than forcing a subclass for each group of tests that shares the same data. I'm not an expert in python unit testing, so someone else will probably have more useful input here.

Share test cases between locales

Taking it another step, the fixture data could be translated for each locale. This would allow for much more thorough testing and easier test setup for locales.

The tests above could be boiled down to something like this that runs through a full collection of test cases for each locale.

@pytest.mark.parametrize('cal,sourceTime,phrase,dt', simpleTimesByLocale)
def test_simple_times_wrapped_in_quotes(cal, sourceTime, phrase, dt):
    mod_phrase = u'"%s"' % phrase
    assert cal.nlp(mod_phrase, sourceTime) == ((dt, 2, 1, len(phrase) + 1, phrase),)

Here we have 4 lines of code for over 200 tests, with a very clear test name and a single set of arguments that will be logged with each failure. Of course a fair amount of logic would be needed to generate the per-locale test data; this may or may not be the way to do it. In some cases it may not be necessary to balloon the number of tests this much, but in others it may help prevent regressions and inconsistencies.

You will also see each failure rather than the entire test stopping at the first failed input. A max-failures setting (pytest's --maxfail option) would avoid showing thousands of failures if a contributor really broke something.

Special considerations for parse

When the above simple modified test for nlp fails, it is easy to see what was being tested because the assertion includes the phrase. However, parse does not include the phrase in its return value so looping over values in that way would give no insight into what actually failed.

Parameterized test functions like the one shown above would not have this problem since the arguments are printed with the failure message and each invocation has only a single datapoint to test rather than looping over a list.

Data format

Any suggestions for a good format for this kind of test fixture data? YAML is fairly common and might be easier to read and translate than python, but it introduces the risk of awkwardly structuring something that would be straightforward in python.

The format shown above is not portable enough; it is just something I set up to quickly get these few tests running. The parametrized example above assumes some intermediary processing to flatten the data format down to a tuple of (calendar, source time, phrase, expected datetime); we wouldn't want the test data to have to specify all of those explicitly for every case.

bear commented 8 years ago

I like this idea and direction a lot - especially the thought that whatever data file is used to generate the test cases can then be translated (or adjusted) for locales and stored with the locale.

Having a *.test.yaml (or something like that) for each locale that is then processed and fed into the test loop would help with this. The test description would need to cover two different scenarios: tests for things that should parse and tests for things that shouldn't parse. I'm not sure whether that distinction is needed for nlp.

pytest, IIRC, allows for parameterized tests using yield; see http://remusao.github.io/pytest-paramaterize-tests-with-external-data.html for what I'm talking about.

idpaterson commented 8 years ago

I'm very interested in working on a new approach to testing before getting too deep on features and fixes for v3, glad to hear that you are interested as well @bear. I think it would be best to start with a goal of replicating the current tests (or at least avoiding any new tests that break) before leveraging the reusable cases to amplify the test coverage.

The simple times test cases for English might look something like this in yaml:

simple_times:
    -
        source_time: 2016-01-01 00:00:00
        target: 2016-01-01 23:00:00
        phrases:
            - "11:00:00 PM"
            - "11:00 PM"
            - "11 PM"
            - "11PM"
            - "2300"
            - "23:00"
            - "11p"
            - "11pm"
            - "11:00:00 P.M."
            - "11:00 P.M."
            - "11 P.M."
            - "11P.M."
            - "11p.m."
            - "11 p.m."
    -
        source_time: 2016-01-01 00:00:00
        target: 2016-01-01 11:00:00
        phrases:
            - "11:00:00 AM"
            - "11:00 AM"
            - "11 AM"
            - "11AM"
            - "1100"
            - "11:00"
            - "11a"
            - "11am"
            - "11:00:00 A.M."
            - "11:00 A.M."
            - "11 A.M."
            - "11A.M."
            - "11a.m."
            - "11 a.m."
    -
        source_time: 2016-01-01 00:00:00
        target: 2016-01-01 07:30:00
        phrases:
            - "730"
            - "0730"
            - "0730am"
    -
        source_time: 2016-01-01 00:00:00
        target: 2016-01-01 17:30:00
        phrases:
            - "1730"
            - "173000"

I included quotes around all of the phrases, despite their being required for only a few (e.g. unquoted, 23:00 parses as 1380 and 0730 parses as 730), to simplify translations without getting tangled up in YAML syntax.
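For illustration, here is how one of those unquoted values goes wrong, assuming the fixtures are loaded with PyYAML (which implements YAML 1.1 scalar resolution):

    import yaml

    # YAML 1.1 resolves colon-separated digits as base-60 integers, so an
    # unquoted time-like phrase does not survive the round trip as a string
    yaml.safe_load('23:00')    # -> 1380  (23 * 60 + 0)
    yaml.safe_load('"23:00"')  # -> '23:00'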

idpaterson commented 8 years ago

I've been browsing through the tests to get a grasp on what would be needed in the YAML structure. A few observations for tonight:

YAML should be parsed using OrderedDict instead of normal dictionaries if possible to avoid inconsistencies between test runs and to maintain consistency with the order in the data file.
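A minimal sketch of that loading step with PyYAML (the helper name is mine); the flatten_mapping call also keeps the <<: merge keys used below working:

    import yaml
    from collections import OrderedDict

    def ordered_load(stream):
        # Load YAML with mappings as OrderedDict to preserve file order
        class OrderedLoader(yaml.SafeLoader):
            pass

        def construct_mapping(loader, node):
            loader.flatten_mapping(node)  # resolve <<: merge keys first
            return OrderedDict(loader.construct_pairs(node))

        OrderedLoader.add_constructor(
            yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
            construct_mapping)
        return yaml.load(stream, OrderedLoader)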

Many tests include phrases that each map to distinct datetimes. For example, tests for special times would be more convenient to represent like this:

special_times:
    - &special_times_2016-01-01
        source_time: 2016-01-01 00:00:00
        phrases:
            "noon": 2016-01-01 12:00:00
            "afternoon": 2016-01-01 13:00:00
            "lunch": 2016-01-01 12:00:00
            "morning": 2016-01-01 06:00:00
            "breakfast": 2016-01-01 08:00:00
            "dinner": 2016-01-01 19:00:00
            "evening": 2016-01-01 18:00:00
            "midnight": 2016-01-01 00:00:00
            "night": 2016-01-01 21:00:00
            "tonight": 2016-01-01 21:00:00
            "eod": 2016-01-01 17:00:00
    # We should get the same times with a source later in the day
    - 
        <<: *special_times_2016-01-01
        source_time: 2016-01-01 12:13:14

YAML &anchors also provide a convenient way to run the same set of phrases against a different source time.

It might be useful to allow the group configuration to specify other settings like DOWParseStyle. That could (and maybe should) be done in the test case, but it would be nice to at least have a consistent way to represent different settings in the data file so that it does not end up with multiple sets of results that behave differently for no obvious reason. If specified in the test fixture, the configured calendar could be passed as an argument to each test, as would be needed for locales.

example:
    -
        dow_parse_style: 1
        source_time: 2016-01-01 00:00:00
        phrases:
            ...
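If the harness goes that route, building the calendar for a group might look something like this minimal sketch (calendarForGroup and the option-to-attribute mapping are my own names; pdt.Constants and DOWParseStyle are existing parsedatetime APIs):

    import parsedatetime as pdt

    def calendarForGroup(options, localeID='en_US'):
        # Build a Calendar whose constants reflect the group's options
        constants = pdt.Constants(localeID=localeID, usePyICU=False)
        if 'dow_parse_style' in options:
            constants.DOWParseStyle = options['dow_parse_style']
        return pdt.Calendar(constants)

    # e.g. calendarForGroup({'dow_parse_style': 1}) for the group above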
bear commented 8 years ago

I like the last example you gave, with the tests specifying the phrase as the key and the expected result as the value. Instead of having dow_parse_style: 1 be a peer of source_time and phrases, I would suggest we add a more explicit section name and then allow anything inside of it to change any pdt constant.

idpaterson commented 8 years ago

Good point. I'm thinking it should live only at the top level rather than in each source/target collection like I showed, so all cases within a named group use the same configuration environment and the name of the group gives some indication of that configuration.

So within en.yml or en/something.yml we would have appropriately-named test groups, each with the ability to set configuration options for all test cases in the group. Anchors make this quite readable, and I am leaning toward the en/something.yml structure rather than a single file per language to keep the data grouped similarly to the current test classes.

Here is an example of a few test cases for weekdays and the DOWParseStyle flag:

constants:
    - &tuesday 2016-01-19 00:00:00
    - &wednesday 2016-01-20 00:00:00
    - &thursday 2016-01-21 00:00:00
    - &next_wednesday 2016-01-27 00:00:00
    - &wednesday_phrases
        - "Wednesday"
        - "wed"
        - "Wed."

weekdays:
    cases:
        - source_time: *tuesday
          phrases: *wednesday_phrases
          target: *wednesday
        # Since the Wednesday of this week has passed we should get the
        # Wednesday of the following week
        - source_time: *thursday
          phrases: *wednesday_phrases
          target: *next_wednesday

weekdays_dow_parse_style_0:
    options: 
        dow_parse_style: 0
    cases:
        - source_time: *tuesday
          phrases: *wednesday_phrases
          target: *wednesday
        # The Wednesday of this week has passed but the parse style should
        # give us that day anyway
        - source_time: *thursday
          phrases: *wednesday_phrases
          target: *wednesday

The named anchors make the test cases read nicely without getting bogged down in the specific dates. This parses to:

{
    'constants': [datetime.datetime(2016, 1, 19, 0, 0),
                  datetime.datetime(2016, 1, 20, 0, 0),
                  datetime.datetime(2016, 1, 21, 0, 0),
                  datetime.datetime(2016, 1, 27, 0, 0),
                  ['Wednesday', 'wed', 'Wed.']],
    'weekdays': {
        'cases': [{
            'phrases': ['Wednesday', 'wed', 'Wed.'],
            'target': datetime.datetime(2016, 1, 20, 0, 0),
            'source_time': datetime.datetime(2016, 1, 19, 0, 0)
        }, {
            'phrases': ['Wednesday', 'wed', 'Wed.'],
            'target': datetime.datetime(2016, 1, 27, 0, 0),
            'source_time': datetime.datetime(2016, 1, 21, 0, 0)
        }]
    },
    'weekdays_dow_parse_style_0': {
        'cases': [{
            'phrases': ['Wednesday', 'wed', 'Wed.'],
            'target': datetime.datetime(2016, 1, 20, 0, 0),
            'source_time': datetime.datetime(2016, 1, 19, 0, 0)
        }, {
            'phrases': ['Wednesday', 'wed', 'Wed.'],
            'target': datetime.datetime(2016, 1, 20, 0, 0),
            'source_time': datetime.datetime(2016, 1, 21, 0, 0)
        }],
        'options': {
            'dow_parse_style': 0
        }
    }
}
idpaterson commented 8 years ago

We also need a plan for dealing with pdtContext. Certainly there will be a test suite dedicated to verifying the pdtContext accuracy flags, but I wonder whether it's best to let other tests just ignore the pdtContext or to encode it in the yaml.

It's no trouble to make a custom constructor in yaml like `!pdtContext year|month`, but it would be too much to expect that to be implemented for every single phrase or test group.

For tests that do not care about the context we could add a constant for initializing a "wildcard" context, e.g. pdtContext(ANY_CONTEXT), and have pdtContext's __eq__ override check whether either side is this ANY_CONTEXT value and return true. If no context is specified in the test data, the assertion would use this wildcard.
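A minimal sketch of that override (ANY_CONTEXT is a hypothetical sentinel, and only the relevant parts of pdtContext are shown):

    ANY_CONTEXT = object()  # hypothetical module-level sentinel

    class pdtContext(object):
        # Sketch only; the real class tracks more than accuracy

        def __init__(self, accuracy=0):
            self.accuracy = accuracy

        def __eq__(self, other):
            if not isinstance(other, pdtContext):
                return NotImplemented
            # A wildcard on either side matches anything, so test data
            # that omits the context can still assert on parse results
            if ANY_CONTEXT in (self.accuracy, other.accuracy):
                return True
            return self.accuracy == other.accuracy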

bear commented 8 years ago

I'm loving the working example you have now -- +1 to it being implemented now :)

I would suggest we leave pdtContext testing alone for the moment and get comfortable with all of the other test changes first. My hunch is that we will need a special check like you have suggested: if no context is given, generate one or do the comparison as a special case, because the contexts all need to be checked using specially crafted source times.

idpaterson commented 8 years ago

Great! I think the naming could use some improvement (phrases and target may not be the right words) but the general concept is solid enough to start translating into actual tests.

A !source_time_plus constructor that uses Calendar.inc() (for years and months) plus timedelta for the rest would be needed for relative dates:

constants:
    - !timedelta &delta_3y2w5d
        years: 3
        weeks: 2
        days: 5

years_weeks_and_days:
    cases:
        - source_time: 2016-01-01 00:00:00
          target: *delta_3y2w5d
          phrases: 
            - "3 years, 2 weeks, 5 days"
            - "3 years, 2 weeks and 5 days"
            - "3y, 2w, 5d"

nlp will need some special consideration as well since you would want to be able to target multiple dates in the string as well as the exact substrings that matched. The python tests for nlp could handle the normal format discussed so far for parse since the entire phrase should match. We would just need the ability to write data for nlp-specific tests that contain non-date words or multiple dates.

Earlier you mentioned testing false positives – phrases that do not contain a date expression. That looks like it will be a bit tricky: while nlp returns None when there are no matches, parse returns the source time with a blank context. I think it would be best to set target: null in the YAML for clarity. The python tests should not have any problem handling that; if you've loaded a data set that is meant to contain false positives you can easily construct the proper result in the test, as sketched below. We just wouldn't want to mix real date expressions and false positives in the same test group.
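For example, the assertion for a target: null case could look something like this (assuming an empty pdtContext() is the blank context described above):

    def assertNotParsed(cal, phrase, sourceTime):
        # nlp finds nothing at all in a false-positive phrase
        assert cal.nlp(phrase, sourceTime) is None
        # parse falls back to the source time with a blank context
        assert cal.parse(phrase, sourceTime) == \
            (sourceTime.timetuple(), pdtContext())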

Other things are likely to come up as I go along, but I don't see any insurmountable obstacles to supporting the existing test cases in this new format. I'll open a pull request once I have something working. GitHub now allows project contributors to collaborate on pull requests, which might come in handy.

A side motivation for migrating the test data to YAML is to make a very convenient repository from which to pull realistic examples. I'm definitely going to use the locale-specific tests to generate some documentation for date formatting support in my own project!

idpaterson commented 8 years ago

I think it's going well so far! This is the syntax I'm using at least for now to define a test:

@pdtFixture('deltas.yml', ['integer_values'])
def test_integer_value_deltas(cal, phrase, sourceTime, target):
    assert cal.parse(phrase, sourceTime) == (target.timetuple(), 2)

The pdtFixture decorator has a few arguments – the data file, the groups to pull from that data file, and an optional locale. If the locale is unspecified, the test will be parametrized against this test data in all locales.

In this example, the deltas tests use time offsets as targets in YAML rather than explicit dates. The target provided by this decorator is calculated by adjusting the sourceTime with timedelta and Calendar.inc for these delta times. When writing the tests in python you don't need to put as much effort into calculating or manipulating dates since it is all expressed in the YAML files.

So where do these cal, phrase, sourceTime, and target arguments come from? pdtFixture inspects the decorated function to grab its argument list and prepares data based on whatever the test needs. It then sends the argument names as a comma-separated list, along with all of the test parameters, to pytest.mark.parametrize, which avoids maintaining a separate list of parameter names by hand (a sketch of this introspection step follows the example below). The order of arguments in the test function does not matter since pytest provides them by name, and additional parameters can be supplied by a separate parametrization. For example, we could parametrize prefixes and suffixes:

@pdtFixture('simple.yml', ['simple_times'])
@pytest.mark.parametrize('prefix,suffix', (('"', '"'), ('(', ')'), ('[', ']')))
def test_times_with_common_prefixes_and_suffixes(cal, phrase, sourceTime, target, prefix, suffix):
    modphrase = u'%s%s%s' % (prefix, phrase, suffix)
    assert cal.parse(modphrase, sourceTime) == (target.timetuple(), 2)
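The introspection step might look roughly like this (loadTestParameters is a hypothetical helper standing in for the YAML processing):

    import inspect
    import pytest

    def pdtFixture(dataFile, groupNames=None, localeID=None):
        def decorator(testFn):
            # Ask the test which values it needs; names supplied by other
            # parametrizations (e.g. prefix/suffix above) must be skipped
            argNames = [arg for arg in inspect.getargspec(testFn).args
                        if arg not in ('prefix', 'suffix')]
            params = loadTestParameters(
                dataFile, groupNames, localeID, argNames)
            return pytest.mark.parametrize(
                ','.join(argNames), params)(testFn)
        return decorator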

The prefix/suffix example shows using the target parameter and then formatting it into a parse response. The parametrization could also provide targetContext, where the YAML is consulted for information about what the pdtContext should be, or fall back to that wildcard context if the test doesn't specify one.

I originally had parseTarget as a parameter providing the value of (target.timetuple(), 2) as an argument to the test, but I think that crosses the line into too much magic. It's probably too abstract to write tests like assert cal.parse(phrase, sourceTime) == parseTarget versus the at least somewhat more explicit assert cal.parse(phrase, sourceTime) == (target.timetuple(), targetContext).

The test failures are very readable with this approach: all of the arguments are shown and the failure includes a diff:

================================================================================================ FAILURES =================================================================================================
_____________________________________________________________________ test_integer_value_deltas[cal2-2 days ago-sourceTime2-target2] ______________________________________________________________________

cal = <parsedatetime.Calendar object at 0x105f82510>, phrase = '2 days ago', sourceTime = datetime.datetime(2016, 1, 1, 0, 0), target = datetime.datetime(2015, 12, 30, 0, 0)

    @pdtFixture('deltas.yml', ['integer_values'])
    def test_integer_value_deltas(cal, phrase, sourceTime, target):
>       assert cal.parse(phrase, sourceTime) == (target.timetuple(), 2)
E       assert (time.struct_..._isdst=-1), 1) == (time.struct_t..._isdst=-1), 2)
E         At index 1 diff: 1 != 2
E         Use -v to get the full diff

tests/test_delta.py:9: AssertionError

If one parametrized python test function has 10 successes you'll see 1 passed for that test, but if one python test function has 5 successes and 5 failures you'll see 5 failed. The successful cases for a function with failures are not displayed, at least by default.

Lots of cool possibilities here.

bear commented 8 years ago

This work you are doing has me very excited!

being able to make our testing easier and with better coverage for locales == HUGE win!

much thanks!

idpaterson commented 8 years ago

Here is an example of testing pdtContext. It's actually easier to work with pdtContext from the start than to try to get the correct integer in there. Given the earlier point that we can provide a wildcard context when a test case does not specify one, pdtContext is more forgiving, but it's also easy enough to write it into tests from the start. Though the tests will be finished before pdtContext is made the default, I think this is the best way to proceed to avoid extra work dealing with the integer day/time/unit flag, which is derived from pdtContext.

A pdtContext match failure looks like this. Again, pytest is very clear as to what's wrong and the -v flag provides a full diff for less obvious failures:

_______________________________________________________________ test_integer_value_deltas[cal0-5 minutes ago-sourceTime0-target0-context0] ________________________________________________________________

cal = <parsedatetime.Calendar object at 0x106109550>, phrase = '5 minutes ago', sourceTime = datetime.datetime(2016, 1, 1, 0, 0), target = datetime.datetime(2015, 12, 31, 23, 55)
context = pdtContext(accuracy=pdtContext.ACU_HOUR | pdtContext.ACU_MIN)

    @pdtFixture('deltas.yml', ['integer_values'])
    def test_integer_value_deltas(cal, phrase, sourceTime, target, context):
>       assert cal.parse(phrase, sourceTime) == (target.timetuple(), context)
E       assert (time.struct_...text.ACU_MIN)) == (time.struct_t...text.ACU_MIN))
E         At index 1 diff: pdtContext(accuracy=pdtContext.ACU_MIN) != pdtContext(accuracy=pdtContext.ACU_HOUR | pdtContext.ACU_MIN)
E         Use -v to get the full diff

tests/test_delta.py:9: AssertionError

In this case I made the test fail by writing a bad test:

integer_values:
    sourceTime: *arbitrary_date
    cases:
        - context: !pdtContext minute | hour
          target: !timedelta
            minutes: -5
          phrases:
            - "5 minutes ago"
        - context: !pdtContext hour
          target: !timedelta
            hours: -34
          phrases:
            - "34 hours ago"
        - context: !pdtContext day
          target: !timedelta
            days: -2
          phrases:
            - "2 days ago"

Similar to sourceTime, the context can be specified at the test group level if it is going to be the same across all cases.

years_weeks_and_days:
    sourceTime: *arbitrary_date
    context: !pdtContext year | week | day
    cases:
        - target: *delta_3y2w5d
          phrases:
            - "3 years, 2 weeks, 5 days"
            - "3 years, 2 weeks and 5 days"
            - "3y, 2w, 5d"
        - target: *delta_0y1w0d
          phrases:
            - "0 years, 1 week, 0 days"
            - "0 years, 1 week and 0 days"
            - "0y, 1w, 0d"
bear commented 8 years ago

+1 to the new keywords, and I see what you're saying about just using pdtContext from the start -- these examples are already showing the benefits of this new setup: easier to read tests

idpaterson commented 8 years ago

Just checking back in here. It's going very well; I'm going to start that pull request soon to share the work I've done so far.

No major changes; the tests look pretty similar to what was described above. Here is an example of a fairly ugly test case:

@pdtFixture('names.yml')
def test_month_names(cal, phrase, sourceTime, target, context, assertLazyStructTimes):
    result = cal.parse(phrase, sourceTime)
    assertLazyStructTimes(result[0], target.timetuple())
    assert result[1] == context

There are a few things going on here:

  1. @pdtFixture will use the test name to select the correct test group from the data file if the test group names are not provided explicitly (in this case it uses the test group month_names)
  2. assertLazyStructTimes is a pytest fixture for handling time.struct_time values whose month day, year day, and daylight saving time fields are incorrect.

All of the month name tests failed because parsedatetime returns technically invalid time.struct_time values: the mday, yday, and isdst fields are not updated and therefore match the source time. I tried a few different approaches to handling this assertion once I hit this batch of tests, but this seemed to be the best. I erred on the side of less magic here so that if the return values are fixed the tests can easily be grepped and reverted to normal assert cal.parse(phrase, sourceTime) == (target.timetuple(), context) assertions.
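For concreteness, a minimal sketch of such a fixture; the exact set of skipped fields follows the description above and should be treated as an assumption:

    import pytest

    @pytest.fixture
    def assertLazyStructTimes():
        def compare(result, expected,
                    skip=('tm_mday', 'tm_yday', 'tm_isdst')):
            # Compare field by field, ignoring the stale values
            for field in ('tm_year', 'tm_mon', 'tm_mday', 'tm_hour',
                          'tm_min', 'tm_sec', 'tm_wday', 'tm_yday',
                          'tm_isdst'):
                if field not in skip:
                    assert getattr(result, field) == \
                        getattr(expected, field), field
        return compare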

Also an example of testing something other than parse where you don't need sourceTime or context:

# Target values as expected from Calendar._convertUnitAsWords
numbers_as_words:
    cases:
        - target: 0
          phrases: 
            - "zero"
        - target: 11
          phrases:
            - "eleven"
        - target: 133000400314
          phrases:
            - "one hundred thirty three billion, four hundred thousand three hundred fourteen"
            - "one hundred thirty-three billion four hundred thousand three hundred and fourteen"

@pdtFixture('numbers_as_words.yml')
def test_numbers_as_words(cal, phrase, target):
    assert cal._convertUnitAsWords(phrase) == target
bear commented 8 years ago

thanks for the update!

I agree completely with having "less magic" and easily grep'able tests - solving it with a fixture seems ok.