choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)
http://pymbar.readthedocs.io
MIT License

Testing framework discussion #137

Open jchodera opened 9 years ago

jchodera commented 9 years ago

This thread is to start a new discussion around how we can properly validate pymbar and similar codes using an automated testing framework.

We have/need several kinds of tests:

- Doctests. These test the documentation for each function to make sure that it is accurate.
- Unit tests. Ensure that each component is behaving as expected. Currently implemented as nose tests, but we may not have complete coverage.
- Regression tests. Tests that make sure the code produces the same answer as the last version. I don't think we have these implemented, but we may not need them.
- Example tests. Tests that ensure the examples correctly run.

One issue is whether we need regression tests (to ensure our results don't change) or if statistical tests (that ensure the overall correctness of the code on several model systems) are sufficient.

I believe there is a correct way to build statistical tests, but this involves selecting a significance threshold for a hypothesis test such that failure indicates there really is an error. That means that we should pick a threshold such that failure by chance is extremely unlikely---such as p = 1e-6---and then correct the p-value threshold for the total number of tests we expect to run in the lifetime of the project. This kind of correction would yield p / (Ntests * Ncommits) as a new threshold for significance, which we can then convert back to a sigma, for example, for testing a normally distributed statistic.

mrshirts commented 9 years ago

Doctests. These test the documentation for each function to make sure that it is accurate.

Unit tests. Ensure that each component is behaving as expected. Currently implemented as nose tests, but we may not have complete coverage.

Can we make the distinction between these two? Right now most components are functions.

I can gradually add more tests for function paths (different PMF zeros, etc.). We also need to add tests for exp, Gauss, and BAR.

Regression tests. Tests that make sure the code produces the same answer as the last version. I don't think we have these implemented, but we may not need them.

Regression tests are always a good idea, so that if something changes, one knows where the changes occur.

One issue is whether we need regression tests (to ensure our results don't change) or if statistical tests (that ensure the overall correctness of the code on several model systems) are sufficient.

There are changes that could be wrong but sit below the level of statistical precision.

harmonic_oscillators.py has fairly good coverage (all functions and most code paths), so a quick and dirty way is to simply save the output and have a script that performs a diff with each Travis CI run. A slightly more complicated approach would save each result as a pickle and test equality to near machine precision.

statistical tests (that ensure the overall correctness of the code on several model systems) are sufficient.

So, with sufficiently large N, the distribution of errors is indeed normal. The Anderson-Darling test seems to be working well as something to base the tests on, since it has (decent) confidence intervals. Right now, harmonic_oscillators_distributions.py is giving only one failure, for expectations of an unsampled state (with 2000 samples and 200 replications). Statistical tests should be run automatically, but they can also be run manually with QQ plots, which allow for easy visual inspection of whether the distribution is normal.
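A rough sketch of that kind of check, assuming a hypothetical `run_estimator` helper that returns an estimate and its estimated uncertainty (the real harness in confidence_intervals.py may be organized differently):

```python
# Sketch: standardize the errors from many replications and check that they
# look normally distributed using the Anderson-Darling test.
# run_estimator is a hypothetical placeholder, not part of the pymbar API.
import numpy as np
from scipy import stats

def standardized_errors(run_estimator, true_value, n_replications=200, seed=0):
    """Collect (estimate - true) / estimated_uncertainty over replications."""
    rng = np.random.RandomState(seed)
    errors = []
    for _ in range(n_replications):
        estimate, d_estimate = run_estimator(rng)  # hypothetical: (value, stderr)
        errors.append((estimate - true_value) / d_estimate)
    return np.asarray(errors)

def check_normal(errors, significance=1.0):
    """Fail if the Anderson-Darling statistic exceeds the critical value
    at the requested significance level (in percent)."""
    result = stats.anderson(errors, dist='norm')
    idx = int(np.argmin(np.abs(result.significance_level - significance)))
    # If this fails, a QQ plot (e.g. scipy.stats.probplot) helps manual inspection.
    assert result.statistic < result.critical_values[idx], (
        "Anderson-Darling statistic %.3f exceeds critical value %.3f"
        % (result.statistic, result.critical_values[idx]))
```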

Example tests. Tests that ensure the examples correctly run.

I think for now just running the examples code and testing for completion is probably sufficient?

I believe there is a correct way to build statistical tests, but this involves selecting a significance threshold for a hypothesis test such that failure indicates there really is an error. That means that we should pick a threshold such that failure by chance is extremely unlikely---such as p = 1e-6---and then correct the p-value threshold for the total number of tests we expect to run in the lifetime of the project. This kind of correction would yield p / (Ntests * Ncommits) as a new threshold for significance, which we can then convert back to a sigma, for example, for testing a normally distributed statistic.

I think it might be better to just do (p / Ntests), as we don't know the number of commits beforehand. Is the correctness of the code contingent on the number of commits there are? I'd rather have something that cuts a little too close, so that rather than trudging along with a code that has a decent chance of being wrong (but below threshold), one is forced to more carefully inspect a code that might be wrong, but could be right.

Also note that some tests have correlations; if some are wrong, it's more likely that others are.

mrshirts commented 9 years ago

Oh, one more! Performance testing is important. We want to see how key problems get better or worse performance over time. This can likely be triggered manually (it will take a while to run, so we don't want to trigger it on every commit), but it should run automatically. I know Kyle has some scripts he's used.

jchodera commented 9 years ago

Can we make the distinction between these two? Right now most components are functions.

We discussed this at length in other projects, and concluded that the main value of doctests is essentially to (1) encourage examples to be provided in the docstring of each function illustrating its use, and (2) ensure that these examples actually run.

The unit tests (currently written as nose tests) are in a separate file, and attempt to verify all of the various components (functions, groups of functions, or classes) are working properly. This helps catch issues in various components intended to be used alone or together, as well as helps track down what went wrong when some complicated collection of components fails.
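To make that distinction concrete, here is a minimal sketch of both kinds of test; `log_sum_exp` is a hypothetical illustration, not part of pymbar:

```python
# The doctest lives in the docstring and doubles as usage documentation;
# the unit test lives in a separate test module and is collected by nose/pytest.
# log_sum_exp here is a hypothetical example function, not pymbar code.
import numpy as np

def log_sum_exp(a):
    """Numerically stable log(sum(exp(a))).

    >>> import numpy as np
    >>> float(round(log_sum_exp(np.array([0.0, 0.0])), 6))
    0.693147
    """
    a = np.asarray(a, dtype=float)
    a_max = a.max()
    return a_max + np.log(np.sum(np.exp(a - a_max)))

def test_log_sum_exp_matches_naive():
    """Unit test: compare against the naive formula on well-scaled input."""
    rng = np.random.RandomState(0)
    a = rng.randn(50)
    expected = np.log(np.sum(np.exp(a)))
    assert abs(log_sum_exp(a) - expected) < 1e-12

if __name__ == "__main__":
    import doctest
    doctest.testmod()
    test_log_sum_exp_matches_naive()
```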

I can gradually add more tests for function paths (different PMF zeros, etc.). We also need to add tests for exp, Gauss, and BAR.

Gradually adding more tests is a great way to get them written.

Regression tests are always a good idea, so that if something changes, one knows where the changes occur.

Agreed. We do have to worry about making the results deterministic, but as long as we can ensure the same random test data is generated, or use stored test data, then we can do this.

harmonic_oscillators.py has fairly good coverage (all functions and most code paths), so a quick and dirty way is to simply save the output and have a script that performs a diff with each Travis CI run. A slightly more complicated approach would save each result as a pickle and test equality to near machine precision.

You're presumably suggesting this as a regression test. Here's how I would do this:
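A minimal sketch of one way this might look, using deterministic test data from a fixed seed and a stored reference pickle; `compute_free_energies` and the reference path are hypothetical placeholders for the real harmonic-oscillator driver and its output:

```python
# Sketch of a pickle-based regression test: regenerate results deterministically
# and compare against a stored reference to near machine precision.
# compute_free_energies and REFERENCE_FILE are illustrative placeholders.
import os
import pickle
import numpy as np

REFERENCE_FILE = "regression_reference.pkl"

def compute_free_energies(seed=0):
    """Placeholder for the real computation (e.g. MBAR on harmonic oscillators)."""
    rng = np.random.RandomState(seed)            # fixed seed -> deterministic data
    return {"Delta_f": np.cumsum(rng.randn(5))}  # stand-in for the real results

def test_regression():
    results = compute_free_energies(seed=0)
    if not os.path.exists(REFERENCE_FILE):
        # First run: store the reference rather than failing.
        with open(REFERENCE_FILE, "wb") as f:
            pickle.dump(results, f)
        return
    with open(REFERENCE_FILE, "rb") as f:
        reference = pickle.load(f)
    for key in reference:
        np.testing.assert_allclose(results[key], reference[key],
                                   rtol=1e-12, atol=1e-12)
```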

There are changes that could be wrong but sit below the level of statistical precision.

Yes, but that's why we want (1) a lot of test coverage, and (2) tests where we can generate a large amount of data so that the statistical error will be very small. After all, a test is not useful if its failure does not indicate a failure of the code.

So, with sufficiently large N, the distribution of errors is indeed normal. The Anderson-Darling test seems to be working well as something to base the tests on, since it has (decent) confidence intervals. Right now, harmonic_oscillators_distributions.py is giving only one failure, for expectations of an unsampled state (with 2000 samples and 200 replications). Statistical tests should be run automatically, but they can also be run manually with QQ plots, which allow for easy visual inspection of whether the distribution is normal.

They should be run automatically, but when a failure is detected, we can generate QQ plots for manual inspection to help diagnose errors.

Have you documented the Anderson-Darling test somewhere?

I think for now just running the examples code and testing for completion is probably sufficient?

This is fine for now. Later, we can also compare the output to stored output to add regression testing.
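A sketch of the run-to-completion check, assuming example scripts live under examples/ (the glob pattern is a guess about the layout):

```python
# Sketch: run each example script in a subprocess and fail if it exits nonzero.
# The examples/*/*.py glob is an assumption about the repository layout.
import glob
import subprocess
import sys

def test_examples_run_to_completion():
    for script in sorted(glob.glob("examples/*/*.py")):
        proc = subprocess.run([sys.executable, script],
                              capture_output=True, text=True)
        assert proc.returncode == 0, "%s failed:\n%s" % (script, proc.stderr)
```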

I think it might be better to just do (p / Ntests), as we don't know the number of commits beforehand.

We can estimate this to within an order of magnitude. pymbar has had about 200 commits/year in the 20 months it has been on GitHub. If we estimate the lifetime of a piece of software to be at most 10 years, that is 200 * 10 = 2000 commits, so no more than 10^4 commits and pull requests are to be expected over its lifetime.
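For concreteness, the arithmetic of converting the corrected threshold back to a sigma level might look like this (the number of tests per commit is an illustrative assumption):

```python
# Sketch: correct the per-test failure probability for the number of tests
# expected over the project's lifetime, then convert to a two-sided sigma
# level for a normally distributed statistic. n_tests is an assumed value.
from scipy.stats import norm

p = 1e-6            # acceptable probability of a spurious failure per test
n_commits = 1e4     # upper estimate from the discussion above
n_tests = 100       # assumed number of statistical tests run per commit

p_corrected = p / (n_tests * n_commits)
sigma = norm.isf(p_corrected / 2.0)   # two-sided threshold in sigmas
print("corrected p = %.1e  ->  %.2f sigma" % (p_corrected, sigma))
```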

jchodera commented 9 years ago

Oh, one more! Performance testing is important. We want to see how key problems get better or worse performance over time. This can likely be triggered manually (it will take a while to run, so we don't want to trigger it on every commit), but it should run automatically. I know Kyle has some scripts he's used.

It could be harder to do this on Travis, since I'm not sure we're guaranteed to have identical testing environments, but I'm fully supportive of creating an automated benchmark script to run this on a standard suite. We can have our Jenkins server (http://jenkins.choderalab.org) run this on a number of machines and upload the benchmark results.
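A sketch of the kind of record such a benchmark run could upload (the standard problem here is just a placeholder for the real MBAR benchmark):

```python
# Sketch of a benchmark harness: time a standard problem and record the result
# with enough metadata to compare runs across machines and commits.
# setup_standard_problem is a hypothetical placeholder for the real benchmark.
import json
import platform
import time

def setup_standard_problem():
    """Placeholder for constructing a standard MBAR benchmark problem."""
    return lambda: sum(i * i for i in range(10 ** 6))  # stand-in workload

def run_benchmark(n_repeats=5):
    solve = setup_standard_problem()
    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        solve()
        timings.append(time.perf_counter() - start)
    record = {
        "best_seconds": min(timings),
        "hostname": platform.node(),
        "python": platform.python_version(),
        "timestamp": time.time(),
    }
    print(json.dumps(record, indent=2))
    return record

if __name__ == "__main__":
    run_benchmark()
```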

rmcgibbo commented 9 years ago

Unsolicited advice: I wouldn't bite off more than you can chew with testing. IMHO, better to spend more time working incrementally at adding new tests than extensive planning of all the bells and whistles that an ideal test suite might contain.

jchodera commented 9 years ago

That's still the plan! Small, incremental changes. But still worth discussing where we're headed, even if it is just one step at a time.

mrshirts commented 9 years ago

I'll take on gradually adding nosetests and working on a regression test over the next week or two. Perhaps Kyle can work on the performance testing?

Have you documented the Anderson-Darling test somewhere?

There's some documentation in the comments in confidence_intervals.py and in the README.py in harmonic_oscillators_distributions.py. And a link to the Wikipedia page :)

kyleabeauchamp commented 9 years ago

So I already have a short benchmark script (https://github.com/choderalab/pymbar/blob/master/scripts/benchmark_mbar.py). IMHO, we don't want to invest more time in automated performance testing given the huge list of other tasks. Small changes in performance between commits are OK, whereas small changes in correctness are obviously unacceptable.

jchodera commented 9 years ago

IMHO, we don't want to invest more time in automated performance testing given the huge list of other tasks.

Agreed, no need to do this now. Can punt to sometime in the future.

mrshirts commented 9 years ago

What we do want to do is avoid substantial changes in the efficiency of some functions that happen without our awareness.