choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)
http://pymbar.readthedocs.io
MIT License

Switch from distutils to setuptools to take advantage of test automation #6

Closed jchodera closed 10 years ago

jchodera commented 11 years ago

We can more easily automate tests via setup.py with "python setup.py test" using setuptools.

kyleabeauchamp commented 11 years ago

What's the advantage of using setuptools here? My worry is that the sad state of python packaging might indicate limits of what you can do using setuptools.

I don't think we've been using setuptools integration for anything--we've just been using nose.

jchodera commented 11 years ago

setuptools supports running tests in a tests/ subdirectory automatically via python setup.py test, while distutils does not. There's probably a better testing system we should be using, though. I know very little about good testing practices for Python other than doctest and manually-run integration tests, but would like to move to fully automated testing at three levels:

What do you guys use?
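
For concreteness, a minimal setuptools-based setup.py wired for "python setup.py test" might look like the sketch below (the metadata, dependency list, and tests/ layout are illustrative assumptions, not pymbar's actual configuration):

```python
from setuptools import setup, find_packages

setup(
    name='pymbar',
    version='0.0.0',                      # placeholder version
    packages=find_packages(),
    install_requires=['numpy', 'scipy'],  # assumed dependencies
    tests_require=['nose'],
    # "python setup.py test" will collect and run the tests via nose
    test_suite='nose.collector',
)
```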

kyleabeauchamp commented 11 years ago

Our tests aren't perfect, but we have some doctests, unit tests, and integration tests.

The best places to look are on MDTraj.

You can see our travis integration here:

https://github.com/rmcgibbo/mdtraj/blob/master/.travis.yml

The tests are here:

https://github.com/rmcgibbo/mdtraj/tree/master/MDTraj/tests

Here is the reference data:

https://github.com/rmcgibbo/mdtraj/tree/master/MDTraj/testing/reference

kyleabeauchamp commented 11 years ago

Also, my knowledge of testing is limited to following the templates that Robert set up, so I've never actually done a detailed comparison. In general, though, Robert chooses "industry standard" best practices.

kyleabeauchamp commented 11 years ago

Robert also likes to use a bunch of "helper" functions for equality comparison and reference data loading. I'm not sure if this is necessary or not, but you could probably steal some things like these:

from mdtraj.testing import skipif, get_fn, eq, slow
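
For illustration, nose-style tests built on those helpers might look roughly like this (hypothetical usage and file name; I'm assuming eq raises on mismatch and get_fn resolves a path into the bundled reference data):

```python
import numpy as np
from mdtraj.testing import get_fn, eq, slow

def test_reference_data_shape():
    # get_fn resolves a filename against the package's bundled reference data;
    # 'reference_energies.txt' is a made-up name used only for illustration.
    data = np.loadtxt(get_fn('reference_energies.txt'))
    # eq raises an informative error if the two objects differ
    eq(data.ndim, 2)

@slow
def test_expensive_case():
    # tests decorated with @slow can be skipped during quick test runs
    pass
```
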
jchodera commented 11 years ago

Thanks, @kyleabeauchamp! I've set up travis-ci to test the build/install process with pymbar already (see the badge on the README.md on the main page) but it currently just runs two random tests to see if they complete. I'll dig into the examples from MDTraj to see if we can use a similar style for pymbar and yank.

kyleabeauchamp commented 11 years ago

This test suite is fairly straightforward and might be a good model:

https://github.com/rmcgibbo/mdtraj/blob/master/MDTraj/tests/test_topology.py

kyleabeauchamp commented 11 years ago

One more thing that might be of interest: the python-netcdf4 people just uploaded a new (broken) version, which recently killed our MDTraj tests. If you run into this issue, see https://github.com/rmcgibbo/mdtraj/issues/150

rmcgibbo commented 11 years ago

Here are my two cents, since I saw this linked on rmcgibbo/mdtraj#150

  1. setuptools should be preferred over distutils, but not just because of setup.py test; mostly because of lots of other fixes and improvements, including versioned installs, the ability to use setup.py develop during development, the install_requires keyword to setup, and others. Don't use numpy.distutils. It is crap, and @msultan and I have found multiple bugs.
  2. numpy.testing is your friend. Functions like np.testing.assert_array_almost_equal are much clearer than something like assert np.linalg.norm(x - y) < 1e-8, and the np.testing methods also give nice, informative error messages when they break, e.g. what percent of the array elements differ.
  3. nose is great. Using unittest.TestCase directly means lots of extra boilerplate. With nose, your tests can just be functions defined at the module level, like test_propertyX(), that run some asserts (see the short sketch after this list). If you want to get fancy with setup and teardown, you can, but it's often not needed.
  4. Including data files in your installed package is kind of painful, and is one of the things that varies between distutils, setuptools, etc. mdtraj is using setuptools' pkg_resources internally. If you can avoid distributing data files, that's good. Also, documenting exactly what all of the data files are is a never-ending pain in the rear.
  5. doctest is crap. The whole concept of checking for string equality is stupid, especially for numerical code on floating-point operands. mdtraj has had multiple "test failures" from doctest where it'll say something stupid like "array([0.0, 0.0, 0.0]) == array([1.53e-17, 1.53e-17, 1.53e-17])" failed, because the "expected output" was created with slightly different build settings (numpy ABI, etc.), so a small change in floating-point math triggers an error.
  6. Unit tests are good. Every time you fix a bug, add a unit test that makes sure that whatever aspect of the code was bugging out earlier is now tested.
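
To illustrate points 2 and 3, a minimal nose-style test using numpy.testing might look like this (a sketch only; the test name and tolerance are made up, not part of pymbar's test suite):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

def test_weighted_mean_matches_plain_mean():
    # nose discovers any module-level function named test_*; no TestCase needed.
    np.random.seed(0)                 # fixed seed keeps the data deterministic
    x = np.random.normal(size=1000)
    w = np.ones_like(x) / x.size      # uniform weights

    # Clearer, and far more informative on failure, than
    # "assert np.linalg.norm(np.dot(w, x) - x.mean()) < 1e-8"
    assert_array_almost_equal(np.dot(w, x), x.mean(), decimal=10)
```
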
rmcgibbo commented 11 years ago

Also, I would recommend switching your docstrings into numpydoc format. That way, you can pretty easily use the sphinx autodoc tools to make pretty good api documentation for free. https://github.com/numpy/numpydoc
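
For anyone unfamiliar with the format, a numpydoc-style docstring looks roughly like this (the function itself is illustrative, not one of pymbar's):

```python
def free_energy_differences(u_kn, N_k):
    """Estimate dimensionless free energy differences between states.

    Parameters
    ----------
    u_kn : numpy.ndarray, shape=(K, N)
        Reduced potential energies of N samples evaluated at each of K states.
    N_k : numpy.ndarray, shape=(K,)
        Number of samples drawn from each state.

    Returns
    -------
    Delta_f : numpy.ndarray, shape=(K, K)
        Estimated dimensionless free energy difference between each pair of states.
    """
    raise NotImplementedError  # illustration only
```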

kyleabeauchamp commented 11 years ago

@jchodera @mrshirts Any thoughts on Robert's comments here? My vote is to start reformatting our tests to look like MDTraj.

This change will likely cause a couple of days of pain, but in the long run I think it could be really great to have all our MD python packages behaving similarly from the perspectives of both users and developers.

jchodera commented 11 years ago

I am fully in favor of this! The travis-CI scripts should be updated to perform automated testing. And, of course, we need to rework all the tests to indicate whether they succeeded or failed.

mrshirts commented 11 years ago

Based on the other conversations, I think that revising the tests is a great next step.

In terms of testing functionality, harmonic-oscillators.py has simple, quick tests of everything. For simple success/failure, it's simply a matter of fixing the random seed so that the results are reproducible.

It's still useful to run with different random number seeds, since you can check that the errors stay within 1-2 sigma each time. The automated version of this distribution check is intended to be harmonic-oscillators-distributions.py, but that's not as complete.

The current code is very robust, if a bit ugly and slow (you should have seen it before!), and I think that robustness is more valuable than a speedup. In very few cases is speed a limiting factor now.

The best way to make sure that rewrites don't break the robustness is to ensure that all of the failure modes that required the current approaches are tested somewhere. So I think there is significant downside to a complete rewrite until we're sure we can detect when new algorithms are likely to fail for some numerical stability / overflow / underflow reason.

Besides the quick tests, there need to be larger tests to handle all the robustness problems we encounter. This will almost certainly require storing the data, because many of them can't be simply regenerated with harmonic oscillators. Let me know where to commit them and I will, though my time is about 2x overcommitted, and before Kyle started looking at this, I hadn't scheduled significant time to work on this until Nov/Dec.

jchodera commented 11 years ago

In terms of testing functionality, harmonic-oscillators.py has simple, quick tests of everything. For simple success/failure, it's simply a matter of fixing the random seed so that the results are reproducible.

I think we still need to add an actual test that computes whether all of the quantities are within the expected range (say, six sigma) of the analytical results.
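
A minimal sketch of that kind of check, with made-up names and a self-contained example in place of the actual harmonic-oscillators.py quantities:

```python
import numpy as np

def assert_within_sigma(estimate, stderr, analytical, n_sigma=6.0):
    """Fail if the estimate is more than n_sigma standard errors from the
    analytically known value."""
    deviation = abs(estimate - analytical) / stderr
    assert deviation < n_sigma, (
        "estimate %.6f is %.1f sigma away from analytical value %.6f"
        % (estimate, deviation, analytical))

# Self-contained example: sample mean of unit-variance normals vs. the true mean.
np.random.seed(1)
x = np.random.normal(loc=0.0, scale=1.0, size=1000)
assert_within_sigma(x.mean(), x.std(ddof=1) / np.sqrt(x.size), 0.0)
```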

mrshirts commented 11 years ago

Currently, they are all within 1-2 sigma, with roughly the proper distribution, so a reproducibility test will capture that.

Statistical tests should be statistical.

An automated test of statistical accuracy should test something more like a p-value, rather than a strict sigma cutoff, and the harmonic-oscillators-distributions.py code is partly designed for this.

Strict sigma cutoffs are not a good idea: data that is 6 sigma off is probably worthless, but lots of times things are greater than 2 sigma.

jchodera commented 11 years ago

Statistical tests should be statistical.

I could not disagree more. There is absolutely no value in a test whose failure might mean nothing is broken and it simply failed due to a statistical fluke. If a test fails, it should mean something is broken. That's why we should use a p-value threshold strict enough that no failures are ever expected by chance in the entire lifetime of the project (potentially with the appropriate Bonferroni correction for the number of tests we run).

mrshirts commented 11 years ago

I could not disagree more.

I think we need to be clearer about what each of us is saying before we decide whether we disagree or not :)

By saying a statistical test should be statistical, I mean that we have to run a very large number of repetitions -- we can't look at a single repetition. Certainly with a single repetition, if you test against a strict sigma cutoff, it will either be too strict (fail frequently) or be worthless, because bad code/data will pass. That's what the harmonic-oscillators-distributions.py test is doing -- essentially it's doing q-q testing (which could be improved).

However, you don't want or need to run a large number of repetitions if you are just checking whether you broke the code. For those tests, you want a deterministic test -- any change outside rounding error is bad, in either the code or the analytical error. Statistical testing only needs to be done when the algorithm changes in a way that is expected to change the answers. Code changes that don't change the expected answers don't need the extra statistical testing.
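
A sketch of that deterministic flavor of test (fixed seed, stored reference, tolerance at the rounding-error level; the estimator and reference here are stand-ins, not real pymbar values):

```python
import numpy as np
from numpy.testing import assert_allclose

# In a real regression test REFERENCE would be a number (or array) recorded when
# the test was written; it is recomputed here only so the sketch runs as-is.
np.random.seed(42)
REFERENCE = np.log(np.mean(np.exp(-np.random.normal(size=100))))

def test_fixed_seed_regression():
    np.random.seed(42)                      # same seed -> bit-identical input data
    x = np.random.normal(size=100)
    estimate = np.log(np.mean(np.exp(-x)))  # stand-in for the real estimator
    # Any deviation beyond rounding error means the code's behavior changed.
    assert_allclose(estimate, REFERENCE, rtol=0.0, atol=1e-12)
```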

Not sure if I'm being quite clear about what I mean, but this is closer :)

jchodera commented 11 years ago

We can run one or many repetitions of a statistical test, but testing whether the results are within an expected tolerance is still inherently statistical.

Suppose we ran 100 replicates of a statistical test and used a p-value of 0.05 without correction: on average, we'd see five failures.

If we used a p-value of 0.05 with a Bonferroni correction, there would still be up to a 5% probability of at least one failure.

My point is just that, no matter how many tests/replicates we run, we need to use a p-value threshold so strict that a chance failure is incredibly unlikely to ever occur, so that we know that when a test fails, something is wrong. If we expect to have fewer than 1e5 commits, then we can use a p-value of 1e-6 and correct it for the number of statistical tests evaluated during automated testing. That will ensure that failures are real failures.
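
To make that concrete, here is one way to turn a project-lifetime failure budget into a per-test sigma threshold (a sketch; the 1e-6 budget and the count of 50 tests are illustrative numbers):

```python
from scipy.stats import norm

p_project = 1e-6      # acceptable chance of any false failure per test run
n_tests = 50          # number of statistical checks evaluated per run
p_per_test = p_project / n_tests     # Bonferroni correction

# Two-sided threshold: fail a check only if it deviates by more than this many sigma.
z_threshold = norm.isf(p_per_test / 2.0)
print("fail if |deviation| > %.2f sigma" % z_threshold)  # about 5.6 sigma here
```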

kyleabeauchamp commented 11 years ago

So I think we might be over-engineering this a bit. The principle "incisive tests with few false positives" sums up what we want, but we probably don't need to spend a lot of time designing the p-value tests. Just having more tests is a major improvement...

kyleabeauchamp commented 11 years ago

What I'm trying to say is--we can worry more about this when we come to it.

mrshirts commented 11 years ago

Right, I think the two obvious steps are:

1) take the well-tested, well-validated tests in harmonic-oscillators.py and make them easier to use, especially in automated ways.

2) assemble a more rigorous stress test into which we throw essentially every case people have emailed us where they broke MBAR (trying to keep each of those data sets down to a few MB). Make it easy to run these, but they don't need to be broken into unit tests yet -- we just need reproducibility. Once we have a framework set up, I'll dump tests in.

Then we start going crazy on code improvements.

mrshirts commented 11 years ago

Yeah, I'm thinking more about the false negatives. That's what really can kill things. I'd rather have something flagged as being potentially bad than something bad not being flagged at all.

jchodera commented 11 years ago

2) assemble a more rigorous stress test into which we throw essentially every case people have emailed us where they broke MBAR (trying to keep each of those data sets down to a few MB). Make it easy to run these, but they don't need to be broken into unit tests yet -- we just need reproducibility. Once we have a framework set up, I'll dump tests in.

I'd prefer to keep the main distribution under 1 MB if at all possible. Maybe we can keep the tests that involve datasets in another repository (e.g. pymbar-tests)?

mrshirts commented 11 years ago

Yes, this is exactly what I was intending; it's somewhere earlier in the thread! This is intended to be a separate distribution. Normal users don't need to stress test. If you actually want to develop pymbar, you'll be willing to download more data.

kyleabeauchamp commented 10 years ago

I think this is done in #46