Closed: jchodera closed this issue 10 years ago.
What's the advantage of using setuptools here? My worry is that the sad state of python packaging might indicate limits of what you can do using setuptools.
I don't think we've been using setuptools integration for anything--we've just been using nose.
setuptools supports running tests in a tests/ subdirectory automatically via python setup.py test, while distutils does not. There's probably a better testing system we should be using, though. I know very little about good testing practices for Python other than doctest and manually-run integration tests, but would like to move to fully automated testing at three levels.
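For concreteness, a minimal setuptools-based setup.py wired to run a nose test suite might look like the sketch below; the package name, version, and dependencies are placeholders, not pymbar's actual configuration:

```python
# setup.py -- minimal sketch using setuptools (placeholder names, not pymbar's real setup.py)
from setuptools import setup, find_packages

setup(
    name="mypackage",            # placeholder
    version="0.1.0",
    packages=find_packages(),
    install_requires=["numpy"],  # setuptools-only feature
    tests_require=["nose"],
    # With setuptools, `python setup.py test` collects and runs the test suite;
    # here test collection is delegated to nose.
    test_suite="nose.collector",
)
```

With a layout like this, `python setup.py test` runs everything nose can discover under the source tree, which distutils alone does not support.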
What do you guys use?
Our tests aren't perfect, but we have some doctests, unit tests, and integration tests.
The best places to look are on MDTraj.
You can see our travis integration here:
https://github.com/rmcgibbo/mdtraj/blob/master/.travis.yml
The tests are here:
https://github.com/rmcgibbo/mdtraj/tree/master/MDTraj/tests
Here is the reference data:
https://github.com/rmcgibbo/mdtraj/tree/master/MDTraj/testing/reference
Also, my knowledge of testing is limited to following the templates that Robert set up, so I've never actually done a detailed comparison. In general, though, Robert chooses "industry standard" best practices.
Robert also likes to use a bunch of "helper" functions for equality comparison and reference data loading. I'm not sure if this is necessary or not, but you could probably steal some things like these:
```python
from mdtraj.testing import skipif, get_fn, eq, slow
```
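For illustration, helpers along those lines could live in a project-local testing module. The function names below mirror the MDTraj imports above, but the bodies are a hypothetical sketch rather than MDTraj's actual implementation:

```python
# testing utilities -- hypothetical sketch modeled on MDTraj-style helpers
import os
import numpy as np

REFERENCE_DIR = os.path.join(os.path.dirname(__file__), "reference")

def get_fn(name):
    """Return the full path to a named reference-data file (assumed layout)."""
    fn = os.path.join(REFERENCE_DIR, name)
    if not os.path.exists(fn):
        raise IOError("reference file %s not found" % fn)
    return fn

def eq(a, b, decimal=6):
    """Array/scalar equality check with numpy.testing's informative error messages."""
    np.testing.assert_array_almost_equal(np.asarray(a), np.asarray(b), decimal=decimal)
    return True
```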
Thanks, @kyleabeauchamp! I've set up travis-ci to test the build/install process with pymbar already (see the badge on the README.md on the main page), but it currently just runs two random tests to see if they complete. I'll dig into the examples from MDTraj to see if we can use a similar style for pymbar and yank.
This test suite is fairly straightforward and might be a good model:
https://github.com/rmcgibbo/mdtraj/blob/master/MDTraj/tests/test_topology.py
One more thing that might be of interest: the python-netcdf4 people just uploaded a new (broken) version, which recently killed our MDTraj tests. If you run into this issue, see https://github.com/rmcgibbo/mdtraj/issues/150
Here are my two cents, since I saw this linked on rmcgibbo/mdtraj#150:
- Use setuptools rather than plain distutils, and not just for setup.py test: it brings lots of other fixes and improvements, including versioned installs, the ability to use setup.py develop during development, the install_requires keyword to setup(), and others.
- Don't use numpy.distutils. It is crap; @msultan and I have found multiple bugs.
- numpy.testing is your friend. Functions like np.testing.assert_array_almost_equal are much clearer than something like assert np.linalg.norm(x - y) < 1e-8, and the np.testing methods also give nice, informative error messages when they break, e.g. what percent of the array entries differ.
- nose is great. Using unittest.TestCase directly is lots of extra boilerplate. With nose, your tests can just be functions defined at the module level, like test_propertyX(), that run some asserts (see the sketch after this list). If you want to get fancy with setup and teardown, you can, but it's often not needed.
- Also, I would recommend switching your docstrings to numpydoc format. That way, you can pretty easily use the sphinx autodoc tools to make pretty good API documentation for free: https://github.com/numpy/numpydoc
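As a concrete (made-up) illustration of the nose + numpy.testing style described above, a test module can be as simple as:

```python
# test_example.py -- module-level test functions that nose discovers automatically (made-up example)
import numpy as np
from numpy.testing import assert_array_almost_equal

def test_roundtrip_identity():
    x = np.linspace(0.0, 1.0, 11)
    y = np.sin(np.arcsin(x))
    # On failure, numpy.testing reports which elements differ and by how much,
    # unlike a bare `assert np.linalg.norm(x - y) < 1e-8`.
    assert_array_almost_equal(x, y, decimal=10)
```

Running `nosetests` in the package directory picks up any `test_*` function like this without TestCase boilerplate.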
@jchodera @mrshirts Any thoughts on Robert's comments here? My vote is to start reformatting our tests to look like MDTraj.
This change will likely cause a couple of days of pain, but in the long run I think it could be really great to have all our MD python packages behaving similarly from the perspectives of both users and developers.
I am fully in favor of this! The travis-CI scripts should be updated to perform automated testing. And, of course, we need to rework all the tests to indicate whether they succeeded or failed.
Based on the other conversations, I think that revising the tests is a great next step.
In terms of testing functionality, harmonic-oscillators.py has simple, quick tests of everything. For simple success/failure, it's just a matter of fixing the random seed so that the results are reproduced.
It's still useful to run with different random number seeds, since you can check that the errors stay within 1-2 sigma each time. The automated version of this distribution check is intended to be harmonic-oscillators-distributions.py, but that's not as complete.
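A minimal sketch of that fixed-seed idea is below; the analysis function here is a stand-in for the real harmonic-oscillators.py machinery, and in practice the reference value would be a constant recorded from a trusted run:

```python
# Hypothetical sketch of a fixed-seed reproducibility check.
import numpy as np

def noisy_estimate():
    """Stand-in for the real analysis; its output depends only on the RNG state."""
    samples = np.random.normal(loc=1.0, scale=0.1, size=1000)
    return samples.mean()

def test_fixed_seed_reproducibility():
    np.random.seed(42)               # pin the seed so the run is deterministic
    result = noisy_estimate()
    np.random.seed(42)
    reference = noisy_estimate()     # in a real test, a stored constant from a trusted run
    # Any deviation beyond rounding error means the code's behavior changed.
    np.testing.assert_almost_equal(result, reference, decimal=12)
```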
The current code is very robust, if a bit ugly and slow (you should have seen it before!), and I think that robustness is more valuable than a speedup. In very few cases is speed the limiting factor now.
The best way to make sure that rewrites don't break this robustness is to ensure that all of the failure modes that required the current approaches are tested somewhere. So I think there is a significant downside to a complete rewrite until we're sure we can detect when new algorithms are likely to fail for some numerical stability / overflow / underflow reason.
Besides the quick tests, there need to be larger tests to handle all the robustness problems we encounter. This will almost certainly require storing the data, because many of them can't be simply regenerated with harmonic oscillators. Let me know where to commit them and I will, though my time is about 2x overcommitted, and before Kyle started looking at this, I hadn't scheduled significant time to work on this until Nov/Dec.
> In terms of testing functionality, harmonic-oscillators.py has simple, quick tests of everything. For simple success/failure, it's just a matter of fixing the random seed so that the results are reproduced.
I think we still need to add an actual test that computes whether all of the quantities are within the expected range (say, six sigma) of the analytical results.
Currently, they are all within 1-2 sigma, with roughly the proper distribution, so a reproducibility test will cover that.
Statistical tests should be statistical.
An automated test of statistical accuracy should test something more like a p-value, rather than a strict sigma cutoff, and the harmonic-oscillators-distributions.py code is partly designed for this.
Strict sigma cutoffs are not a good idea: at 6 sigma the data is probably worthless anyway, but lots of times things are greater than 2 sigma.
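To make the distinction concrete, a check phrased as a p-value rather than a raw sigma cutoff might look like this sketch (the alpha threshold is an arbitrary placeholder):

```python
# Hypothetical sketch: compare an estimate to its analytical value via a p-value.
import numpy as np
from scipy import stats

def check_estimate(estimate, analytical, stderr, alpha=1e-6):
    """Two-sided test that `estimate` is consistent with `analytical` given `stderr`."""
    z = (estimate - analytical) / stderr
    p_value = 2.0 * stats.norm.sf(abs(z))  # two-sided tail probability
    assert p_value > alpha, (
        "deviation of %.2f sigma (p = %.2e) is below the alpha threshold" % (abs(z), p_value))

# A 1.5-sigma deviation passes; a 10-sigma deviation would trip the assert.
check_estimate(estimate=1.015, analytical=1.0, stderr=0.01)
```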
> Statistical tests should be statistical.
I could not disagree more. There is absolutely no value in a test where test failure means that things might not be broken but have failed due to a statistical fluke. If a test fails, it should mean something is broken. That's why we should use a p-value sufficiently strict so that no failures are ever expected by chance in the entire lifetime of the project (potentially with the appropriate Bonferroni correction for the number of tests we run).
> I could not disagree more.
I think we need to be clearer about what each of us is saying before we decide whether we disagree or not :)
By saying a statistical test should be statistical, I mean that we have to run a very large number of repetitions -- we can't look at a single repetition. Certainly with a single repetition, if you test against a strict sigma cutoff, it will either be too strict (and fail frequently) or be worthless, because bad code/data will pass. That's what the harmonic-oscillators-distributions.py test is doing -- essentially q-q testing (which could be improved).
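For instance, a distribution-level check across many replicates, in the spirit of q-q testing (though not the actual harmonic-oscillators-distributions.py code), might look like:

```python
# Hypothetical sketch: test the distribution of standardized errors across replicates.
import numpy as np
from scipy import stats

def test_error_distribution(n_replicates=200, n_samples=100):
    np.random.seed(0)
    analytical = 1.0
    z_scores = []
    for _ in range(n_replicates):
        samples = np.random.normal(loc=analytical, scale=1.0, size=n_samples)
        estimate = samples.mean()                        # stand-in for an MBAR estimate
        stderr = samples.std(ddof=1) / np.sqrt(n_samples)
        z_scores.append((estimate - analytical) / stderr)
    # If the error estimates are honest, the z-scores should look like N(0, 1);
    # a Kolmogorov-Smirnov test condenses that comparison into a single p-value.
    _, p_value = stats.kstest(z_scores, "norm")
    assert p_value > 1e-6, "standardized errors inconsistent with N(0,1): p = %g" % p_value
```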
However, you don't want or need to run a large number of repetitions if you are just checking whether you broke the code. For those tests, you want a deterministic test: any change outside rounding error is bad, in either the code or the analytical error. Statistical testing only needs to be done when the algorithm changes in a way that is expected to change the answers. Code changes that don't change the expected answers don't need the extra statistical testing.
Not sure if I'm being entirely clear about what I mean, but this is closer :)
We can run one or many repetitions of a statistical test, but testing whether the results are within an expected tolerance is still inherently statistical.
Suppose we ran 100 replicates of a statistical test and used a p-value of 0.05 without correction: on average, we'd see five failures.
If we used a p-value of 0.05 with a Bonferroni correction, then there is still a 5% probability that there will be at least one failure.
My point is just that, no matter how many tests/replicates we run, we need to use a p-value that is incredibly unlikely to ever occur so that we know that when a test fails, something is wrong. If we expect to have less than 1e5 commits, then we can use a p-value of 1e-6 and correct this for the number of statistical tests evaluated during automated testing. That will ensure that failures are real failures.
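As a tiny worked version of that arithmetic (all counts below are made-up placeholders):

```python
# Hypothetical back-of-the-envelope for choosing a per-test p-value threshold.
n_statistical_tests = 20            # statistical assertions evaluated per CI run (made up)
n_runs_over_lifetime = 1e5          # rough upper bound on commits / CI runs, ever
target_false_alarms = 0.1           # we want well under one spurious failure over the project's life

# Bonferroni-style correction over every test evaluation we ever expect to perform:
per_test_alpha = target_false_alarms / (n_statistical_tests * n_runs_over_lifetime)
print(per_test_alpha)               # 5e-08, so a threshold around 1e-7 or stricter
```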
So I think we might be over-engineering this a bit here. The principle "incisive tests with few false positives" sums up what we want, but we probably don't need to spend a lot of time designing the perfect p-value tests. Just having more tests is a major improvement...
What I'm trying to say is--we can worry more about this when we come to it.
Right, I think the two obvious steps are:
1) Take the well-tested, well-validated tests in harmonic-oscillators.py and make them easier to use, especially in automated ways.
2) Assemble a more rigorous stress test into which we throw essentially every test that people have emailed us where they broke MBAR (trying to keep each of those data sets down to a few MB). Make these easy to run, but they don't need to be broken into unit tests yet -- we just need reproducibility. Once we have a framework set up, I'll dump tests in (see the sketch after this comment).
Then we start going crazy on code improvements.
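A data-driven harness for step 2 might look something like the sketch below; the pymbar-tests checkout location, the .npz file format, and the run_mbar wrapper are all hypothetical placeholders:

```python
# Hypothetical sketch: replay archived "this broke MBAR" datasets as regression tests.
import glob
import os
import numpy as np

# Assumed layout: a sibling pymbar-tests checkout full of .npz archives (placeholder path).
DATASET_DIR = os.path.join(os.path.dirname(__file__), "..", "pymbar-tests", "datasets")

def run_mbar(u_kn, N_k):
    """Hypothetical wrapper around the pymbar solver, returning free energies f_k."""
    raise NotImplementedError("wire this up to the real pymbar API")

def test_archived_datasets():
    for path in sorted(glob.glob(os.path.join(DATASET_DIR, "*.npz"))):
        data = np.load(path)
        f_k = run_mbar(data["u_kn"], data["N_k"])
        # Reproducibility check against reference results stored alongside the inputs.
        np.testing.assert_array_almost_equal(
            f_k, data["f_k_reference"], decimal=6,
            err_msg="regression on archived dataset %s" % path)
```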
Yeah, I'm thinking more about the false negatives. That's what really can kill things. I'd rather have something flagged as being potentially bad than something bad not being flagged at all.
> 2) Assemble a more rigorous stress test into which we throw essentially every test that people have emailed us where they broke MBAR (trying to keep each of those data sets down to a few MB). Make these easy to run, but they don't need to be broken into unit tests yet -- we just need reproducibility. Once we have a framework set up, I'll dump tests in.
I'd prefer to keep the main distribution under 1 MB if at all possible. Maybe we can keep the tests that involve datasets in another repository (e.g. pymbar-tests)?
Yes, this is exactly what I was intending; it's somewhere in the thread above. This is intended to be a separate distribution. Normal users don't need to stress test; if you actually want to develop pymbar, you'll be willing to download more data.
I think this is done in #46
Using setuptools, we can more easily automate tests via setup.py with "python setup.py test".