erdc / proteus

A computational methods and simulation toolkit
http://proteustoolkit.org
MIT License

Improving Proteus Testing Framework #414

Open alistairbntl opened 8 years ago

alistairbntl commented 8 years ago

I'm opening this issue to start a forum for discussing the Proteus testing module.

In general there are a few ways I think the testing framework can be improved.

  1. The use of classes with setup and teardown methods.
  2. Using attributes so that different configurations of tests can be run easily. In particular, I think tests should be flagged with tags like fast, slow, stokes, mesh, etc. This would allow us to run something like `make test mesh` to run all the mesh-related tests quickly while developing, and then at the end of the day run `make test all` to exercise the entire suite (see the sketch after this list).
  3. Incorporating some of the air-water-vv tests into the Proteus main test directory. The only real issue is that some of these problems are a bit too large to run with the Travis build on GitHub.
  4. This is a bigger-picture item, but some kind of dedicated server would allow us to consistently run the larger tests associated with air-water-vv etc. whenever new features are added.
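
Here is a minimal sketch of what 1-2 could look like using nose's attrib plugin; the class name, tags, and test body below are placeholders for illustration, not anything that exists in the repo yet:

```python
import unittest

from nose.plugins.attrib import attr


@attr('mesh', speed='fast')  # hypothetical tags; select with: nosetests -a speed=fast
class TestMeshSanity(unittest.TestCase):

    def setUp(self):
        # shared fixture built before each test in the class
        self.nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

    def tearDown(self):
        # release the fixture (temporary files, handles, etc. would go here)
        self.nodes = None

    def test_node_count(self):
        self.assertEqual(len(self.nodes), 3)


if __name__ == '__main__':
    unittest.main()
```

Tagged subsets could then be selected with something like `nosetests -a speed=fast` or `nosetests -a mesh`.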

I've incorporated some of these ideas (notably 1-2) in PR #411 if you're interested in my current thinking.

Thoughts?

nerisaurus commented 8 years ago

So we use nose, right? Documentation for 1.3.7 is here, for people wanting to look at that: http://nose.readthedocs.io/en/latest/testing.html

Are we using any plugins with it that should be mentioned (especially if someone wants to bring in a different, potentially incompatible one)?

Also, not to be that person... but should we be worried that nose is no longer being supported? Converting sounds like a pain (although it looks like several of the newer frameworks at least claim to be backwards compatible with nose tests, with pytest being particularly noted for this).

So I think a major question to start with for the attributes is: what counts as fast and what counts as slow? Where do we draw those lines, and should there be more divisions in between? Two tags give three categories (fast, slow, and neither); would it be valuable to anyone to have more than one kind of "fast"? Depending on how the air-water-vv tests turn out, I could see requesting more gradation on the slow side of things, to differentiate between the completely-impossible-for-Travis-and-unpleasant-to-manually-run problems and the will-take-literal-days problems.

The other attributes seem a lot more flexible and could be determined ad hoc: a few categories to lump together automated sets of tests, plus a tag whenever a certain component/problem accumulates a lot of tests. It might be nice to have documentation listing all current tags, though; that way people can add them lazily but (as long as they all update the documentation) avoid duplicating them.

adimako commented 8 years ago

@alistairbntl @nerisaurus I have not used nose in the WaveTools tests, exactly for this reason. However, I do appreciate that it has more options than the unittest module, and there are currently parts of the code that use it. Regarding tests, if I had to put a number on fast / slow, I would say the line is at ~10 mins for this type of test (dambreak, hydraulic structures, etc.).

However, when I set up the versioning procedure at HR Wallingford's cluster (svn & Jenkins) we had 4 groups of tests, ranked with increasing time (No. 1 took 30 mins to complete all tests, No. 4 two days) and decreasing frequency of execution (e.g. No. 1 would run daily, No. 4 once a month or every two months). I have seen this in other codes as well, e.g. Imperial's Fluidity or Telemac, and I believe it is a good approach.

nerisaurus commented 8 years ago

@adimako Do you know how long the current Travis tests take? I'd guess about 20-30 minutes (for each of the two?) total from the few times I've looked at them, but I've never watched them through or looked into their details. 4 groups sounds good, though: a series of quick modular tests for people to run every few commits as a sanity check on what they're working on, a longer set for the Travis tests, and then longer ones to be run regularly or at specific release milestones.

adimako commented 8 years ago

@nerisaurus my impression is a bit less than that, but that seems about right. I believe most of this is downloading and compiling the code; the tests themselves should take much less.

alistairbntl commented 8 years ago

@adimako @nerisaurus - As it stands now the Travis tests are pretty quick, but if we start adding larger problems (like those from air-water-vv) and adding depth to the testing suite we will use up the time available on Travis without much problem.
Thus, I like the general framework suggested above by @adimako and @nerisaurus of having three testing suites run at different intervals. This might lend itself to a hierarchy like fast ≲ 10 seconds, 10 seconds < medium < 10 minutes, 10 minutes < slow < 1 hour. Realistically, if something takes much more than an hour to run, I think it will be hard to include in a regular testing cycle. Other issues we need to consider are (a) finding server resources to run the larger test suites and (b) incorporating the entire build process into the test framework. I think @cekees will have insights to share about this.

As for using nosetests, @nerisaurus, the lack of ongoing support has been a concern to me as well. I'm new to Python's testing tools, so I have no major attachment to nose and am more than open to other frameworks. I like the setup / teardown options and I think the attribute plug-in is very valuable too, but I don't see why other suites can't match this functionality.
@adimako - you mentioned that you have not used nose in the WaveTools tests. Do you have any thoughts on a possible alternative? @cekees - what are your thoughts on nosetests?

nerisaurus commented 8 years ago

I've heard good things about pytest in my brief look around, including claims that it can run tests set up with nose with very few changes needed (it also has a way of setting attributes, although the syntax differs a bit - and it definitely has setup/teardown). More on that here.
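
For reference, a hedged sketch of how the same kind of tagging might look with pytest markers; the marker and test names here are placeholders, not an agreed scheme:

```python
import pytest


@pytest.mark.fast  # select with: py.test -m fast
def test_quick_sanity_check():
    assert 1 + 1 == 2


@pytest.mark.slow  # marking the class tags every test inside it
class TestLongRunning(object):

    def setup_method(self, method):
        # rough analogue of unittest's setUp
        self.data = list(range(10))

    def teardown_method(self, method):
        self.data = None

    def test_sum(self):
        assert sum(self.data) == 45
```

Selection would then be `py.test -m fast` or `py.test -m "not slow"`.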

Are there any situations where we might want a problem categorized in a "faster" section because of how vital it is? That is to say, with @alistairbntl's setup, there might be a couple-of-minutes-long test that is important and general enough (and efficiently set up and torn down alongside other small tests) that we want to mark it as "fast". I'm worried about the potential need for multiple measurements of how fast something is, where one is pure speed and the others are value-per-time based on various different requirements for value.

adimako commented 8 years ago

@alistairbntl This is a bit of a messy test set, but it helps me sleep at night: https://github.com/erdc-cm/proteus/blob/master/proteus/tests/test_wavetools.py It uses the built-in Python unittest module.

So first you import the module: `import unittest`

Then you define each class of tests like this: `class TestAuxFunctions(unittest.TestCase)`

Each function in the class that is a test needs to start with `test` and take the class instance, e.g. `testVDir(self):`. By adding these lines at the end, `if __name__ == '__main__': unittest.main(verbosity=2)`, it will run all functions starting with `test` as soon as you run the file as a Python script.
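
Putting those pieces together, the skeleton being described is roughly the following (the test body is only a placeholder, not the actual WaveTools checks):

```python
import unittest


class TestAuxFunctions(unittest.TestCase):

    def testVDir(self):
        # placeholder assertion; the real test exercises a WaveTools helper
        self.assertTrue(True)


if __name__ == '__main__':
    unittest.main(verbosity=2)
```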

cekees commented 8 years ago

Sorry for being late to the party on this one. I think pytest looks really good; the last time I thought about this issue, I don't remember it being so complete. I think we should switch to pytest while continuing to support the standard unittest and doctest modules (i.e. you don't have to use pytest functionality specifically). There is already a version of pytest in hashstack, so we should just need to add it to the proteus profiles.
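
For what it's worth, a quick sketch of the "you don't have to use pytest functionality" point; the file name below is hypothetical, but the collection behavior and the --doctest-modules flag are standard pytest:

```python
# test_legacy_style.py -- an ordinary unittest.TestCase, no pytest imports
import unittest


class TestLegacy(unittest.TestCase):

    def test_addition(self):
        self.assertEqual(2 + 2, 4)

# pytest collects and runs this file unchanged:
#   py.test test_legacy_style.py
# doctests embedded in modules can be pulled in with:
#   py.test --doctest-modules proteus/
```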

cekees commented 8 years ago

I like the idea of using attributes to classify the tests. For the foreseeable future we'll have to have something like small for Travis or any of the related lightweight, remotely hosted CI tools. On the other end, I would like an attribute for tests that require one or possibly two nodes (e.g. 32-64 cores), reproduce important benchmark results, and would be run at most weekly or possibly only when manually triggered. In between, I could see room for two or three more groups aimed at nightly testing via buildbot buildslaves across a full set of architectures, and probably a 1-2 hour test set that backs up the ~1/2 hour Travis testing. How about fast, slow, and overnight for speed attributes, and small, medium, large, and xxl for memory/processor requirements?
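
If we do go with pytest markers for this, here is one possible sketch of the two attribute axes using the names proposed above; nothing here is settled, and the conftest registration is just one way to keep the tag list in a single place:

```python
# conftest.py -- register the proposed marker names so misspelled tags stand out
def pytest_configure(config):
    for name in ("fast", "slow", "overnight", "small", "medium", "large", "xxl"):
        config.addinivalue_line("markers", "%s: proposed test attribute" % name)
```

```python
# test_benchmark_sketch.py -- one speed tag plus one size tag per test
import pytest


@pytest.mark.overnight
@pytest.mark.xxl
def test_two_node_benchmark():
    # would reproduce a published benchmark on 32-64 cores;
    # select with: py.test -m "overnight and xxl"
    assert True
```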