etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. At its core it is an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for several versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

install some unit testing system #17

Open urbach opened 12 years ago

urbach commented 12 years ago

check cunit

kostrzewa commented 12 years ago

I will look into a number of options for this and document them here.

kostrzewa commented 12 years ago

I have looked at the CU, Check, Unity and CUnit unit testing frameworks as listed on Wikipedia.

I've come to the conclusion that CU is probably the best fit for our project because it does not require elaborate adjustments to the build system (such as a complete autotools build), it is very much standard C (a couple of macros) and it looks very simple to use.

I have created a sample unit test and a test for the _su3_assign macro in my unit-testing branch. Note that CU, being very simple, has no real automation tools. (Not that those would buy us much, since we use so many macros.)

Writing these tests is a tedious process, but it will probably turn out to be beneficial, especially during architectural changes (such as a new SSE implementation or new BlueGene code). There are also some questions remaining when it comes to initializing data structures and defining compiler options (such as -DSSE), but I will look into these.

I have not yet been able to find a suitable framework for running full application tests with reference input and output (such as the suggested sample files for the different solvers), but I have not looked hard enough yet. It might even make sense to roll our own with a few bash or python scripts, because our needs are limited and specialized.

kostrzewa commented 12 years ago

@etmc I have written the first "real-world" unit test, inspired by Albert's check for his su3 exponentiation routine in the smearing framework, although I have only implemented the first test. If exposu3 were ever replaced, this unit test could be used to check that the replacement still works correctly.

13:59 bartek@chronos ~/projects/tmLQCD.kost/build/tests $ ./test_su3   
 -> SU3_ALGEBRA [IN PROGESS]
    --> Running su3_assign...
    --> Running su3_expo_positivedet...
 -> SU3_ALGEBRA [DONE]

==================================================
|               |  failed  |  succeed  |  total  |
|------------------------------------------------|
| assertations: |       0  |        2  |      2  |
| tests:        |       0  |        2  |      2  |
| tests suites: |       0  |        1  |      1  |
==================================================

Makefile.tests could use some work to make it friendlier, but it works, and adding a new test harness takes very little effort (3 lines).

I'm going to try to create a more elaborate test of something in the near future.

deuzeman commented 12 years ago

Bartek, would you feel comfortable merging your unit testing branch into the master already? It would be good to have some sample code, so that we can start adding unit tests of our own, even if it isn't perfect yet.

kostrzewa commented 12 years ago

Sure, I think the basic framework and build system additions are as ready as they will be. I was just not sure whether CU is really the framework we want to use because I don't know how well it will work for testing more involved modules (*). I will prepare a little documentation and a pull-request for tonight.

(*) Because of its simplicity it has a lot of limitations when it comes to setting up the test environment in a given state. Also, it doesn't have any support for mocking and stubbing, which are fancy words for putting in place dummy interfaces to functionality that the code under test might depend on, but that you don't want to include in a test for reasons of separation. Finally, it makes use of fork(), which makes it difficult to use debuggers properly.

deuzeman commented 12 years ago

I understand your reservations with respect to CU, but it's certainly going to beat not having anything. If we're going to run into its limitations, that's probably going to be soon enough and then we'll know what to look for in a replacement. And it still shouldn't be too hard to port by then -- it's writing the tests themselves that takes most of the time and I imagine the changes there will be fairly minimal.

deuzeman commented 12 years ago

Now that we have CU available, I think the main problem is defining some framework for testing code involving lattice-wide operations. I see few ways around having some small lattice available, on which test calculations can be done.

Though such a configuration should be tiny to be of practical use, it needs to be large enough to still be a proper test of code. I think the smallest volume we can get away with is 4^4, but maybe it would be good to have 4^3x6 or 4^3x8 to avoid at least one accidental symmetry and have the possibility of calculating a slightly serious correlator.

One test case should always be the unit field, so that we can compare results to those for the free field case. Maybe some fixed random field is also useful, though I imagine it won't add much to having some 'real' configuration. For the first case, we can just write code to generate the field -- it's probably even available already. But for other cases, I suppose we upload some standard test configuration? Or maybe it is already available and I'm just overlooking it?

With this done, we probably want some routine to compare two (gauge) fields numerically? Just looking at things like the plaquette term could work in practice, but a full check covers all the bases. But then we quickly run into the need to introduce additional binary files when adding tests, which could quickly become rather messy. Not sure what is best here...

Finally, is there any way in which we can run checks on the MPI code?

kostrzewa commented 12 years ago

Let me begin by saying that unit tests are not really the place to run calculations. You want a unit test to complete in milliseconds at most (so that you can run them regularly, ideally after every change, and so that you can run thousands of them covering the whole codebase). Ideally, you also want every test to be independent, allocating its own memory and so on, which slows this kind of testing down even further. Having said that, we can of course misuse the idea for this type of integration testing.

As for numerical comparisons, CU already includes this in the form of regression checking (see README.unit-testing). Simply print an array (or list) of floating point numbers representing the field to stdout or stderr in your test case, and the check-regressions script will compare known-good output to whatever your test code produced, up to a configurable precision. We might have to adjust this for scientific notation, though (not sure). One needs to decide what one wants to test for: code which completes all tests but still shows regressions in its output, or code which fails tests, indicating problems immediately.

I'm still looking into integration testing frameworks which would allow this type of integration test to be run automatically on a regular basis; unit tests, at least according to the dogma, are not the place to do this sort of thing. They are meant to test the most basic functionality: interfaces, "constructors and destructors", algebraic functions and so on.

I don't have an answer on the MPI issue yet. So far I've found that some MPI implementations are OK with fork(), while most are not. This does not prevent us, though, from writing a test harness in which only one rank runs assertions while all the other ranks simply provide the data structures and MPI exchange glue, much like we handle print statements in the main codebase.

deuzeman commented 12 years ago

I think I disagree with you on what constitutes a unit test. As I see it, unit tests check isolated parts of the code from a purely functional point of view -- given an interface with specific input and output, does a certain part of the code behave as required? We have quite a lot of code implementing more or less complicated mathematical transformations and those mathematical properties actually form a very clean definition of an interface. There are simply no cases where their use in practice should ever see them deviate from those properties, so we really want to check consistently if this is what they do.

Simple cases, like SU(3) algebra, are straightforward to test and we should try to get those checks in place. But they're actually simple enough that we could probably spot and solve problems somehow even without unit tests. They are also largely static at this point, so writing additional tests is going to be less productive short term. Performing a step of APE smearing, on the other hand, can fail in much more subtle ways. Even if it uses SU(3) algebra internally and is therefore composite to some extent, it is still a well defined mathematical component of other operations whose functionality can be compromised independently. Not extending unit test coverage to this type of routines means that we're essentially ignoring some of the most serious potential sources of errors.

Perhaps we want to introduce two tiers of tests, to separate the slower lattice-wide calculations from tests that are more trivial to run. We can even think of them as integration tests. But I believe anything that is supposed to implement a well defined mathematical operation -- anything which can be described by its expected functionality alone -- should be tested for that functionality on a regular basis. I would say we just have to be as clever as possible in implementing those tests, to minimize the computational burden they entail.

Maybe the solution is, as Carsten at some point suggested, to have all tests run nightly at some central point. If we then publish those results somewhere, or maybe even have any failing tests posted to the mailing list, that should be enough to track any issues without having to worry about running time too much. Even if some form of computation is required, it's unlikely to take more than a couple of seconds per test on small lattices. So even a large suite of tests -- and we don't actually have such a suite yet -- is unlikely to take more than a couple of hours to run. That should be doable.