ESMCI / cime

Common Infrastructure for Modeling the Earth
http://esmci.github.io/cime

Test cases should themselves be tested to ensure they pick up failures #290

Closed billsacks closed 7 years ago

billsacks commented 8 years ago

This issue is prompted by my finding two bugs in utils/python/CIME/SystemTests/lii.py, both of which caused this test to always pass, even when there were problems with the system under test that should have made it fail.

From talking with @mvertens, it sounds like the new python-based tests have generally been tested to ensure that they pass when they should, but not all of them have been tested to ensure that they fail when they should - i.e., that they actually pick up the sorts of problems they are designed to catch. @mvertens feels that the core tests (ERS, ERP and a few others) have been extensively vetted, but that other tests such as LII and NCK have undergone less careful scrutiny.

Ideally we would have automated tests that test the test scripts, making sure that they fail when they should. But it could be hard to construct a test like this. So, lacking that, I'd be more comfortable if I knew that there had at least been manual testing done on each non-trivial test type to ensure that it indeed fails when it should.

Since this may be a gradual process, I'd suggest using this issue to keep track of what testing has been done. We can close this issue once all non-trivial tests have been tested in this way.

jedwards4b commented 8 years ago

To the extent that this is done, it is done in the scripts_regression_tests script. It is important to distinguish between tests that can (and should) be added as extensions of that script and tests that rely on external components which may or may not be present. The latter might go into a CIME_MODEL-specific extension of the tests.
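For context, scripts_regression_tests is driven by Python's unittest, so an extension of the kind described above might look roughly like the sketch below. Every helper named here (create_minimal_case, force_restart_failure, run_system_test) is hypothetical - just a stand-in for whatever failure-injection and test-driver machinery we settle on.

```python
# Hypothetical sketch only: none of these helpers exist in CIME. They stand
# in for whatever failure-injection and test-driver machinery we settle on.
import unittest


def create_minimal_case():
    # A real helper would create a small case; here it is just a dict of flags.
    return {"force_restart_failure": False}


def force_restart_failure(case):
    # A real hook would deliberately break exact restart in the case.
    case["force_restart_failure"] = True


def run_system_test(test_name, case):
    # A real driver would build and run the case and return the test result.
    # Here we pretend the ERS test notices the injected restart failure.
    if test_name == "ERS" and case["force_restart_failure"]:
        return "FAIL"
    return "PASS"


class TestSystemTestsDetectFailures(unittest.TestCase):
    def test_ers_detects_broken_restart(self):
        case = create_minimal_case()
        force_restart_failure(case)
        self.assertEqual(run_system_test("ERS", case), "FAIL")


if __name__ == "__main__":
    unittest.main()
```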

billsacks commented 8 years ago

I have done manual testing of the ERS and ERP tests from cime5.0.2. The ERS test fails when it should, but the ERP test does not: see issue #295.

billsacks commented 8 years ago

Brainstorming a bit about how we could automate the testing of these high-level system test scripts.... From a quick look, I don't see an easy way to put unit tests around them as is done in some other places in scripts_regression_tests. It feels like, to be confident that they're working right, you need to force a failure in the system under test (i.e., CESM or ACME), and verify that the test correctly reports the failure.

One way to do that would be to put namelist-controlled hooks in one or more model components that trigger them to deliberately act incorrectly with respect to some test (e.g., to deliberately fail exact restart, as I forced here: https://github.com/ESMCI/cime/issues/295, or to deliberately fail an LII test, as I forced here: https://github.com/CESM-Development/cime/pull/145). But it seems confusing and dangerous to embed that kind of capability in the production code - i.e., a flag that deliberately causes it to act incorrectly. At the very least, I'd want an additional namelist flag like 'allow_options_for_test_only', which defaults to false and, when false, does not allow these various other flags to be set. But even with that, options like this muddy up the production code.
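To illustrate the guard I have in mind, here is a rough sketch of the validation; neither these flag names nor this check exist anywhere in CIME or the components - it is purely illustrative:

```python
# Sketch of the guard: refuse any deliberate-failure flag unless
# allow_options_for_test_only is explicitly enabled. All names are hypothetical.
TEST_ONLY_FLAGS = ("force_restart_failure", "force_init_interp_failure")


def validate_test_only_options(namelist_options):
    """Raise if a test-only failure flag is set without the enabling guard."""
    allow = namelist_options.get("allow_options_for_test_only", False)
    offending = [flag for flag in TEST_ONLY_FLAGS
                 if namelist_options.get(flag, False) and not allow]
    if offending:
        raise ValueError(
            "These options require allow_options_for_test_only = .true.: "
            + ", ".join(offending))


# This would raise, protecting production runs from the failure hooks:
# validate_test_only_options({"force_restart_failure": True})
```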

Alternatively, @jedwards4b suggested to me the idea of having SourceMods that force these failures, which can reside in a testmods directory. That avoids muddying the production code, but has a higher maintenance cost, since the SourceMods will need to be updated whenever the production code is updated.

Yet another option would be to build a little component in cime that is used just for this purpose. That component would have namelist-controlled ways to trigger failures upon restart (e.g., by not writing something out to its restart file), failures upon changes in PE count (e.g., by applying a multiplier that depends on PE number), etc. That would take some up-front work to set up, but in principle could make these tests testable. However, it would be hard or impossible to make this work for component-specific tests such as LII.
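As a rough illustration of what such a purpose-built component's failure hooks might look like (everything here is hypothetical, and a real implementation would presumably be Fortran driven by the component's namelist):

```python
# Hypothetical failure-injection hooks for a tiny test-only component.

def restart_fields(state, namelist):
    """Return the fields to be written to the restart file."""
    fields = dict(state)
    if namelist.get("fail_exact_restart", False):
        # Deliberately drop a field so the restarted run cannot reproduce the
        # straight-through run, and an ERS-style test should report FAIL.
        fields.pop("prognostic_tracer", None)
    return fields


def scale_for_pe(value, pe_rank, namelist):
    """Optionally make the answer depend on PE rank, breaking PE-count tests."""
    if namelist.get("fail_pe_independence", False):
        return value * (1.0 + 1.0e-12 * pe_rank)
    return value
```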

Yet another option would be to not test the test scripts in an end-to-end way like this, but instead to have small tests of each piece of a given test script. For example, for the ERP test: you could have one test where you put two different netcdf files in place and confirm that the test reports a failure (i.e., making sure it correctly does the final comparison). You could have another test that ensures the PE layouts are set up correctly by grepping for text in the log files. You could have another test that (somehow) ensures the second run is correctly set up as a restart run. And so on.
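For the first of those pieces, a unit test might look something like the sketch below (assuming the netCDF4 and numpy python packages are available; files_match here is a trivial stand-in for CIME's actual comparison machinery, which goes through cprnc):

```python
import os
import tempfile
import unittest

import numpy as np
from netCDF4 import Dataset  # assumes the netCDF4 python package is available


def write_history_file(path, values):
    """Write a minimal single-variable 'history' file."""
    with Dataset(path, "w") as nc:
        nc.createDimension("x", len(values))
        var = nc.createVariable("field", "f8", ("x",))
        var[:] = values


def files_match(path1, path2):
    """Trivial stand-in for the real comparison step."""
    with Dataset(path1) as nc1, Dataset(path2) as nc2:
        return np.array_equal(nc1.variables["field"][:],
                              nc2.variables["field"][:])


class TestFinalComparisonDetectsDiffs(unittest.TestCase):
    def test_different_files_reported_as_fail(self):
        tmpdir = tempfile.mkdtemp()
        f1 = os.path.join(tmpdir, "run1.nc")
        f2 = os.path.join(tmpdir, "run2.nc")
        write_history_file(f1, [1.0, 2.0, 3.0])
        write_history_file(f2, [1.0, 2.0, 4.0])  # deliberately different
        self.assertFalse(files_match(f1, f2))


if __name__ == "__main__":
    unittest.main()
```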

All of these options seem pretty complicated to me. Maybe I'm missing something that's obvious to other people. In the end, though, it may be easiest to document some manual testing that should be performed whenever each of the test scripts is changed significantly.

jedwards4b commented 8 years ago

It occurs to me that component-specific tests such as the LII test should go into each component's cime_config directory, perhaps in a SystemTests subdirectory. I agree that support for building and running two model configurations is used frequently and is a good candidate for an inherited class. I propose we create a branch in cime and clm to prototype these ideas, starting with the LII test.

mvertens commented 8 years ago

I think these are all good ideas. I am glad that Bill discovered the shortcomings in our test verification before cime5.0.x was heavily used in both CESM and ACME.

My preference would be to start by outlining the simplest manual tests that would verify that each system test fails when it is expected to fail. I would then like to make sure that all of these manual tests are run on the latest cime5.0.x version as soon as possible, so that the tests going into both ACME and CESM report successes that are actually successes. After that, I would start with what Jim has suggested for the LII test as the first prototype.


rljacob commented 8 years ago

Could the dead models be used to force error conditions? No one but us uses those.

billsacks commented 8 years ago

Recent changes to the test infrastructure made by @jgfouca (and maybe @jedwards4b?), together with PR #404, help make it so that, moving forward, individual tests can be quite simple. Thus, we can be pretty confident that specific tests built on top of this are working correctly as long as the underlying infrastructure is working correctly. I have added unit tests of the new system_tests_compare_two.py in #404, and there are already some system tests of system_tests_common.py.

Thus, what I see as needed at this point is:

  1. Adding more complete test coverage of system_tests_common.py. For example, it appears that _component_compare_test is untested (as evidenced by a recent issue). (I'm fine with waiting for that to be covered by tests until it is migrated into python, if that's easier.)
  2. Migrating existing tests to use the new infrastructure, ideally making existing tests nearly as simple as lii.py in #404 (see the sketch after this list for the kind of simplicity I mean). As suggested by @mvertens, I will open issues for this after #404 comes to master.
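To make item 2 concrete, here is a self-contained sketch of the kind of simplicity I mean. This is not CIME's actual SystemTestsCompareTwo API - the base class below just captures the pattern, so that a concrete test only has to say how its two runs differ:

```python
class CompareTwoBase:
    """Run the model twice with different setups and compare the results."""

    def _setup_run_one(self):
        raise NotImplementedError

    def _setup_run_two(self):
        raise NotImplementedError

    def _run_model(self):
        # The real infrastructure builds and runs the case; this placeholder
        # just returns a fake "history" value.
        return {"field": 1.0}

    def run(self):
        self._setup_run_one()
        hist1 = self._run_model()
        self._setup_run_two()
        hist2 = self._run_model()
        return "PASS" if hist1 == hist2 else "FAIL"


class LIILikeTest(CompareTwoBase):
    """A concrete test only describes how the two runs are configured."""

    def _setup_run_one(self):
        self.use_init_interp = False

    def _setup_run_two(self):
        self.use_init_interp = True
```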

I'm removing myself as assignee on this. Do others feel this issue is worth keeping open? Personally, I'm fine with closing it at this point.

jedwards4b commented 8 years ago

I'm fine with closing it.