TestStatus: Still would like clearer distinction between test comparisons and baseline comparisons

billsacks commented 8 years ago

Inspired by @gold2718 's comment here:

https://github.com/ESMCI/cime/pull/392#issuecomment-239626725

where he suggests knowing ahead of time which test phases are expected: I realized that it would still be good to have a clearer distinction made between history file comparisons done as an integral part of a test (e.g., comparing .base with .rest files for an ERS test) vs. baseline comparisons. Even though these are both about comparing history files, they are fundamentally different in their meaning.

cime4 test reporting made a clear distinction between these, as:

--- Test Functionality  ---:
PASS ERS_D_Ld10.T31_g37_gl5.IGHISTCLM45.yellowstone_intel.clm-glcMEC_decrease.clm2.h0.nc : test compare clm2.h0 (.base and .rest files)
<snip>
--- Baseline Comparison ---:
PASS ERS_D_Ld10.T31_g37_gl5.IGHISTCLM45.yellowstone_intel.clm-glcMEC_decrease.cism.h.nc : baseline compare cism.h (baseline: compare .base file with clm4_5_9_r185 file)

and I feel it would be good to maintain that clear distinction in cime5.

@jgfouca made some steps towards this, by adding suffixes to the single 'COMPARE' phase. But it seems like it would be good to explicitly split this into two phases - a TEST_COMPARE phase and a BASELINE_COMPARE phase - both in the code and in the test reporting. In the test reporting, this will make it easier to do greps to include / exclude particular aspects of the tests. In the code, this will make it easier to add new functionality, such as the functionality proposed by @gold2718 .

jgfouca commented 8 years ago

@billsacks the handling of COMPARE phases in CIME is already awkward and having another phase like COMPARE is not something I'm excited about. If you can describe a specific need/capability that would require this change, would you add a comment hear describing it? I'm going to tentatively close this for now. Feel free to reopen if you think we should reconsider.

rljacob commented 8 years ago

We do need to distinguish between "compare with baseline fail" and "compare within a test fail". The latter is much more serious. In fact, the test should just stop and not bother with baseline compare if the primary test compare failed.

jgfouca commented 8 years ago

@rljacob I have a different PR going in soon that will mark a test with an "internal" diff (non-baseline) as FAIL instead of DIFF.

billsacks commented 8 years ago

I agree with @rljacob here: These phases need to be more clearly distinguished. I'm going to reopen this issue. Yes, they both do "comparisons", but they are doing fundamentally different things, and I feel it's important to clearly distinguish between them in both the code and the test reporting, rather than treating them as flavors of the same thing. Distinguishing between these is just as important as distinguishing between (say) a RUN failure and a BUILD failure.

Maybe I'm missing something, though: I don't understand what difficulty it would cause to have two clearly-distinguished phases, something like BASELINE_COMPARE and TEST_COMPARE.

If you have too much on your plate and don't feel this is at a high level of priority, I can understand that. But I would like to keep this issue open until it is resolved.

jgfouca commented 8 years ago

@billsacks I just realized that a test has at most one baseline compare phase. That makes this easier than I thought.

rljacob commented 5 years ago

Reopening this because I still don't think there is enough distinction between test comparison and baseline comparisons in the TestLog output.

Example: A recent PR caused a PET test to fail in our nightly testing. It was reported as FAIL instead of diff on the dashboard, which is good, but looking at the TestStatus.log which is posted to our cdash board is still a mess. https://my.cdash.org/testDetails.php?test=48339687&build=1673112

As the PR author noted "this looks like a normal baseline diff". And he's right. Here's an abbreviated summary from the log: 2019-07-04 03:19:35: NLCOMP (compare with baselines) 2019-07-04 05:55:00: Comparing hists (compare first run with baselines) FAIL 2019-07-04 05:56:52: Comparing hists (the "base" and "single threaded" compare) FAIL 2019-07-04 05:56:59: Comparing hists (compare second run with baselines) FAIL

The NLCOMP and 2 baseline compares should not be done at all if the PET test itself fails. If the test bases, the baselines compares should should all appear AFTER the PET compares in the log.

rljacob commented 5 years ago

Also the 2 hist compares with baselines are, I assume, comparing each of the two runs in the PET test with their corresponding baselines but that also isn't clear in the output.

billsacks commented 5 years ago

@rljacob part of the mess you're seeing may be due to #2916 . I believe the intent is to only do a single baseline comparison at the end, but there's a bug (in multisubmit tests?) that causes this to be done twice.

That said, I won't argue with the general request for additional cleanup here - I just want to make sure that a distinction is made between general cleanup that's needed for all tests, and problems that are really due to #2916 .

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.

rljacob commented 1 year ago

As someone who looks at TestStatus.log output a lot (on our cdash page), would still like this.

ESMCI / cime

TestStatus: Still would like clearer distinction between test comparisons and baseline comparisons #396