E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

TSC NBFB test: PASS/FAIL criterion needs fixing #4759

Closed mt5555 closed 2 years ago

mt5555 commented 2 years ago

The TSC NBFB test, as implemented in E3SM, is using a very restrictive PASS criterion, and it appears that any non-BFB change will fail. For details, see:

https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/3209166849/NBFB+test+case+study

The test creator @huiwanpnnl's original design was that the test should PASS unless all points in the interval [5min, 10min] of the P_min(%) plot are below the ~0.8 fail threshold. The current PASS/FAIL criterion is not using this metric.

How about a more conservative PASS criterion, which would be that the test should PASS only if no points in the interval [5min, 10min] of the P_min(%) plot are below the ~0.8 fail threshold?
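For concreteness, a minimal sketch of the difference between the two criteria, using made-up P_min values in the [5min, 10min] window and the ~0.8 threshold from the plot (this is not the actual evv4esm code):

import numpy as np

# Hypothetical P_min values (%) at the time steps falling inside [5min, 10min]
p_min_window = np.array([0.3, 0.6, 1.2, 0.4, 0.2])
threshold = 0.8  # the ~0.8 value from the plot; exact value/units are discussed below

below = p_min_window < threshold

# Original design: FAIL only if ALL points in the window are below the threshold
fail_original = below.all()      # False here, so PASS

# More conservative proposal: FAIL if ANY point in the window is below the threshold
fail_conservative = below.any()  # True here, so FAIL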

Pinging @mkstratos and @salilmahajan as owners of the task to implement the PASS/FAIL criterion in the test.

mt5555 commented 2 years ago

@rljacob : Is @mkstratos the correct POC for these tests?

mkstratos commented 2 years ago

@mt5555 I believe I am; my apologies for not acknowledging sooner, but I'm working on this right now. This will have to be fixed in evv4esm (and maybe CIME). Yes, it seems that right now it fails for anything other than b4b, because of this: https://github.com/LIVVkit/evv4esm/blob/master/evv4esm/extensions/tsc.py#L286. It is also evaluating between 10s and 600s, not 300s-600s, and rejecting if there are any points below the threshold (which is set here as 0.005). The simple fix, I think, would be to change the .any() to .all(), but is that the more conservative approach? And is P_THRESHOLD = 0.005 correct?

mt5555 commented 2 years ago

thanks for the update! This isn't critical - I just wanted to make sure it hadn't gotten lost.

I'd defer to @huiwanpnnl for the threshold. From the plots, I thought it was 0.8. The label says that is a %, so maybe that's 0.008 (similar to the 0.005 you mentioned).

I agree - using an "any()" would be the more conservative approach. But if I understood Hui correctly, the original design was an all(): all points in that range have to fail for the test to fail, and as long as there was at least one point above the line, the test would pass.

jhkennedy commented 2 years ago

@mkstratos and @mt5555 this is likely a NCL -> Python translation error by me. Looking back at @huiwanpnnl's original NCL code, I believe it should be an .all():

     l_overall_fail   =  all(nfail(its:ite).ge.1)
     overall_passfail = where( l_overall_fail, "FAIL", "PASS")
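A rough numpy translation of that NCL line (my sketch, assuming nfail counts how many variables fail at each time step and its/ite are the inclusive indices bounding the inspection window):

import numpy as np

# Hypothetical inputs: nfail counts failing variables per time step;
# its/ite bound the [5min, 10min] window (NCL's its:ite slice includes
# both endpoints, hence the +1 on the Python side)
nfail = np.array([0, 2, 1, 3, 1, 0])
its, ite = 1, 4

l_overall_fail = np.all(nfail[its:ite + 1] >= 1)
overall_passfail = "FAIL" if l_overall_fail else "PASS"  # "FAIL" for this sample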

And my .any() in the Python goes back to when I first translated it.

Caveat: I'm not 100% sure I've connected the correct parts of the different code, as it's been 3 years since I've looked at this.

jhkennedy commented 2 years ago

Also, in the NCL code I have, P_THRESHOLD is 0.005 or 0.5% -- it may be in the last few years @huiwanpnnl has found 0.008 to be a better threshold value.

mt5555 commented 2 years ago

@jhkennedy: thanks for taking a look and the feedback. I took a careful look at the plot, which is showing 0.005. I don't know where I got 0.008 from, so we should stick with 0.005.

@mkstratos : do we need a PR to get this into e3sm?

mkstratos commented 2 years ago

In this case, all of the code changes are in evv4esm, so it will only take a PR there and an update to the cime conda environment to be deployed to the supported machines.

xylar commented 2 years ago

Just a note to say that I'm following this discussion. As soon as @mkstratos has a new evv4esm and cime-env PR, I'll be happy to deploy it.

mt5555 commented 2 years ago

@mkstratos have you been able to test this? I can test it with some of my test cases, but I'll probably wait until the cime conda is updated.

This (and another NBFB test) have been failing for many days (weeks?) now on Chrysalis. It looks like a Python error. I'm going to make another issue for that.

mkstratos commented 2 years ago

@mt5555 I have been able to test, but only with some manufactured data. The new conda environment is probably easiest, but evv4esm can be installed on its own. The PR for the fix is here: https://github.com/LIVVkit/evv4esm/pull/9

mkstratos commented 2 years ago

@mt5555 This should be solved with the newest EVV, now in the cime environment, and ready for you to test.

mt5555 commented 2 years ago

I just got a chance to test this with the latest E3SM master. It looks like the new version is still incorrect, and actually worse than before (since now all cases seem to pass, no matter how different).

Here are some results from experiments to see whether TSC can detect a parameter change.

clubb_c_k10h = 0.35 (default): test is a PASS. The pmin plot is as expected (results are BFB):

[P_min plot for clubb_c_k10h = 0.35]

Then we increase the parameter to 0.351. The test is also a PASS. The plot looks good; data in the interval [5min, 10min] are all above the threshold:

[P_min plot for clubb_c_k10h = 0.351]

Increase to 0.36: test is a PASS. But all points in [5min, 10min] are below the threshold, and this should be a FAIL.

[P_min plot for clubb_c_k10h = 0.36]

Increase to 0.38: test is a PASS. This should be a FAIL.

[P_min plot for clubb_c_k10h = 0.38]

mkstratos commented 2 years ago

Apologies @mt5555, I'm not sure why this isn't working. Is the model output saved somewhere that I could test?

mt5555 commented 2 years ago

I've been running on Anvil.

Here's the run:

/lcrc/group/acme/ac.mtaylor/scratch/anvil/TSC_P36x1.ne4_ne4.F2010-CICE.anvil_intel.C.20220629_230543_sksrqr

Link to EVV output for a run with 0.351, which is a PASS (as it should be)

https://web.lcrc.anl.gov/public/e3sm/ac.mtaylor/evv/TSC_P36x1.ne4_ne4.F2010-CICE.anvil_intel.C.20220629_230543_sksrqr/validation/20220629_230543_sksrqr.html

huiwanpnnl commented 2 years ago

Thank you @mt5555 for posting the pmin plots. @mkstratos, the 2nd plot posted above (for 0.351) also has a small issue: all the dots and lines above the dashed line are supposed to be in green. The other 3 plots look good. Perhaps this can provide an additional clue?

mt5555 commented 2 years ago

Here's the output of the 0.38 run, which is a PASS, but I think should be FAIL

https://web.lcrc.anl.gov/public/e3sm/ac.mtaylor/evv/TSC_P36x1.ne4_ne4.F2010-CICE.anvil_intel.C.20220701_164649_xq4n97/validation/20220701_164649_xq4n97.html

data on Anvil: /lcrc/group/acme/ac.mtaylor/scratch/anvil/TSC_P36x1.ne4_ne4.F2010-CICE.anvil_intel.C.20220701_164649_xq4n97

huiwanpnnl commented 2 years ago

Agree with @mt5555. I consider this to be an unambiguous "FAIL" case. All plots look good. Only the test status shown in the table at the beginning of the "Results" section looks incorrect.

mt5555 commented 2 years ago

@mkstratos: just checking in on this - any progress or thoughts on fixes?

mkstratos commented 2 years ago

I'm still investigating, but I think the test is performing "correctly" despite the figures. The p-value figures show the minimum p-value across all variables over both ocean and land, but overall null hypothesis rejection happens iff all the variables at all times in the window reject the null hypothesis (p-value < 0.005). So if even one variable has an "Accept" in the window, it won't affect the minimum p-value, but it means the test passes.
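A made-up illustration of how the plot and the current overall criterion can disagree (hypothetical p-values, threshold 0.005 as above):

import pandas as pd

# Hypothetical p-values for two variables at three time steps in the window
pvals = pd.DataFrame(
    {"var_A": [1e-4, 1e-4, 1e-4],   # rejects at every time step
     "var_B": [0.5, 0.5, 0.5]},     # never rejects
    index=[360, 480, 600],          # seconds
)
threshold = 0.005

# The plotted minimum p-value is below the threshold at every time step,
# so the P_min figure looks like a clear FAIL
print((pvals.min(axis=1) < threshold).all())  # True

# But "all variables at all times reject" is False, so the current
# overall criterion reports a PASS
print((pvals < threshold).all().all())        # False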

The line here: https://github.com/LIVVkit/evv4esm/blob/master/evv4esm/extensions/tsc.py#L289

        domains = (
            null_hypothesis
            .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')
            .applymap(lambda x: x == 'Reject').any().transform(
                lambda x: 'Fail' if x is True else 'Pass'
            )
        )

could be switched back from .any() to .all(), and overall rejection would happen if any variable fails.

The issue with the plotting is separate; all points after the first p-value below the threshold are plotted in red. I can change this so only points below the threshold are plotted in red, and those above are in green. I can also add plots of the p-value for each variable, which will show where the passes/failures are.
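Something along these lines for the point-wise coloring (a sketch with hypothetical arrays, not the actual evv4esm plotting code):

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical P_min curve, time axis, and threshold
seconds = np.arange(20, 620, 20)
p_min = np.random.default_rng(0).uniform(1e-4, 0.02, seconds.size)
threshold = 0.005

below = p_min < threshold
plt.semilogy(seconds, p_min, color="grey", lw=0.8)
plt.scatter(seconds[below], p_min[below], color="red", label="below threshold")
plt.scatter(seconds[~below], p_min[~below], color="green", label="above threshold")
plt.axhline(threshold, ls="--", color="black")
plt.xlabel("seconds")
plt.ylabel("minimum p-value")
plt.legend()
plt.show()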

mt5555 commented 2 years ago

Based on the TSC paper, I believe that test is supposed to fail if p-value < 0.005 for any variable:

[Screenshot of Eq. (7), the PASS/FAIL criterion, from the TSC paper]

That's from page 544, where j is the index over different variables. link: https://gmd.copernicus.org/articles/10/537/2017/
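Spelling that out (my paraphrase of the criterion as discussed here, not the paper's exact notation): with P_j(t) the p-value for variable j at time step t,

    P_{\min}(t) = \min_j P_j(t), \qquad \text{FAIL at time } t \iff P_{\min}(t) < 0.005

so a single variable dropping below the threshold is enough to fail that time step.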

@huiwanpnnl , can you confirm?

rljacob commented 2 years ago

The TSC test is still failing but for a different reason. https://github.com/E3SM-Project/E3SM/issues/5072 .

mt5555 commented 2 years ago

@mkstratos, can you go ahead and make this change ASAP? It's clear from the TSC paper and the results presented in these CLUBB parameter tests that the correct thing to do is FAIL if any point within the time interval is below the threshold. Since pmin is the minimum over all variables, this means FAIL if any point in the pmin plot within the time interval is below the threshold.

The Python code is quite difficult (for me) to parse, so I won't comment on whether this can be obtained by just changing "any()" to an "all()"!

huiwanpnnl commented 2 years ago

Sorry for not having responded earlier (due to COVID then some urgent paperwork).

I won't comment on the python code either as I'm not good at the language. Instead let me provide some clarification on the intended algorithm.

The test is designed to assess PASS/FAIL at 2 levels:

  1. At each single time step, if the null hypothesis is rejected for ANY of the variables, either over land or over the ocean, then we issue a FAIL for that time step. This "ANY", as @mt5555 pointed out, is reflected by the use of pmin in Eq. (7) that Mark posted above. The use of "ANY" is necessary here, because in very short (few-minute) simulations, the solution error often does not have enough time to propagate to many prognostic variables.
  2. After we have assessed PASS/FAIL for individual time steps using the above criterion, we try to issue an overall PASS/FAIL. Now, an overall FAIL is issued if we get FAIL at ALL time steps between (5min, 10min). The use of "ALL" here is conservative, which Mark pointed out in a conversation a while ago. We don't want to use "ANY" here, because even in a run that is expected to pass, there can occasionally be a one-time-step FAIL just by chance. So, ideally we would like to have something between "ANY" and "ALL", but I didn't investigate that further. The use of "ALL" for the overall PASS/FAIL assessment is based on the reasoning that, since we have been liberal by using "ANY" for each single time step, in case there are true, large differences in the solution, the "ANY" at each single time step should be able to catch the differences even if they are just seen in one variable over land or ocean; on top of that, the "ALL" used for the time period of (5min, 10min) should tell us the persistence of the differences (i.e., true, large differences are expected to persist). (A minimal sketch of this two-level logic follows below.)
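A minimal sketch of the two-level logic described above, on hypothetical p-values (column names and numbers are invented for illustration; the real test works on the evv4esm t-test output):

import pandas as pd

# Hypothetical p-values: rows are time steps (seconds) inside the (5min, 10min)
# window, columns are variable/domain combinations
pvals = pd.DataFrame(
    {"T_land": [1e-4, 2e-3, 0.3], "T_ocean": [0.2, 1e-3, 0.4],
     "Q_land": [0.6, 0.7, 0.8],   "Q_ocean": [0.5, 0.1, 0.9]},
    index=[360, 480, 600],
)
p_threshold = 0.005

# Level 1: a time step FAILs if ANY variable rejects the null hypothesis
step_fails = (pvals < p_threshold).any(axis=1)    # 360: True, 480: True, 600: False

# Level 2: the overall test FAILs only if ALL time steps in the window fail
overall = "FAIL" if step_fails.all() else "PASS"  # "PASS" for this sample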

Hope this helps.

mkstratos commented 2 years ago

@huiwanpnnl Thanks, your response is very helpful in clearing up how to think about this! I think I understand what we're trying to do. In the example below, this would be a passing test, because part of the interval is above the threshold, correct?

[P_min plot for the example case]

The current version does this:

domains = (
    # "Reject" for rejection of null_hypothesis for each variable at each time, "Accept" if not
    ttest.applymap(lambda x: 'Reject' if x[1] < args.p_threshold else 'Accept')
    # Select only times in the inspection window
    .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')
    # If any timestep has any variables "Reject", then the domain [global, lnd, ocn] fails
    .applymap(lambda x: x == 'Reject').any().transform(
        lambda x: 'Fail' if x is True else 'Pass'
    )
)

resulting in:

delta_l2_global         Fail
delta_l2_land           Fail
delta_l2_ocean          Fail

The new version would be this:

domains = (
    # True for rejection of null_hypothesis for each variable at each time
    ttest.applymap(lambda x: x[1] < args.p_threshold)

    # Select only times in the inspection window
    .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')

    # Create groups of all variables at each time step in the window
    .groupby("seconds")

    # Are any variables failing at each time step in the inspection window?
    .any()

    # Then check if all time steps in the inspection window have a failing variable
    .all()

    # If _all_ time steps are failing then the domain [glob, lnd, ocn] is failing
    .transform(lambda x: "Fail" if x else "Pass")
)

which results in

delta_l2_global         Pass
delta_l2_land           Pass
delta_l2_ocean          Pass

huiwanpnnl commented 2 years ago

@mkstratos, is the plot showing a hypothetical case or is it from a real simulation?

Yes, given how the test was designed in the 2017 paper, this would be assigned an overall PASS, because only parts of the time steps between (5min, 10min) failed.

I must say I had not seen examples like this in my work. If this is a real case, then I would be interested in knowing what happened in this simulation to cause a persistent FAIL in the first ~400 seconds and persistent PASS later on.

mkstratos commented 2 years ago

@huiwanpnnl This is a real simulation; it's the F2010-CICE compset on ne4_oQU240, with clubb_c_k10h = 0.36. The full results of this are here: https://web.lcrc.anl.gov/public/e3sm/ac.mkelleher/evv/TSC_clubb_c_k10h_0p36

I've done the updates to EVV and re-run the test for clubb_c_k10h = [0.351, 0.36, 0.38] and the PASS / FAILS are as expected, I think: https://web.lcrc.anl.gov/public/e3sm/ac.mkelleher/evv/TSC_clubb_test (that is [PASS, FAIL, FAIL]). If that all looks good to you and @mt5555, I'll merge those changes and do a release for EVV

mt5555 commented 2 years ago

Looks great! I'd be happy with this version. But one question for @huiwanpnnl, since you wrote "So, ideally we would like to have something between "ANY" and "ALL", but I didn't investigate that further."

In the proposed code, @mkstratos is using an all() here (applied to times, not variables), meaning all times have to fail for the test to FAIL. Based on what you wrote above, until we investigate this further, wouldn't it be better to go with the more conservative any(), meaning if any time fails, the test is a FAIL?

huiwanpnnl commented 2 years ago

@mkstratos and @mt5555, this is a very interesting simulation. Based on the pmin plot and the full results @mkstratos pointed us to, I think this is a case in which the revised parameter clubb_c_k10h = 0.36 was initially "felt" by the test when the solution was still relatively linear and deterministic (hence the persistent FAIL in the first 400 seconds). Later on, nonlinearities, feedback mechanisms, or chaos introduced strong noise in the solution so the signal is lost (hence the persistent PASS afterwards). This does make sense to me. I had not seen examples like this in the past in my own work, but I believe similar behaviors have been seen by colleagues at ECMWF for a deep convection related parameter change.

huiwanpnnl commented 2 years ago

@mt5555, regarding ANY versus ALL:

I used to think ANY would be a bit too conservative, because by definition, we could have a single-time-step FAIL just by chance. The probability should be small, though.

Phil Rasch once suggested to fail a test if the majority of the time steps in (5min, 10min) failed, but we also felt the definition of "majority" would become a somewhat subjective matter.

This case of clubb_c_k10h = 0.36 we are seeing here makes me think in these borderline cases, the best we can do, probably, is to inspect the full results and perhaps also the MVK and PGN results, and then decide whether we want to bless the simulation.

So, after all this back and forth in my own mind, I'd support changing to the conservative "FAIL if ANY time step in (5min, 10min) fails". (Or, if we want to reduce the risk of false positives, use "ANY 2"? Just a thought, and I will not insist on that :-) )
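"ANY", "ANY 2", a majority rule, and "ALL" can all be viewed as a single count threshold on the per-time-step failures. A rough sketch, using a hypothetical Series of per-time-step FAIL flags like in the earlier examples:

import pandas as pd

# Hypothetical per-time-step FAIL flags for the (5min, 10min) window
step_fails = pd.Series([True, False, True, True], index=[360, 440, 520, 600])

def overall_verdict(step_fails: pd.Series, min_failing_steps: int) -> str:
    """FAIL if at least min_failing_steps time steps in the window fail."""
    return "FAIL" if step_fails.sum() >= min_failing_steps else "PASS"

overall_verdict(step_fails, 1)                # "ANY":   FAIL
overall_verdict(step_fails, 2)                # "ANY 2": FAIL
overall_verdict(step_fails, len(step_fails))  # "ALL":   PASS (one step passed)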

mt5555 commented 2 years ago

I like "ANY 2" - seems like a nice compromise for the situation where a single time fails due to some fluke. That might force @mkstratos to break up that python code into something more procedural and easier for the rest of us to parse :-)

mkstratos commented 2 years ago

Wait, are you saying this chain of DataFrame methods isn't perfectly understandable to .all()?? :-)

The new code looks like this:

domains = (
    # True for rejection of null_hypothesis for each variable at each time, by comparing
    # index 1 (x[1]) of each column of tuples, which corresponds to the p-value to the
    # threshold for p-values
    ttest.applymap(lambda x: x[1] < args.p_threshold)
    # Select only times in the inspection window
    .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')
    # Create groups of all variables at each time step in the window
    .groupby("seconds")
    # Are any variables failing at each time step in the inspection window?
    .any()
    # Since True -> 1 False -> 0, .sum() gets the number of timesteps for which
    # the null hypothesis is rejected
    .sum()
    # If two or more time steps are failing then the domain [glob, lnd, ocn] is failing
    .transform(lambda x: "Fail" if x >= 2 else "Pass")
)

which results in this output: https://web.lcrc.anl.gov/public/e3sm/ac.mkelleher/evv/TSC_clubb_test/ (i.e., all FAIL except for clubb_c_k10h = 0.351)

huiwanpnnl commented 2 years ago

Ah, so this way of coding is called the DataFrame methods. Good to know. And I must agree with @mt5555's prediction that @mkstratos's last code snippet, although still using the DataFrame methods I suppose, is a lot easier to follow - at least for me :-)

And it's great that we (you actually 👍) finally got this resolved and we managed to find a compromise that we all like. Thanks!

mkstratos commented 2 years ago

Reopening (accidentally auto-closed) pending further testing

mt5555 commented 2 years ago

Can I get an ETA on this fix?

mkstratos commented 2 years ago

The new cime environment has been deployed (https://github.com/E3SM-Project/e3sm-unified/pull/105#issuecomment-1195867861) on Anvil, Badger, Chrysalis, Cooley and Cori, so it should be fixed, unless you need to run this somewhere else, or find any issues in running the test.

mt5555 commented 2 years ago

great, thanks!