@rljacob : Is @mkstratos the correct POC for these tests?
@mt5555 I believe I am, my apologies for not acknowledging, but I'm working on this right now. This will have to be fixed in evv4esm (and maybe CIME). Yes, it seems that right now it fails for anything other than b4b, because of this: https://github.com/LIVVkit/evv4esm/blob/master/evv4esm/extensions/tsc.py#L286. It is also evaluating between 10s-600s rather than 300s-600s, and rejecting if there are any points below the threshold (which is set here as 0.005). The simple fix, I think, would be to change the `.any()` to `.all()`, but is that the more conservative approach? And is `P_THRESHOLD = 0.005` correct?
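For illustration, here is a minimal sketch (made-up numbers and names, not the actual evv4esm data structures) of how the `.any()` versus `.all()` choice plays out over the inspection window:

```python
# Illustrative only: contrast .any() vs .all() for per-time-step rejections
# inside a 300s-600s inspection window (data and names are hypothetical).
import pandas as pd

P_THRESHOLD = 0.005
# Hypothetical minimum p-values at a few output times (index is seconds)
pmin = pd.Series([0.001, 0.002, 0.050, 0.200, 0.300],
                 index=[120, 300, 420, 540, 600])

window = pmin.loc[300:600]       # keep only the 300s-600s inspection window
reject = window < P_THRESHOLD    # True where the null hypothesis is rejected

fail_if_any = reject.any()       # FAIL if any point in the window is below the threshold
fail_if_all = reject.all()       # FAIL only if every point in the window is below it
print(fail_if_any, fail_if_all)  # True False for this made-up series
```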
thanks for the update! This isn't critical - I just wanted to make sure it hadn't gotten lost.
I'd defer to @huiwanpnnl for the threshold. From the plots, I thought it was 0.8. But the label says that it is a %, so maybe that's 0.008 (similar to the 0.005 you mentioned).
I agree - using an `any()` would be the more conservative approach. But if I understood Hui correctly, the original design was an `all()`: all points in that range have to fail for the test to fail, and as long as there was at least one point above the line, the test would pass.
@mkstratos and @mt5555 this is likely an NCL -> Python translation error by me. Looking back at @huiwanpnnl's original NCL code, I believe it should be an `.all()`:

```
l_overall_fail = all(nfail(its:ite).ge.1)
overall_passfail = where( l_overall_fail, "FAIL", "PASS")
```
And my `.any()` in the Python goes back to when I first translated it.
Caveat: I'm not 100% sure I've connected the correct parts of the different code, as it's been 3 years since I've looked at this.
Also, in the NCL code I have, `P_THRESHOLD` is 0.005, or 0.5% -- it may be that in the last few years @huiwanpnnl has found 0.008 to be a better threshold value.
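For reference, a rough Python equivalent of the NCL snippet above might look like the following (a sketch only; `nfail`, `its`, and `ite` are taken from the NCL code, the values are made up, and NCL's inclusive index range becomes `its:ite + 1` in Python):

```python
# Rough, illustrative translation of the NCL logic above (not evv4esm code).
import numpy as np

nfail = np.array([0, 2, 3, 1, 0, 4])  # hypothetical per-time-step count of failing variables
its, ite = 1, 3                        # inclusive index range of the inspection window

# all(nfail(its:ite) .ge. 1): every time step in the window must have at least
# one failing variable for the overall result to be FAIL.
l_overall_fail = np.all(nfail[its:ite + 1] >= 1)
overall_passfail = "FAIL" if l_overall_fail else "PASS"
print(overall_passfail)  # "FAIL" here, since steps 1..3 all have nfail >= 1
```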
@jhkennedy: thanks for taking a look and the feedback. I took a careful look at the plot, which shows 0.005. I don't know where I got the 0.008 from, so we should stick with 0.005.
@mkstratos : do we need a PR to get this into e3sm?
In this case, all of the code changes are in evv4esm, so it will only take a PR there and an update to the cime conda environment to be deployed to the supported machines.
Just a note to say that I'm following this discussion. As soon as @mkstratos has a new `evv4esm` and `cime-env` PR, I'll be happy to deploy it.
@mkstratos have you been able to test this? I can test it with some of my test cases, but I'll probably wait until the cime conda is updated.
This (and another nbfb test) have been failing for many days (weeks?) now on Chrysalis. It looks like a python error. I'm going to make another issue for that.
@mt5555 I have been able to test, but only with some manufactured data. The new conda environment is probably easiest, but evv4esm can be installed on its own. The PR for the fix is here: https://github.com/LIVVkit/evv4esm/pull/9
@mt5555 This should be solved with the newest EVV, now in the cime environment, and ready for you to test
I just got a chance to test this with the latest E3SM master. It looks like the new version is still incorrect, and actually worse than before (since now all cases seem to pass, no matter how different).
Here are some results from experiments testing whether TSC can detect a parameter change.
clubb_c_k10h = 0.35 (default): test is a PASS. The pmin plot is as expected (results are BFB).
Then we increase the parameter to 0.351: the test is also a PASS. The plot looks good; data in the interval [5min,10min] are all above the threshold:
Increase to 0.36: test is a PASS. But all points in [5min,10min] are below the threshold, and this should be a FAIL.
Increase to 0.38: test is a PASS. This should be a FAIL.
Apologies @mt5555, I'm not sure why this isn't working. Is the model output saved somewhere that I could test?
I've been running on Anvil.
Here's the run:
/lcrc/group/acme/ac.mtaylor/scratch/anvil/TSC_P36x1.ne4_ne4.F2010-CICE.anvil_intel.C.20220629_230543_sksrqr
Link to EVV output for a run with 0.351, which is a PASS (as it should be)
Thank you @mt5555 for posting the pmin plots. @mkstratos, the 2nd plot posted above (for 0.351) also has a small issue: all the dots and lines above the dashed line are supposed to be in green. The other 3 plots look good. Perhaps this can provide an additional clue?
Here's the output of the 0.38 run, which is a PASS, but I think it should be a FAIL.
data on Anvil: /lcrc/group/acme/ac.mtaylor/scratch/anvil/TSC_P36x1.ne4_ne4.F2010-CICE.anvil_intel.C.20220701_164649_xq4n97
Agree with @mt5555. I consider this to be an unambiguous "FAIL" case. All plots look good. Only the test status shown in the table at the beginning of the "Results" section looks incorrect.
@mkstratos: just checking in on this - any progress or thoughts on fixes?
I'm still investigating, but I think the test is performing "correctly" despite the figures. The p-value figures show the minimum p-value of all variables over both ocean and land, but overall null hypothesis rejection happens iff all the variables at all times in the window reject the null hypothesis (p-value < 0.005). So if even one variable has an "Accept" in the window, it won't affect the minimum p-value, but it means the test passes.
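A toy illustration of that point (made-up numbers, not actual test output): the plotted minimum p-value can sit below the threshold even though the "all variables reject" criterion is not met.

```python
# Hypothetical p-values for two variables at one time step in the window.
import pandas as pd

P_THRESHOLD = 0.005
pvals = pd.Series({"T": 0.0001, "Q": 0.4})

print(pvals.min() < P_THRESHOLD)    # True: the minimum p-value is below the threshold
print((pvals < P_THRESHOLD).all())  # False: "Q" accepts, so this step does not reject overall
```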
The line here: https://github.com/LIVVkit/evv4esm/blob/master/evv4esm/extensions/tsc.py#L289

```python
domains = (
    null_hypothesis
    .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')
    .applymap(lambda x: x == 'Reject').any().transform(
        lambda x: 'Fail' if x is True else 'Pass'
    )
)
```

could be switched back from `.any()` to `.all()`, and overall rejection would happen if any variable fails.
The issue with the plotting is separate; all points after the first p-value below the threshold are plotted in red. I can change this so only points below the threshold are plotted in red, and those above are in green. I can also add plots of the p-value for each variable, which will show where the passes/failures are.
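A sketch of what that coloring change could look like (illustrative only, with made-up data; this is not the actual evv4esm plotting code):

```python
# Color each point by whether its own p-value is below the threshold, instead of
# coloring everything after the first failing point.
import numpy as np
import matplotlib.pyplot as plt

P_THRESHOLD = 0.005
seconds = np.array([120, 300, 420, 540, 600])    # hypothetical output times
pmin = np.array([0.001, 0.002, 0.05, 0.2, 0.3])  # hypothetical minimum p-values

below = pmin < P_THRESHOLD
fig, ax = plt.subplots()
ax.scatter(seconds[below], pmin[below], color="red", label="p < threshold")
ax.scatter(seconds[~below], pmin[~below], color="green", label="p >= threshold")
ax.axhline(P_THRESHOLD, color="black", linestyle="--")
ax.set_yscale("log")
ax.set_xlabel("seconds")
ax.set_ylabel("minimum p-value")
ax.legend()
plt.show()
```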
Based on the TSC paper, I believe the test is supposed to fail if p-value < 0.005 for any variable:
That's from page 544, where j is the index over different variables. link: https://gmd.copernicus.org/articles/10/537/2017/
@huiwanpnnl , can you confirm?
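Schematically, restating the sentence above in my own notation (this is not the paper's exact formula):

```latex
\text{overall result} =
\begin{cases}
\text{FAIL} & \text{if } P_j < P_{\text{threshold}} = 0.005 \text{ for any variable } j,\\
\text{PASS} & \text{otherwise.}
\end{cases}
```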
The TSC test is still failing but for a different reason. https://github.com/E3SM-Project/E3SM/issues/5072 .
@mkstratos, can you go ahead and make this change ASAP? It's clear from the TSC paper and the results presented in these CLUBB parameter tests that the correct thing to do is to FAIL if any point within the time interval is below the threshold. Since pmin is the minimum over all points, this means FAIL if any point in the pmin plot within the time interval is below the threshold.
The Python code is quite difficult (for me) to parse, so I won't comment on whether this can be obtained by just changing an "any()" to an "all()"!
Sorry for not having responded earlier (due to COVID then some urgent paperwork).
I won't comment on the python code either as I'm not good at the language. Instead let me provide some clarification on the intended algorithm.
The test is designed to assess PASS/FAIL at 2 levels:
Hope this helps.
@huiwanpnnl Thanks, your response is very helpful in clearing up how to think about this! I think I understand what we're trying to do: in the example below, this would be a passing test, because part of the interval is above the threshold, correct? The current version does this:
```python
domains = (
    # "Reject" for rejection of null_hypothesis for each variable at each time, "Accept" if not
    ttest.applymap(lambda x: 'Reject' if x[1] < args.p_threshold else 'Accept')
    # Select only times in the inspection window
    .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')
    # If any timestep has any variables "Reject", then the domain [global, lnd, ocn] fails
    .applymap(lambda x: x == 'Reject').any().transform(
        lambda x: 'Fail' if x is True else 'Pass'
    )
)
```
resulting in:

```
delta_l2_global    Fail
delta_l2_land      Fail
delta_l2_ocean     Fail
```
The new version would be this:

```python
domains = (
    # True for rejection of null_hypothesis for each variable at each time
    ttest.applymap(lambda x: x[1] < args.p_threshold)
    # Select only times in the inspection window
    .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')
    # Create groups of all variables at each time step in the window
    .groupby("seconds")
    # Are any variables failing at each time step in the inspection window?
    .any()
    # Then check if all time steps in the inspection window have a failing variable
    .all()
    # If _all_ time steps are failing then the domain [glob, lnd, ocn] is failing
    .transform(lambda x: "Fail" if x else "Pass")
)
```
which results in:

```
delta_l2_global    Pass
delta_l2_land      Pass
delta_l2_ocean     Pass
```
@mkstratos, is the plot showing a hypothetical case or is it from a real simulation?
Yes, given how the test was designed in the 2017 paper, this would be assigned an overall PASS, because only some of the time steps between (5min, 10min) failed.
I must say I had not seen examples like this in my work. If this is a real case, then I would be interested in knowing what happened in this simulation to cause a persistent FAIL in the first ~400 seconds and persistent PASS later on.
@huiwanpnnl This is a real simulation: it's the `F2010-CICE` compset on `ne4_oQU240`, with `clubb_c_k10h = 0.36`.
The full results of this are here: https://web.lcrc.anl.gov/public/e3sm/ac.mkelleher/evv/TSC_clubb_c_k10h_0p36
I've done the updates to EVV and re-run the test for clubb_c_k10h = [0.351, 0.36, 0.38], and the PASS/FAILs are as expected, I think: https://web.lcrc.anl.gov/public/e3sm/ac.mkelleher/evv/TSC_clubb_test (that is, [PASS, FAIL, FAIL]). If that all looks good to you and @mt5555, I'll merge those changes and do a release for EVV.
Looks great! I'd be happy with this version. But one question for @huiwanpnnl , since you wrote " So, ideally we would like to have something between "ANY" and "ALL", but I didn't investigate that further. "
In the proposed code, @mkstratos is using an `all()` here (applied to times, not variables), meaning all times have to fail for the test to FAIL. Based on what you wrote above, until we investigate this further, wouldn't it be better to go with the more conservative `any()`, meaning that if any time fails, the test is a FAIL?
@mkstratos and @mt5555, this is a very interesting simulation. Based on the pmin plot and the full results @mkstratos pointed us to, I think this is a case in which the revised parameter clubb_c_k10h = 0.36 was initially "felt" by the test when the solution was still relatively linear and deterministic (hence the persistent FAIL in the first 400 seconds). Later on, nonlinearities, feedback mechanisms, or chaos introduced strong noise in the solution so the signal is lost (hence the persistent PASS afterwards). This does make sense to me. I had not seen examples like this in the past in my own work, but I believe similar behaviors have been seen by colleagues at ECMWF for a deep convection related parameter change.
@mt5555, regarding ANY versus ALL:
I used to think ANY would be a bit too conservative, because by definition, we could have a single-time-step FAIL just by chance. The probability should be small, though.
Phil Rasch once suggested failing a test if the majority of the time steps in (5min, 10min) failed, but we also felt the definition of "majority" would become a somewhat subjective matter.
The case of clubb_c_k10h = 0.36 we are seeing here makes me think that in these borderline cases, the best we can do, probably, is to inspect the full results and perhaps also the MVK and PGN results, and then decide whether we want to bless the simulation.
So, after all this back and forth in my own mind, I'd support changing to the conservative "FAIL if ANY time step in (5min, 10min) fails". (Or, if we want to reduce the risk of false positives, use "ANY 2"? Just a thought, and I will not insist on that :-) )
I like "ANY 2" - seems like a nice compromise for the situation where a single time step fails due to some fluke. That might force @mkstratos to break up that Python code into something more procedural and easier for the rest of us to parse :-)
Wait, are you saying this chain of DataFrame methods isn't perfectly understandable to `.all()`?? :-)
The new code looks like this:
```python
domains = (
    # True for rejection of null_hypothesis for each variable at each time, by comparing
    # index 1 (x[1]) of each column of tuples, which corresponds to the p-value, to the
    # threshold for p-values
    ttest.applymap(lambda x: x[1] < args.p_threshold)
    # Select only times in the inspection window
    .query(' seconds >= @args.inspect_times[0] & seconds <= @args.inspect_times[-1]')
    # Create groups of all variables at each time step in the window
    .groupby("seconds")
    # Are any variables failing at each time step in the inspection window?
    .any()
    # Since True -> 1 and False -> 0, .sum() gets the number of timesteps for which
    # the null hypothesis is rejected
    .sum()
    # If two or more time steps are failing then the domain [glob, lnd, ocn] is failing
    .transform(lambda x: "Fail" if x >= 2 else "Pass")
)
```
which results in this output: https://web.lcrc.anl.gov/public/e3sm/ac.mkelleher/evv/TSC_clubb_test/ (i.e. all FAIL except for clubb_c_k10h = 0.351).
Ah, so this way of coding is called DataFrame methods. Good to know. And I must agree with @mt5555's prediction that @mkstratos's last code snippet, although still using DataFrame methods I suppose, is a lot easier to follow - at least for me :-)
And it's great that we (you actually 👍) finally got this resolved and we managed to find a compromise that we all like. Thanks!
Reopening (accidentally auto-closed) pending further testing
Can I get an ETA on this fix?
The new cime environment has been deployed (https://github.com/E3SM-Project/e3sm-unified/pull/105#issuecomment-1195867861) on Anvil, Badger, Chrysalis, Cooley and Cori, so it should be fixed, unless you need to run this somewhere else, or find any issues in running the test.
great, thanks!
The TSC NBFB test, as implemented in E3SM, is using a very restrictive PASS criterion, and it appears that any non-BFB change will fail. For details, see:
https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/3209166849/NBFB+test+case+study
@huiwanpnnl's original design was that the test should PASS unless all points in the interval [5min,10min] in the P_min(%) plot are below the ~0.8 fail threshold. The current PASS/FAIL criterion is not using this metric.
How about a more conservative PASS criterion, which would be that the test should PASS if no points in the interval [5min,10min] in the P_min(%) plot are below the ~0.8 fail threshold.
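As a sketch of that proposed criterion (illustrative only, with made-up data; not the current evv4esm implementation):

```python
# PASS only if no P_min(%) point inside [5min, 10min] is below the fail threshold.
import pandas as pd

FAIL_THRESHOLD = 0.8  # percent, per the ~0.8 threshold mentioned above
# Hypothetical P_min(%) values, indexed by time in seconds
pmin_pct = pd.Series([0.2, 0.5, 1.5, 2.0], index=[300, 420, 540, 600])

window = pmin_pct.loc[300:600]
overall = "PASS" if (window >= FAIL_THRESHOLD).all() else "FAIL"
print(overall)  # "FAIL" here, because 0.2% and 0.5% are below the threshold
```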
Pinging @mkstratos and @salilmahajan as owners of the task to implement the PASS/FAIL criterion in the test.