ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Going to a CISM active test for a processor change test (PEM), causes answers to change... #2542

Open ekluzek opened 4 months ago

ekluzek commented 4 months ago

Brief summary of bug

With ctsm5.2.0 we discovered that we didn't have enough testing that corresponded to CESM or CAM testing. CESM testing is always done with CISM active, so I changed some tests in #2501 from I1850Clm60BgcCrop to I1850Clm60BgcCropG. However, the PEM test with CISM active then fails its processor-change comparison (see the details below).

General bug information

CTSM version you are using: ctsm5.2.004-31-ga09d22376

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: With CISM active

Details of bug

PEM_D_Ld9.ne30pg3_t232.I1850Clm60BgcCrop.derecho_intel.clm-clm60cam6LndTuningMode passes, but PEM_D_Ld9.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam6LndTuningMode fails in the comparison between different processor counts:

FAIL PEM_D_Ld9.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam6LndTuningMode COMPARE_base_modpes

Important details of your setup / configuration so we can reproduce the bug

In the test list there are PEM and ERP tests for glc* testmods that carry this comment:

cism is not answer preserving across processor changes, but short test length should be ok

Those tests run for 5 to 10 days. However, many of them are at f10, and the highest resolution is f19, which runs for 5 days.

ekluzek commented 4 months ago

The test still fails at 3 days, which is about the shortest length I think we should try...

ekluzek commented 4 months ago

I talked to @Katetc about this after the CSEG meeting. She said that this is a traditional MPI global-sum issue, which has been solved in other places and so should be relatively easy to fix.

When I asked her to confirm the timeline, she sent me an email saying that they will work on this relatively soon.
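
As an illustration of the kind of global-sum issue described above, here is a minimal sketch in plain Python. It is not CISM's code and does not use MPI: the functions `naive_sum`, `decomposed_sum`, and `exact_decomposed_sum` are hypothetical names made up for this example, and the "ranks" are just contiguous slices of a list. It shows that floating-point partial sums depend on how the data is split, and that one possible remedy (assumed here, not necessarily the one CISM will adopt) is to accumulate exactly and round only once at the end.

```python
from fractions import Fraction


def naive_sum(values):
    """Left-to-right floating-point sum (explicit loop, no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def decomposed_sum(values, n_ranks):
    """Mimic a global sum: each hypothetical 'rank' sums one contiguous
    block in floating point, then the partial sums are combined."""
    size = -(-len(values) // n_ranks)  # ceiling division
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)


def exact_decomposed_sum(values, n_ranks):
    """Same decomposition, but partial sums are kept as exact rationals
    and rounded to a float only once, so the result cannot depend on
    the number of blocks."""
    size = -(-len(values) // n_ranks)
    partials = [sum((Fraction(v) for v in values[i:i + size]), Fraction(0))
                for i in range(0, len(values), size)]
    return float(sum(partials, Fraction(0)))


# Illustrative data only: large and small magnitudes that cancel.
values = [1e17, 1.0, -1e17, 1.0]

for n in (1, 2):
    print(n, decomposed_sum(values, n), exact_decomposed_sum(values, n))
```

With these values, the floating-point result changes between one and two blocks, while the exact version gives the same answer for both. Production codes typically get the same effect more cheaply, for example with fixed-point or extended-precision accumulators, rather than with rational arithmetic.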

samsrabin commented 3 months ago

On ctsm5.2.005, I'm getting a failure in the same step for

PEM_D_Ld9.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam6LndTuningMode

Should this be marked as an expected fail? I see that a slightly different test (3 days instead of 9) named

PEM_D_Ld3.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam6LndTuningMode

is present in the expected fail list (and points to this issue), but that's not actually in the test list.

ekluzek commented 3 months ago

Yes, we should correct the expected-fail entry so that it matches the test in the test list. I think @slevis-lmwg did this in 006, though.

ekluzek commented 2 months ago

I ran into this again while working on ctsm5.2.009, because of a change in the test mod used.

However, I verified that the following test also fails in ctsm5.2.008:

PEM_D_Ld9.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel

ekluzek commented 2 months ago

See this comment: https://github.com/ESCOMP/CTSM/pull/2632#issuecomment-2217988993