E3SM-Project / E3SM

Simulations failed due to unrealistically high soil temperature #5803

Open jingtao-lbl opened 1 year ago

jingtao-lbl commented 1 year ago

Got an error (below) due to unrealistically high soil temperature in May during AD spinup mode. I'm using the latest version (fc9f903ccd) on pm-cpu.

138: lnd2atm_vars%t_soisno_grc(g, 1) is 401.120579771595
138: ENDRUN:
138: lnd2atm ERROR: lnd2atm_vars%t_soisno_grc > 400 Kelvin degree.ERROR in lnd2atmMod.F90 at line 468

This error occurred for simulations using both GSWP3v1 and CRUNCEP. I don't think it is related to the climate forcing, since this problem did not arise in previous versions of the model when using these same forcings. Are there any recent changes to the model that might have caused this problem?

[image: https://user-images.githubusercontent.com/48232131/252848847-00502ec2-7752-4784-a4ca-21548f5263c4.png]
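
For reference, the abort above comes from a sanity check in lnd2atmMod.F90 (the file named in the traceback) that ends the run when the top-layer soil temperature passed from land to atmosphere exceeds 400 K. Below is a minimal standalone Fortran sketch of that kind of check, not the actual E3SM source; t_soisno_top is a stand-in for lnd2atm_vars%t_soisno_grc(g, 1):

! Minimal sketch (not the E3SM code) of the range check that triggers the abort.
program t_soisno_check_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: t_soisno_top   ! stand-in for lnd2atm_vars%t_soisno_grc(g, 1)

  t_soisno_top = 401.120579771595_r8   ! value reported in the failing run

  if (t_soisno_top > 400._r8) then
     write(*,*) 'lnd2atm_vars%t_soisno_grc(g, 1) is ', t_soisno_top
     ! E3SM calls endrun() here; this sketch just stops with a similar message
     stop 'lnd2atm ERROR: lnd2atm_vars%t_soisno_grc > 400 Kelvin degree.'
  end if
end program t_soisno_check_sketch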

jinyuntang commented 1 year ago

Qing Zhu ran into this problem in recent BGC experiments. It is probably a legacy bug that shows up when the surface energy balance code is driven into a corner case that leads to a runaway divergence.

jingtao-lbl commented 1 year ago

Some updates regarding this issue. I have tested global runs with compset 1850_DATM%CRU_ELM%CNPECACNTBC (I1850CRUCNPECACNTBC) at two resolutions, r05_r05 and f19_g16. The f19_g16 simulation ran normally without any problems, but the r05_r05 run stopped due to the high soil temperature issue.

No customized domains were used for these two simulations, and everything (e.g., fsurdat, paramfile) is left at the default. The two simulations use the same forcing, and both were built with intel on pm-cpu. The model version is fc9f903ccd.

Also, the errors occur at different locations when repeating the same 0.5x0.5 deg simulation, as shown below. [two images attached]

jingtao-lbl commented 1 year ago

Recent tests show that global runs at 0.5x0.5 deg resolution all fail due to this high ground temperature problem with the current master version, no matter which compset is used (including CNPECACNTBC, CNPRDCTCBC, and BGC-FATES). The grid cells that hit the problem differ between forcings (e.g., GSWP3v1 vs. CRUNCEP_qianFill), and they also change location when a different NPROCS is used.

However, the same simulations at other resolutions, e.g., 1.9x2.5 or 4x5, work fine.

@rljacob @bishtgautam @peterdschwartz @glemieux

rljacob commented 1 year ago

Does this only happen on pm-cpu? Might try the same case on another platform to see if it's a compiler issue.

jingtao-lbl commented 1 year ago

> Does this only happen on pm-cpu? Might try the same case on another platform to see if it's a compiler issue.

Thank you! Yes, I only tested it on pm-cpu. Jess helped me test it with gnu and she also got the error. Will try it on LRC.

rljacob commented 1 year ago

I checked our testing and we do run tests at r05 but for 5 days or less and with only a few BGC options. You said a previous version ran fine. What version was it exactly? The git hash of the code that ran would be best.

rljacob commented 1 year ago

These tests are passing fine: [test list attached]

jingtao-lbl commented 1 year ago

Oh, great to know these tests are passing fine! How long did the test run last? I usually got the error after a couple of months, and Jess said the error popped up immediately for her simulation (when using FATES). Would you mind sharing the script for SMS_Ld2.ne30pg2_r05_EC30to60E2r2.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp here? Thank you so much!

rljacob commented 1 year ago

By default the tests run for 5 days, but "Ld2" in the name above means run for 2 days. We don't use run scripts for the tests; everything is done with a single "create_test" command. Go to E3SM/cime/scripts and type "./create_test SMS_Ld2.ne30pg2_r05_EC30to60E2r2.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp", replacing the machine/compiler string as needed.
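
A minimal command-line sketch of that workflow; the chrysalis_intel substitution below is only a hypothetical example of another machine_compiler string:

cd E3SM/cime/scripts
./create_test SMS_Ld2.ne30pg2_r05_EC30to60E2r2.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp
# same test on a different machine/compiler: swap the machine_compiler token, e.g.
# ./create_test SMS_Ld2.ne30pg2_r05_EC30to60E2r2.BGCEXP_CNTL_CNPECACNT_1850.chrysalis_intel.elm-bgcexp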

jingtao-lbl commented 1 year ago

> I checked our testing and we do run tests at r05 but for 5 days or less and with only a few BGC options. You said a previous version ran fine. What version was it exactly? The git hash of the code that ran would be best.

I found that my version is quite old, and I used to run it on Cori without any problem. But when I bring it to Perlmutter, there are some dependency problems during compiling...

jingtao-lbl commented 1 year ago

> By default the tests run for 5 days, but "Ld2" in the name above means run for 2 days. We don't use run scripts for the tests; everything is done with a single "create_test" command. Go to E3SM/cime/scripts and type "./create_test SMS_Ld2.ne30pg2_r05_EC30to60E2r2.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp", replacing the machine/compiler string as needed.

Thank you! I will check the test and see whether it still passes when run a bit longer. Will keep you posted!