E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Model crash at ne30 -- T above 500k (likely fixed with macmic=12) #2381

Closed · golaz closed this issue 1 year ago

golaz commented 1 year ago

My latest simulation with EAMxx at ne30 crashed a little over one year in. I have performed a number of ne30 simulations before, none of which failed. I've verified that the crash is reproducible.

The model fails because of an excessively large temperature (514 K) in BC, Canada, on April 1.

From atm.log

Atmosphere step = 21839
  model time = 0002-03-31 23:30:00

[EAMxx::output_manager] - Writing model output:
[EAMxx::output_manager]      CASE: 20230607.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.scream.monthly
[EAMxx::output_manager]      FILE: 20230607.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.scream.monthly.AVERAGE.nmonths_x1.0002-03-01-00000.nc
Atmosphere step = 21840
  model time = 0002-04-01 00:00:00

From e3sm.log

 858: terminate called after throwing an instance of 'std::logic_error'
 858:   what():  /gpfs/fs1/home/ac.golaz/E3SM/EAMxx/code/20230607/components/eamxx/src/share/atm_process/atmosphere_process.cpp:442: FAIL:
 858: false
 858: Error! Failed post-condition property check (cannot be repaired).
 858:   - Atmosphere process name: shoc
 858:   - Property check name: T_mid within interval [100, 500]
 858:   - Atmosphere process MPI Rank: 858
 858:   - Message: Check failed.
 858:   - check name: T_mid within interval [100, 500]
 858:   - field id: T_mid[Physics PG2] <double:ncol,lev>(12,128) [K]
 858:   - minimum:
 858:     - value: 211.163
 858:     - entry: (20541,60)
 858:     - lat/lon: (56.7783, 238.776)
 858:   - maximum:
 858:     - value: 514.47
 858:     - entry: (20663,127)
 858:     - lat/lon: (54.7812, 232.505)
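The failure above is a post-condition property check: after SHOC runs, every T_mid value must lie in [100, 500] K, and the check reports the field's extremes with their locations. As a rough illustration (not EAMxx's actual C++ implementation; the function name and shapes here are hypothetical), the check amounts to:

```python
import numpy as np

def check_interval(field, name, lo, hi):
    """Minimal sketch of a field-bounds property check: verify every
    value of `field` lies in [lo, hi] and report the extremes."""
    fmin, fmax = field.min(), field.max()
    imin = np.unravel_index(field.argmin(), field.shape)
    imax = np.unravel_index(field.argmax(), field.shape)
    ok = (lo <= fmin) and (fmax <= hi)
    msg = (f"{name} within interval [{lo}, {hi}]: "
           f"min {fmin:.3f} at {imin}, max {fmax:.3f} at {imax}")
    return ok, msg

# Example: a (ncol, nlev) temperature block with one runaway value,
# mimicking the 514.47 K near-surface entry in the log above.
T_mid = np.full((12, 128), 280.0)
T_mid[3, 127] = 514.47
ok, msg = check_interval(T_mid, "T_mid", 100.0, 500.0)
print(ok, msg)  # False; max reported at (3, 127)
```

In the real model this check is marked "cannot be repaired", so a violation throws and aborts the run rather than clipping the field.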

Simulation output is available on chrysalis: /lcrc/group/e3sm/ac.golaz/E3SM/EAMxx/20230607.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis

Run script is attached: run_20230607.F2010-SCREAMv1.ne30.chrysalis.sh.txt

oksanaguba commented 1 year ago

I have been slowly running ne30 v1 with 128 levels on Frontier; it is now in its 8th year.

ndkeen commented 1 year ago

I have also been able to run multiple years of ne30 on pm-gpu and frontier. I don't see anything in your launch script that is different from what I'm trying. We will have to see if the issue is with the repo being used or the machine/compiler. I will try on chrysalis with today's scream.

It seems to be running OK. /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s12-jun14/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig is a case that uses today's scream but is otherwise the same setup as yours, and it is at model date = 00020503.

Looking at the hashes, I see the 2 runs diverge here:

<    0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
>    0: bfbhash>          17820 d3eb6bb8a5a2d62d (Hommexx)

which is day 371.25 (dividing the step number by 48)
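The day number follows from the 30-minute atmosphere timestep: 48 steps per model day. A small sketch of that conversion (helper name is mine, not part of the model):

```python
STEPS_PER_DAY = 48  # 30-minute atmosphere timestep => 48 steps per model day

def step_to_day(step):
    """Convert an atmosphere step number to a model day (day 0 = run start)."""
    return step / STEPS_PER_DAY

print(step_to_day(17820))  # 371.25, i.e. early January of year 2
print(step_to_day(21840))  # 455.0, i.e. April 1 of year 2 (365 + 90 days)
```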

I verified that the checkout I'm using on 6/14 (5c131ca120) has quite a few diffs compared to the 6/7 repo of CG (5dc44fe07487fc83a1bb7c2ad5ce5f6203386b9f). If this is a case of something having changed in the repo, the potential PRs are:

5c131ca120 2023-06-13 12:44:28 -0600 Autotester for E.. Merge Pull Request #2379 from E3SM-Project/scream/tcclevenger/update_compute_sanitizer_tests
afb57c8f82 2023-06-12 13:09:53 -0600 Autotester for E.. Merge Pull Request #2375 from E3SM-Project/scream/jgfouca/eamxx_buildnml_enhance
a38f7a5c99 2023-06-09 08:39:23 -0600 Autotester for E.. Merge Pull Request #2370 from E3SM-Project/scream/jeff-cohere/mam4-nucleation
6273dc31cf 2023-06-08 21:51:09 -0600 Luca Bertagna      Merge pull request #2371 from E3SM-Project/jgfouca/add_chrysalis_setup
6e98c22358 2023-06-08 17:37:45 -0600 Autotester for E.. Merge Pull Request #2369 from E3SM-Project/scream/tcclevenger/fix_cudamemcheck_nudging_fail
b88645876a 2023-06-08 15:53:10 -0600 Autotester for E.. Merge Pull Request #2367 from E3SM-Project/scream/bartgol/allow-coarsening-of-strided-vars
1084bdc66d 2023-06-08 10:22:41 -0600 Autotester for E.. Merge Pull Request #2365 from E3SM-Project/scream/tcclevenger/fix_initcheck_compute_sanitizer_fails
4156717c1d 2023-06-08 08:39:29 -0600 Autotester for E.. Merge Pull Request #2348 from E3SM-Project/scream/aarondonahue/force_sfc_export_from_file_v2

I also repeated the case using the same 6/7 repo of CG and see the same crash and hash differences.

ambrad commented 1 year ago

What is the PE layout used in this run? I think the default on Chrysalis is 2 threads/rank for testing. I believe there might be a threading issue in SHOC's interface. In any case, if you're not already running with 64x1, I recommend doing so.

ndkeen commented 1 year ago

These were both 1800x1

golaz commented 1 year ago

The divergence occurs after time step 17802 (model time = 0002-01-06 21:00:00) and before 17820 (model time = 0002-01-07 06:00:00). So this is during the second year and was reproduced in my rerun starting 0002-01-01.

bfbhash for my simulation (left) and Noel's (right):

[screenshot omitted: Screenshot_2023-06-14_11-48-20, side-by-side bfbhash listings]

ambrad commented 1 year ago

@ndkeen we're a day or two away from getting the Hommexx diagnostic hasher (just merged to upstream E3SM this morning) into SCREAM to complement the EAMxx diagnostic hasher. These differ from the above hasher in that they emit hashes after every process in the AD and after every boundary exchange in the dycore. I recommend we wait for those to go in, then rerun your setup with instructions that I post to this issue to enable the diagnostic hashers.

gsever commented 1 year ago

Hello E3SM team,

I am running the SCREAM model on ALCF’s ThetaGPU. Recently, I have encountered an error using the ne30pg2 configuration with screami_ne30np4l128_ifs-20160801_20220909.nc file as initial conditions. The error reads as Bad dphi, dp3d, or vtheta_dp; label: 'DIRK Newton loop np1'; see hommexx.errlog.8.1

Issue #2084 mentions a similar problem (for ne120), and a comment there says the 128-vertical-level case is not suitable for low-resolution setups (despite the successful attempts reported here).

Are you using the model-default L128 init file, screami_ne30np4L128_20221004.nc, here? If so, what is the actual initialization datetime of that file, and which dataset does it originate from? I can run the model fine (well, ~7 months to be more accurate, without a wind-related CFL error) with the default L72 init file, screami_ne30np4L72_20220823.nc.

I have access to LCRC but am not part of the e3sm group, so I cannot see the details of the original run.

Thanks,

ndkeen commented 1 year ago

I checked out several hashes between 6/7 and 6/14 and tried the same thing as above. All of them failed in the same way as described above. That includes a hash that was the same as the one I had used for a successful 2-year run. So I ran the same script in the same checkout again, and it crashes as well. I went back to at least 5/24, and while I had to change the output (some streams we were using are new), the code crashes at the same 1-year boundary. Pretty odd.

ndkeen commented 1 year ago

@gsever it might be better to make a new issue with your specific problem -- include the date of repo and launch script.

ndkeen commented 1 year ago

The cases I've been running are here: /lcrc/group/e3sm/ac.ndkeen/scratch/chrys

Here are the dirs:

drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 15 11:02 b2023-04-20-PR2291-82f215a652/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 15 14:35 b2023-04-24-PR2297-709c430968/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 15 12:20 b2023-05-04-PR2309-de719afa37/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 15 13:04 b2023-05-08-PR2321-06a178c79a/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 22:42 b2023-05-24-PR2349-0895d60dbf/
drwxr-sr-x   5 ac.ndkeen E3SM    4096 Jun 15 00:09 b2023-06-01-PR2327-355a4370f1/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 18:23 b2023-06-05-PR2358-d9602e9f16/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 17:04 b2023-06-08-PR2348-4156717c1d/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 18:12 b2023-06-08-PR2365-1084bdc66d/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 17:35 b2023-06-08-PR2367-b88645876a/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 19:18 b2023-06-08-PR2369-6e98c22358/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 17:25 b2023-06-08-PR2371-6273dc31cf/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 18:13 b2023-06-09-PR2370-a38f7a5c99/
drwxr-sr-x   3 ac.ndkeen E3SM    4096 Jun 14 19:59 b2023-06-12-PR2375-afb57c8f82/
drwxr-sr-x   6 ac.ndkeen E3SM    4096 Jun 15 10:59 b2023-06-13-PR2379-5c131ca120/
ambrad commented 1 year ago

Thanks @ndkeen for these runs. Here's a bit of analysis.

I ran

(for i in `ls|grep b2023`; do echo ">>> $i"; for j in `find $i -name e3sm.log\*`; do echo ">> $j"; zgrep bfbhash $j; done; done) > ~/tmp/bfb2023.txt

and then, once I saw the output,

grep ">>> \|>> \|bfbhash>\s*21834\|bfbhash>\s*21654\|bfbhash>\s*17820" bfb2023.txt

This procedure gives the following:

>>> b2023-04-20-PR2291-82f215a652
>> b2023-04-20-PR2291-82f215a652/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.343227.230615-123150.gz
>>> b2023-04-24-PR2297-709c430968
>>> b2023-05-04-PR2309-de719afa37
>> b2023-05-04-PR2309-de719afa37/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.343298.230615-131750
>>> b2023-05-08-PR2321-06a178c79a
>> b2023-05-08-PR2321-06a178c79a/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.343320.230615-134607
>>> b2023-05-24-PR2349-0895d60dbf
>> b2023-05-24-PR2349-0895d60dbf/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig.old/run/e3sm.log.342877.230614-232247
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-01-PR2327-355a4370f1
>> b2023-06-01-PR2327-355a4370f1/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342672.230614-204014
>> b2023-06-01-PR2327-355a4370f1/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig.old/run/e3sm.log.342951.230615-005246
>>> b2023-06-05-PR2358-d9602e9f16
>> b2023-06-05-PR2358-d9602e9f16/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342668.230614-203845
>>> b2023-06-08-PR2348-4156717c1d
>> b2023-06-08-PR2348-4156717c1d/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342603.230614-180418
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-08-PR2365-1084bdc66d
>> b2023-06-08-PR2365-1084bdc66d/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342599.230614-163546
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-08-PR2367-b88645876a
>> b2023-06-08-PR2367-b88645876a/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342633.230614-194215
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-08-PR2369-6e98c22358
>> b2023-06-08-PR2369-6e98c22358/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342740.230614-204314
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-08-PR2371-6273dc31cf
>> b2023-06-08-PR2371-6273dc31cf/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342630.230614-190045
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-09-PR2370-a38f7a5c99
>> b2023-06-09-PR2370-a38f7a5c99/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342595.230614-163244
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-12-PR2375-afb57c8f82
>> b2023-06-12-PR2375-afb57c8f82/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342607.230614-181045
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>>> b2023-06-13-PR2379-5c131ca120
>> b2023-06-13-PR2379-5c131ca120/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig/run/e3sm.log.342666.230614-194845
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>> b2023-06-13-PR2379-5c131ca120/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig.noyaml/run/e3sm.log.343226.230615-115317
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
>> b2023-06-13-PR2379-5c131ca120/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.origb/run/e3sm.log.342913.230615-001146
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)
>> b2023-06-13-PR2379-5c131ca120/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.origc/run/e3sm.log.343198.230615-111819
   0: bfbhash>          17820 d2cde00a18981e7b (Hommexx)
   0: bfbhash>          21654 3307b987a0346b95 (Hommexx)
   0: bfbhash>          21834 c0a417f93b6c9941 (Hommexx)

The preliminary conclusion is that, for very roughly three weeks, the repo has reliably reproduced the crash Chris reported. The outlier seems to be the run Noel reports above in /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s12-jun14/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.orig, which diffs from Chris's run and the b2023 runs at time step 17820.

ndkeen commented 1 year ago

I can reproduce the crash on pm-cpu with intel.

It occurs much sooner in the simulation:

   0: bfbhash>           2016 8ab4d9622375307b (Hommexx)
 814: terminate called after throwing an instance of 'std::logic_error'
 814:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se68-jun15/components/eamxx/src/share/atm_process/atmosphere_process.cpp:442: FAIL:
 814: false
 814: Error! Failed post-condition property check (cannot be repaired).
 814:   - Atmosphere process name: shoc
 814:   - Property check name: T_mid within interval [100, 500]
 814:   - Atmosphere process MPI Rank: 814
 814:   - Message: Check failed.
 814:   - check name: T_mid within interval [100, 500]
 814:   - field id: T_mid[Physics PG2] <double:ncol,lev>(12,128) [K]
 814:   - minimum:
 814:     - value: 214.744
 814:     - entry: (18740,9)
 814:     - lat/lon: (53.03, 307.773)
 814:   - maximum:
 814:     - value: 515.048
 814:     - entry: (18736,127)
 814:     - lat/lon: (50.9276, 304.606)

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.orig

On chrysalis, if we can believe that my runs with a given repo are reproducible, then I think I can narrow it down to between:

runs 2 years with 
5/8  b2023-05-08-PR2321-06a178c79a

crashes with:
5/10 b2023-05-10-PR2193-270f4efc76

This means the commit causing the issue is: 270f4efc76 2023-05-10 09:44:15 -0700 AaronDonahue Merge pull request #2193 from E3SM-Project/oksanaguba/eamxx/wetdry

ambrad commented 1 year ago

The wet-dry change certainly modified answers, so the PR Noel isolated makes sense.

oksanaguba commented 1 year ago

i am not following this thread and it may not be relevant, but wet-dry had a bug in geopotential in p3 (found by Peter B.), so it would lead to unphysical tendencies.

ndkeen commented 1 year ago

Maybe 81a9193ed8 2023-06-06 05:56:54 -0700 AaronDonahue Merge pull request #2356 from E3SM-Project/bogensch/p3_dz_fix

ambrad commented 1 year ago

No, p3_dz_fix is for F90 only.

ambrad commented 1 year ago

i am not following this thread and it may not be relevant, but wet-dry had a bug in geopotential in p3 (found by Peter B.), so it would lead to unphysical tendencies.

Just to be clear, the issue arises after, not before, this PR. Not saying the PR has a bug, just clarifying.

ndkeen commented 1 year ago

I verified that I see the same issue on pm-cpu with intel -- i.e., a run using a hash from before the commit noted above runs OK, but with 270f4efc76 2023-05-10 09:44:15 -0700 AaronDonahue Merge pull request #2193 from E3SM-Project/oksanaguba/eamxx/wetdry it crashes with high T.

case with repo before PR 2193 (runs to at least model date =   00020928)
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/b2023-05-08-PR2321-06a178c79a/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n015

and case with PR 2193: (crashes at model date =   00010212)
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/b2023-05-10-PR2193-270f4efc76/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n015

https://github.com/E3SM-Project/scream/pull/2193

ndkeen commented 1 year ago

Peter C had suggested we get a restart closer to the crash, increase the level of output, and run again. On pm-cpu it crashes fairly quickly, but I did get a restart after 1 month here:

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.orig

I just don't know what set of outputs to use that would be helpful.

@crterai first gave me some outputs to write more data hourly, and I've run this on pm-cpu with intel here: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n015.extraout00

And then Chris provided another set of changes to get hourly tendencies, which I've just submitted. This adds:

./atmchange physics::mac_aero_mic::shoc::compute_tendencies=T_mid,qv
./atmchange physics::mac_aero_mic::p3::compute_tendencies=T_mid,qv
./atmchange physics::rrtmgp::compute_tendencies=T_mid
./atmchange homme::compute_tendencies=T_mid,qv

and use:
/pscratch/sd/t/terai/EAMxx/shareWNoel/scream_output.hourly_tendency.20230622.yaml

That case has completed (fails as expected) here: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n015.hourlyTend

@crterai

ndkeen commented 1 year ago

I commented above regarding finer output. I also wanted to note that changing the PE layout seems to cause the case to behave differently. The original case on chrysalis reported by Chris G was 1800x1. On pm-cpu, I've been repeating that 1800x1 configuration, which uses 15 nodes (at 128 MPI ranks per node).

Running with 8 nodes allowed me to run at least 2 years without issue on pm-cpu: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n008

Also with 1800x1, I ran with GNU compiler on pm-cpu and it does not crash at the same place (or has not crashed at all yet). This case has about 3 months: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n015.extraout00

And then running on frontier, I have run quite a while without issues. Here I have at least 2 years /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun19/t.maf-jun19.F2010-SCREAMv1.ne30pg2_ne30pg2.frontier-scream-gpu.n011t8x6.vth200.od I will try specifically 1800x1.

ambrad commented 1 year ago

Since you're now seeing what seems to be PE-layout-related non-BFBness, you could use the diagnostic-level hashing to determine the culprit:

./atmchange --all internal_diagnostics_level=1 atmosphere_processes::internal_diagnostics_level=0

This command will produce a large number of hxxhash> lines (for Hommexx internals) in addition to exxhash> lines (for EAMxx AD processes). If you want to disable the Hommexx hashing, add one more argument:

./atmchange --all internal_diagnostics_level=1 atmosphere_processes::internal_diagnostics_level=0 ctl_nl::internal_diagnostics_level=0

In any case, you can grep an e3sm.log file with exxhash to get AD lines only, hxxhash for Homme only, or xxhash to get them all.

One warning about the hxxhash> lines: In the first few dynamics time steps of a run, some lines correspond to hashes of uninitialized memory. Thus, disregard hxxhash> diffs in the first physics time step. exxhash> lines, in contrast, should always be legitimate.

One other thing: The hxxhash> capability was enabled in the EAMxx YAML in a PR that was merged just this morning, so you'll need to update your repo if you want Hommexx internal diagnostics.

ndkeen commented 1 year ago

Hmm, OK. I ran 3 tests on pm-cpu for 2 months each:

1024 MPIs (completed)
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n008.od.idl0

1800 MPIs
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n015.od.idl0

2700 MPIs
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se68-jun15/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.od.idl0

And then a similar set is in /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se69-jun22 with today's scream master. Unfortunately, with this newer repo, the 15-node 1800x1 case did not fail as it did before.

The 15-node 1800x1 case in se68-jun15 (i.e., the repo from June 15th) does fail as expected. The last lines of the e3sm log are:

   0: exxhash>    1- 42.22222 1 eb35f7d9d3d61ea0 (spa-pst-sc-0)
   0: exxhash>    1- 42.22222 2                0 (spa-pst-sc-0)
   0: exxhash>    1- 42.22222 0 b811790fc51b787e (p3-pre-sc-0)
   0: exxhash>    1- 42.22222 0 fb6c4ee8e0ce9f6b (p3-pst-sc-0)
   0: exxhash>    1- 42.22222 1 4734c7a17fe701f1 (p3-pst-sc-0)
   0: exxhash>    1- 42.22222 2                0 (p3-pst-sc-0)
   0: exxhash>    1- 42.20833 0 23212f1181086835 (mac_aero_mic-pst-sc-4)
   0: exxhash>    1- 42.20833 1 3eac713e0fd651ae (mac_aero_mic-pst-sc-4)
   0: exxhash>    1- 42.20833 2                0 (mac_aero_mic-pst-sc-4)
   0: exxhash>    1- 42.20833 0 23212f1181086835 (mac_aero_mic-pre-sc-5)
   0: exxhash>    1- 42.22569 0 776cc1eba3c8dc86 (shoc-pre-sc-0)
   0: exxhash>    1- 42.22569 0 e17a00a93e56ca5c (shoc-pst-sc-0)
   0: exxhash>    1- 42.22569 1 896da6cd9ca9cc08 (shoc-pst-sc-0)
   0: exxhash>    1- 42.22569 2                0 (shoc-pst-sc-0)
 814: terminate called after throwing an instance of 'std::logic_error'

If I diff the 8-node output vs the (failed) 15-node output, there are diffs at the very beginning. However, when I diff the two completed jobs (the 8- and 22-node cases), they also differ in the hashes at the very beginning. And I see AB said there might be some confusing lines in the first few dynamics steps, so I will need to look more.

8-node:
   0: exxhash>    1-  0.00000 0                0 (SurfaceCouplingImporter-pre-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 0                0 (SurfaceCouplingImporter-pst-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 1 c0bf9100ad4c556e (SurfaceCouplingImporter-pst-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 2                0 (SurfaceCouplingImporter-pst-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 0 7057008f650423d3 (homme-pre-sc-0)                                                                                                                                                              
   0: bfbhash>              0 dc0bc57a44663ce8 (Hommexx)                                                                                                                                                                       
   0: exxhash>    1-  0.00000 0 bc78cf59ea9666f2 (homme-pst-sc-0)                                                                                                                                                              
   0: exxhash>    1-  0.00000 1 2d84a1ba48edd9f8 (homme-pst-sc-0)                                                                                                                                                              
   0: exxhash>    1-  0.00000 2 209bc791dd1ab2e7 (homme-pst-sc-0)                                                                                                                                                              
   0: exxhash>    1-  0.00000 0 2eebcd598b91fa57 (physics-pre-sc-0)                                                                                                                                                            
   0: exxhash>    1-  0.00000 0 d1f89357d42b8c8a (mac_aero_mic-pre-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 0 162248995c86b1a9 (shoc-pre-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 0 aff6844215bc8b79 (shoc-pst-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 1 89a4dc20febe561b (shoc-pst-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 2                0 (shoc-pst-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 0 acdb31d4d6035074 (cld_fraction-pre-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 0 acdb31d4d6035074 (cld_fraction-pst-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 1 24cfc7c709712da7 (cld_fraction-pst-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 2                0 (cld_fraction-pst-sc-0)               

vs the 15-node:

   0: exxhash>    1-  0.00000 0                0 (SurfaceCouplingImporter-pre-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 0                0 (SurfaceCouplingImporter-pst-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 1 c0bf9100ad4c556e (SurfaceCouplingImporter-pst-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 2                0 (SurfaceCouplingImporter-pst-sc-0)                                                                                                                                            
   0: exxhash>    1-  0.00000 0 7057008f650423d3 (homme-pre-sc-0)                                                                                                                                                              
   0: bfbhash>              0 dc0bc57a44663ce8 (Hommexx)                                                                                                                                                                     
   0: exxhash>    1-  0.00000 0 464808d2471b216a (homme-pst-sc-0)   <- ndk: first line of diff
   0: exxhash>    1-  0.00000 1 b753db32a5729541 (homme-pst-sc-0)                                                                                                                                                              
   0: exxhash>    1-  0.00000 2  bdb7bacfc767a8b (homme-pst-sc-0)                                                                                                                                                              
   0: exxhash>    1-  0.00000 0 428a404a449bb3a9 (physics-pre-sc-0)                                                                                                                                                            
   0: exxhash>    1-  0.00000 0 e59706488d3545dc (mac_aero_mic-pre-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 0   b32afd02036d57 (shoc-pre-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 0 c5b808717e589c9e (shoc-pst-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 1 9dfa6050675c4c69 (shoc-pst-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 2                0 (shoc-pst-sc-0)                                                                                                                                                               
   0: exxhash>    1-  0.00000 0 c320d44aa9cfccf2 (cld_fraction-pre-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 0 c320d44aa9cfccf2 (cld_fraction-pst-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 1 24cfc7c70a7e311d (cld_fraction-pst-sc-0)                                                                                                                                                       
   0: exxhash>    1-  0.00000 2                0 (cld_fraction-pst-sc-0) 

Looking at just the bfbhash lines in all 3 outputs, they all differ as early as the 18th step.
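Comparing bfbhash streams across logs like this can be automated. A hedged sketch of how one might locate the first diverging step between two e3sm.log files (helper names are mine; the regex matches the bfbhash> line format shown earlier in this thread):

```python
import re

HASH_RE = re.compile(r"bfbhash>\s+(\d+)\s+([0-9a-f]+)")

def read_hashes(path):
    """Map step number -> hash for every 'bfbhash>' line in an e3sm.log."""
    hashes = {}
    with open(path) as f:
        for line in f:
            m = HASH_RE.search(line)
            if m:
                hashes[int(m.group(1))] = m.group(2)
    return hashes

def first_divergence(a, b):
    """Return the first step present in both streams where the hashes
    differ, or None if all common steps agree."""
    for step in sorted(a.keys() & b.keys()):
        if a[step] != b[step]:
            return step
    return None
```

Run against two of the logs above, this should report the first differing common step (17820 in the comparison earlier in the thread).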

ndkeen commented 1 year ago

As the scream checkout from yesterday does not crash the way the June 15th repo did, I'm running the same 15-node 1800x1 case longer (without the additional hash info). That case ran 2 years without a failure: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se69-jun22/t.intel.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n015.od I think on chrysalis this case failed closer to 2 years in. Both of these are with Intel.

Note that the scream repo must not have had an upstream merge in a while, as I don't have the Intel compiler changes in there and made them locally.

ambrad commented 1 year ago

The hxxhash lines are the problem in the beginning, but what you show is just the exxhash lines. Thus, the flagged line is a valid diff, isolated to the Hommexx AD process. We see the diff propagate subsequently to all the exxhash lines. But in this case it may not be Hommexx in particular, since this is all happening immediately after init. I wonder if there is PE-layout dependence in the initialization phase that our nightly ERP/PEM tests aren't capturing.

Edit: For those wanting to know more details about the hash lines we're discussing, see https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3831923056/EAMxx+BFB+hashing.
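For intuition only, here is a minimal sketch of the idea behind BFB hashing (this is not the EAMxx implementation; see the wiki page above for the real design): accumulate the raw bit patterns of a field so that any one-bit difference between two runs produces a different printed hash.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Toy bit-exact hash: reinterpret each double's bits (no rounding or
// numeric conversion) and fold them into an accumulator. Two BFB-identical
// runs print identical hashes; any single-bit diff changes the hash.
std::uint64_t bfb_hash(const std::vector<double>& field) {
  std::uint64_t h = 0;
  for (double x : field) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    h ^= bits + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
  }
  return h;
}
```

In the real model the per-rank hashes are (as I understand it) further reduced across MPI ranks and printed per step, which is what the bfbhash>/exxhash> lines show.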

ndkeen commented 1 year ago

Just quickly: using cprnc on two cpl.hi files in the 8-node and 22-node dirs (of the June15th repo and the more recent June22nd repo), it does indicate differences. So we may have 2 issues here: non-BFB behavior across PE layouts (which should be caught with a PEM test) and a crash with too-high T for at least one PE layout.

Yep, this test fails: PEM_P1024x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel

So that's requesting a 1024-way run compared to a 512-way run, which will run on 8 nodes in the debug Q.

And using GNU passes: PEM_P1024x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_gnu

Same story with an even smaller test (runs on 1 node):

PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel  fails compare
PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_gnu    passes

trying DEBUG:

PEM_D_P1024x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel  passes
PEM_D_P128x1_Ln6.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel   passes

So we see PEM fail with the intel/OPT build only on pm-cpu (and presumably on chrysalis as well).

I even get a fail with ne4 -- should have tried that sooner: PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel Will make a separate, cleaner issue: https://github.com/E3SM-Project/scream/issues/2406

ambrad commented 1 year ago

Runs on Chrysalis show no diffs. Perhaps this is an issue isolated to pm-cpu Intel.

Script:

tests=""
for npe in 256 362 512 640; do   
    for compiler in gnu intel; do
        tests+=" PEM_P${npe}x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_${compiler}"
    done
done
$e3sm/cime/scripts/create_test $tests --machine chrysalis --project $wcid -j 64

Results showing both PASS for each test and bfbhash comparison among tests:

$ ./cs.status.20230623_164037_6v0xpu | grep Overall; for compiler in gnu intel; do echo $compiler; for i in PEM_*${compiler}*; do zgrep bfbhash $i/run/e3sm.log* | tail -n 1; done; done
  PEM_P256x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P256x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
  PEM_P362x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P362x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
  PEM_P512x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P512x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
  PEM_P640x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P640x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
gnu
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
intel
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)
ndkeen commented 1 year ago

I verified that I get the same crash (T above 500K) on chrysalis if I run with a different PE layout (2700x1 instead of the above 1800x1). I also see the same hash values in the e3sm log. /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s12-jun14/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.p2700x1

I also tried a case where I decreased the Fortran compiler optimization level from -O3 to -O2. I actually see the same results -- the hashes are identical. /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s12-jun14/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.p2700x1.noO3

ambrad commented 1 year ago

@ndkeen is it fair to summarize things as follows?

  1. The model crash with T>500K is legitimate.
  2. Chrysalis has no evident non-BFBness.
  3. In the course of investigating this issue, it was observed that pm-cpu/ifx has non-BFBness, Frontier had non-BFBness that is now (probably) resolved, and neither of these -- since they are very likely related to platform-specific compilers -- is relevant to this issue.
ndkeen commented 1 year ago

Note that pm-cpu/intel is actually using

login40% ftn --version
ifort (IFORT) 2021.9.0 20230302

I decided to also run the same case with GNU on chrysalis. Surprisingly, that also failed with high T, at model date = 00020423, which is about 22 days beyond where it happens with Intel on this machine: /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s12-jun14/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.p2700x1.gnu This result makes me wonder if there is maybe something different on this machine -- perhaps the inputdata files?

None of the other longer-running tests on pm-cpu or frontier have had issues.

I'm now running the same case with increased output. This case has hourly tendencies, but I'm not sure how to view them to help us learn more about what is happening:

/lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s12-jun14/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.p2700x1.hourlyTend
crterai commented 1 year ago

I took a look at the output and the e3sm log. The crash 'report' in the e3sm log file was quite helpful; based on it:

1287:  -----------------------------------------------------------------------
1287:      surf_sens_flux<ncol>(8)
1287:
1287:   surf_sens_flux(2)
1287:     4679.2,

This value of 4679 Wm-2 is quite high (especially given that the max solar flux is 1373 Wm-2), so something is going awry at the surface for it to get that high. Looking at the temperature tendency output that Noel shared above, the maximum temperature tendency comes from SHOC (not surprisingly), and the column with the maximum temperature tendency shows a profile consistent with a high surface sensible heat flux (see attached image).
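The failures in this thread come from exactly this kind of field range check. As a rough sketch (simplified names and a flat 1-D field; not the scream implementation), an interval property check boils down to:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Simplified interval property check: scan the field, record the min/max
// values and their entries, and fail if either falls outside [lo, hi].
// The real check also maps entries back to lat/lon for the error report.
struct CheckResult {
  bool pass;
  double min_val, max_val;
  std::size_t min_entry, max_entry;
};

CheckResult check_interval(const std::vector<double>& field, double lo, double hi) {
  CheckResult r{true,  std::numeric_limits<double>::infinity(),
                      -std::numeric_limits<double>::infinity(), 0, 0};
  for (std::size_t i = 0; i < field.size(); ++i) {
    if (field[i] < r.min_val) { r.min_val = field[i]; r.min_entry = i; }
    if (field[i] > r.max_val) { r.max_val = field[i]; r.max_entry = i; }
  }
  r.pass = (r.min_val >= lo && r.max_val <= hi);
  return r;
}
```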

ndkeen commented 1 year ago

I found something that seemed suspicious and originally reported that it was giving different results, but I think I confused myself between the intel and gnu cases -- it looks like running with or without the e3sm unified env does not impact results (for both gnu and intel).

elif [ "${MACHINE}" == "chrysalis" ]; then

    # Activate conda environment for building                                                                                                                                                                   
    source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh                                                                                                                               

So false alarm, we still have same issue as before.

ndkeen commented 1 year ago

I tried a few more experiments on chrysalis, including one where I changed the floating-point model for the Intel compiler. Currently, this machine uses precise or source, but if I add the even safer -fp-model consistent to both the Fortran and C++ builds, the case completes 2 years (ie does not crash). This is the flag I use on cori and pm-cpu. I then narrowed it down to the C++ source files only: in cime_config/machines/cmake_macros/intel_chrysalis.cmake, adding this flag only to CXX builds seems to be ok.

For the GNU build on chrysalis, which also crashed with a similar issue, I was able to work around it by reducing the opt level for C++ source files; for GNU, I also needed this for the C++ builds in ekat, ie in externals/ekat/cmake/EkatSetCompilerFlags.cmake, add: string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O")
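As a toy illustration of why value-unsafe floating-point optimization can produce machine- or layout-dependent answers (generic IEEE-754 behavior, not scream code): reassociating a sum, which stricter modes like -fp-model consistent forbid, changes the rounded result.

```cpp
// Mathematically (a + b) + c == a + (b + c), but in IEEE double precision
// the two groupings can round differently. Compilers in value-unsafe FP
// modes are free to reassociate, so different code-generation decisions
// (e.g. across vector widths or PE layouts) can yield non-BFB answers.
double sum_left (double a, double b, double c) { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```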

ndkeen commented 1 year ago

Using the July 20th scream repo, I first tried the same 2700x1 case again to verify that it fails in the same way (it does). /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s13-jul20/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.p2700x1

And then I made the change to add the consistent flag to CXX builds and ran 5 years. It completed here: /lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s13-jul20/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.p2700x1.cxxfpcons

Can we see if the output written here is enough to verify it's OK? What's the next step?

ndkeen commented 1 year ago

For the case that completed 5 years, I ran another 5 years and then tried to run longer. It then crashes around model date 0014-12-01 with:

1219: terminate called after throwing an instance of 'std::logic_error'
1219:   what():  /lcrc/group/e3sm/ac.ndkeen/wacmy/s13-jul20/components/eamxx/src/share/atm_process/atmosphere_process.cpp:442: FAIL:
1219: false
1219: Error! Failed post-condition property check (cannot be repaired).
1219:   - Atmosphere process name: shoc
1219:   - Property check name: T_mid within interval [100, 500]
1219:   - Atmosphere process MPI Rank: 1219
1219:   - Message: Check failed.
1219:   - check name: T_mid within interval [100, 500]
1219:   - field id: T_mid[Physics PG2] <double:ncol,lev>(8,128) [K]
1219:   - minimum:
1219:     - value: 209.438
1219:     - entry: (18974,13)
1219:     - lat/lon: (52.6446, 298.907)
1219:   - maximum:
1219:     - value: 513.967
1219:     - entry: (18971,127)
1219:     - lat/lon: (49.3578, 293.3)

/lcrc/group/e3sm/ac.ndkeen/scratch/chrys/s13-jul20/t.F2010-SCREAMv1.ne30pg2_ne30pg2.chrysalis.p2700x1.cxxfpcons
ndkeen commented 1 year ago

I then wanted to try this case on perlmutter for much longer (previously I had only run about 2 years). For pm-gpu and pm-cpu, I started cases -- both using the GNU compiler. They both failed with the same error.

pm-cpu fails at 00060108

1222:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se70-jul19/components/eamxx/src/share/atm_process/atmosphere_process.cpp:442: FAIL:
1222: false
1222: Error! Failed post-condition property check (cannot be repaired).
1222:   - Atmosphere process name: shoc
1222:   - Property check name: T_mid within interval [100, 500]
1222:   - Atmosphere process MPI Rank: 1222
1222:   - Message: Check failed.
1222:   - check name: T_mid within interval [100, 500]
1222:   - field id: T_mid[Physics PG2] <double:ncol,lev>(8,128) [K]
1222:   - minimum:
1222:     - value: 212.477
1222:     - entry: (18738,8)
1222:     - lat/lon: (52.3016, 309.656)
1222:   - maximum:
1222:     - value: 500.44
1222:     - entry: (18857,127)
1222:     - lat/lon: (52.6412, 304.254)

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se70-jul19/t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om

pm-gpu fails at 00060102

28:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se70-jul19/components/eamxx/src/share/atm_process/atmosphere_process.cpp:442: FAIL:
28: false
28: Error! Failed post-condition property check (cannot be repaired).
28:   - Atmosphere process name: shoc
28:   - Property check name: T_mid within interval [100, 500]
28:   - Atmosphere process MPI Rank: 28
28:   - Message: Check failed.
28:   - check name: T_mid within interval [100, 500]
28:   - field id: T_mid[Physics PG2] <double:ncol,lev>(336,128) [K]
28:   - minimum:
28:     - value: 191.035
28:     - entry: (18888,14)
28:     - lat/lon: (66.5198, 336.697)
28:   - maximum:
28:     - value: 527.788
28:     - entry: (18852,127)
28:     - lat/lon: (49.7449, 298.117)

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se70-jul19/t.gnugpu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-gpu.n016.om
crterai commented 1 year ago

In /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se70-jul19/t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om the sensible heat flux is again really high:

1222:      surf_sens_flux<ncol>(8)
1222:
1222:   surf_sens_flux(4)
1222:     4780.03,

And in /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se70-jul19/t.gnugpu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-gpu.n016.om the sensible heat flux is:

28:  -----------------------------------------------------------------------
28:      surf_sens_flux<ncol>(336)
28:
28:   surf_sens_flux(203)
28:     5524.88,
28:  -----------------------------------------------------------------------
crterai commented 1 year ago

I did some analysis on the pm-gpu case, where Noel ran with timestep-level output at the surface. I focused on high sensible heat flux cases, since they seemed to occur when we saw the hot-T crashes occurring, and they'd explain why the bottom-level temperature gets so hot. I didn't see any high SHF values in the first 6 years of this simulation, but saw a few cases in year 7. I was expecting these very large sensible heat flux (SHF) values to be coming from a warming surface, but there's no indication of that occurring -- which is quite surprising. Apart from some numerical issue, I can't think of a physical explanation for why these large SHF values, which in turn warm up the bottom layer, occur.

First, I looked at the global maximum SHF value every timestep to see how large they got (attached image). There were a few cases where the SHF got higher than 1700 Wm-2; for context, the solar constant is ~1300 Wm-2.

I mapped out where these cases occurred and found that they occur over land, which is consistent with where all of the hot-T crashes have occurred (attached image).

I was also curious whether the surface radiative temperature is really high where the SHF gets really high in these cases, but found that in many cases the atmosphere was warmer than the surface (attached image).

Finally, I looked at the time evolution of sensible heat flux, downward SW radiation at the surface, surface T (surf_radiative_T), and T_mid at the bottom level (T_mid_at_model_bot) to see what happens; the high SHF seems to just occur out of nowhere and be quite localized. https://portal.nersc.gov/cfs/e3sm/terai/SCREAM/v1_analysis/ne30_HotT_analysis/SCREAMv1_ne30_HotT_caseA.gif

Very odd...
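The screening described above amounts to a simple per-timestep scan. A sketch (hypothetical helper, not analysis code from this thread; the threshold matches the values discussed above):

```cpp
#include <algorithm>
#include <vector>

// For each timestep, take the global max of surf_sens_flux and flag
// timesteps whose max exceeds a physically implausible threshold.
// Sustained SHF well above the solar constant (~1361 W/m^2) is suspect.
std::vector<int> flag_high_shf(const std::vector<std::vector<double>>& shf_by_step,
                               double threshold_wm2) {
  std::vector<int> flagged;
  for (int t = 0; t < static_cast<int>(shf_by_step.size()); ++t) {
    const auto& shf = shf_by_step[t];
    const double mx = *std::max_element(shf.begin(), shf.end());
    if (mx > threshold_wm2) flagged.push_back(t);
  }
  return flagged;
}
```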

ndkeen commented 1 year ago

I'm seeing hangs with my runs on pm-gpu. I see them around model date = 00070103, and I've tried about 6 cases now. Some cases restart, others run straight through. I just repeated the issue with the scream repo of Aug 8th. /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se71-aug8/t.gnugpu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-gpu.n016.om.r00

For two of the hangs, I was able to peek into where it was stopped. Here are two different places:

#0  0x0000153885a3f53a in cxip_cq_eq_progress (eq=0xb056270, cq=0xb056150) at prov/cxi/src/cxip_cq.c:508
#1  cxip_cq_progress (cq=0xb056150) at prov/cxi/src/cxip_cq.c:550
#2  0x0000153885a3fca9 in cxip_util_cq_progress (util_cq=0xb056150) at prov/cxi/src/cxip_cq.c:563
#3  0x0000153885a1b111 in ofi_cq_readfrom (cq_fid=0xb056150, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#4  0x0000153888668e72 in MPIR_Wait_impl.part.0 () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#5  0x0000153889414df6 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#6  0x0000153889427549 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#7  0x0000153889336f02 in MPIR_Barrier_intra_dissemination () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#8  0x0000153887a08291 in MPIR_Barrier_intra_auto () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#9  0x0000153887a083b8 in MPIR_Barrier_impl () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#10 0x00001538896158ab in MPIR_CRAY_Barrier () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#11 0x0000153887a08480 in MPIR_Barrier () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#12 0x0000153887a93bae in PMPI_Bcast () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#13 0x000015388a34aa45 in pmpi_bcast__ () from /opt/cray/pe/lib64/libmpifort_gnu_91.so.12
#14 0x0000000000eeab07 in __shr_mpi_mod_MOD_shr_mpi_bcastr0 ()
#15 0x0000000000e2a63f in __seq_infodata_mod_MOD_seq_infodata_exchange ()
#16 0x000000000055cf2c in __component_mod_MOD_component_exch ()
#17 0x000000000054afb4 in __cime_comp_mod_MOD_cime_run ()
#18 0x000000000050a25e in main ()

and 

#0  0x00001517ac29cf08 in MPIDI_Cray_shared_mem_coll_bcast () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#1  0x00001517ac36f9ef in MPIR_CRAY_Barrier () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#2  0x00001517aa762480 in MPIR_Barrier () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#3  0x00001517aa7edbae in PMPI_Bcast () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#4  0x00001517ad0a4a45 in pmpi_bcast__ () from /opt/cray/pe/lib64/libmpifort_gnu_91.so.12
#5  0x0000000000eeaac7 in __shr_mpi_mod_MOD_shr_mpi_bcastr0 ()
#6  0x0000000000e2a5ff in __seq_infodata_mod_MOD_seq_infodata_exchange ()
#7  0x000000000055cf2c in __component_mod_MOD_component_exch ()
#8  0x000000000054afb4 in __cime_comp_mod_MOD_cime_run ()
#9  0x000000000050a25e in main ()
ndkeen commented 1 year ago

After much testing, @bogensch suggested that we use macmic=12 in SHOC (the current default is 6), as this matches the original formulation. When I ran a ne30 case (on pm-gpu) with macmic=12, it ran for at least 115 years. We think this should only affect ne30 runs (?)

I was going to make a PR from this branch: ndk/scream/set-ne30-macmic-12

AaronDonahue commented 1 year ago

Just to make sure: we basically want the shoc timestep to be <= 150s. What is the default ATM timestep for ne120, ne256, ne512? We can then compare against the default mac_aero_mic subcycling in EAMxx and capture those cases in your PR too, @ndkeen.

PeterCaldwell commented 1 year ago

ne120 has a 5 min SHOC dt so needs this treatment as well. Not sure about ne256 or ne512.

ndkeen commented 1 year ago

I would think we would want to change the ne30 default to fix the issue that we know is there. If we want to change other defaults, we can make another GH issue, as that will likely take testing.

PeterCaldwell commented 1 year ago

We might as well change both the ne30 and ne120 defaults since we know they are both wrong. Can you make this PR, Noel? I think we're all in agreement that we need to do it, but everyone's hair is on fire and I don't know how to fix it myself...

bartgol commented 1 year ago

Right now, dt_atm is not passed to the atm procs at init time. However, we may change that, so that procs can pre-compute some dt-related quantities. In particular, shoc could do

m_num_subcycles = static_cast<int>(std::ceil ( atm_dt / shoc_dt_max ) );

If we did this, there would be no need to set the number of subcycles from the input file...

I didn't make dt available at init time, since I didn't consider it an init-time parameter. But eam does it, and I don't foresee the coupler implementing time-adaptivity anyways, so we can assume dt is written in stone right from the start, and we may as well use it...
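The proposal above can be sketched as follows (illustrative names; shoc_dt_max = 150 s follows the "<= 150s" target mentioned earlier, and the 1800 s atm step in the note below is an assumption for the ne30 case):

```cpp
#include <cmath>

// Derive the number of SHOC subcycles from the atmosphere timestep and
// a maximum stable SHOC timestep, instead of hard-coding macmic per
// resolution. ceil() guarantees the effective SHOC dt never exceeds
// shoc_dt_max.
int num_subcycles(double atm_dt, double shoc_dt_max = 150.0) {
  return static_cast<int>(std::ceil(atm_dt / shoc_dt_max));
}
```

Assuming an 1800 s atmosphere step, this reproduces macmic=12; a 900 s step would give 6.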

PeterCaldwell commented 1 year ago

I'm kinda creeped out by the idea that the code would silently change the timestep it is using (or even declare that it's changing things in a log file nobody will read). I'd rather it did what the user asked for even if it was dumb. I like how homme gets around this by having the user specify all the timesteps they want to use rather than the number of substeps to take for certain things.

In this case, though, I think we should just change the default number of substeps for macro and micro for ne30 and ne120 and move on. We might need to have a further PR for ne256 and ne512, but we can do so if/when we encounter a problem (which I doubt we will).