E3SM-Project / scream

Fork of E3SM used to develop an exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

First attempts to run ne30 with v1 cime case on Perlmutter #1479

Closed ndkeen closed 2 years ago

ndkeen commented 2 years ago

Using a launch script similar to SMS_D_Ln2_P4x1.ne30_ne30.F2000SCREAMv1.perlmutter_gnu, and following the directions here:

https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3330506773/Getting+running+at+higher+resolution#Steps-for-running-at-ne30%3A

One thing I did differently is that I'm using a cdf5-format netCDF file (a conversion sketch follows the two paths below). That is, I use:

/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_scream_cdf5.nc
instead of
/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_scream.nc
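For reference, here is a minimal, hypothetical sketch of how such a CDF5 conversion can be done in Python (netCDF4's NETCDF3_64BIT_DATA format is CDF5; `nccopy -k cdf5 in.nc out.nc` does the same from the command line). This is not necessarily how the file above was actually produced:

```python
# Hypothetical conversion sketch: copy a netCDF file into CDF5 format.
# NETCDF3_64BIT_DATA is the netCDF4-python name for CDF5.
import netCDF4 as nc

src = nc.Dataset("spa_file_unified_and_complete_ne30_scream.nc")
dst = nc.Dataset("spa_file_unified_and_complete_ne30_scream_cdf5.nc", "w",
                 format="NETCDF3_64BIT_DATA")

for name, dim in src.dimensions.items():
    dst.createDimension(name, None if dim.isunlimited() else len(dim))

for name, var in src.variables.items():
    fill = getattr(var, "_FillValue", None)          # must be set at creation time
    out = dst.createVariable(name, var.dtype, var.dimensions, fill_value=fill)
    out.setncatts({a: var.getncattr(a) for a in var.ncattrs() if a != "_FillValue"})
    out[...] = var[...]

dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
dst.close()
src.close()
```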

I tried a case on Perlmutter using a cpu-only build. This is using 1 node, but with 4 MPI's. I also see the same error message using 4 nodes and 4 MPI's (1 MPI per node). The following is from a DEBUG attempt, and I will also try without DEBUG.

0: CalcWorkPerBlock: Total blocks:          1 Ice blocks:          1 IceFree blocks:          0 Land blocks:          0
0: Atmosphere step = 0; model time = 0001-01-01 00:00:00
0: e3sm.exe: /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:535: Homme::Remap::Ppm::PpmVertRemap<boundaries>::compute_partitions<Homme::Remap::Ppm::PpmLimitedExtrap>::<lambda(const int&)>::<lambda()>: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
0: 
0: Program received signal SIGABRT: Process abort signal.
0: 
0: Backtrace for this error:
0: #0  0x145d553983df in ???
0: #1  0x145d55398360 in ???
0: #2  0x145d55399940 in ???
0: #3  0x145d55390a59 in ???
0: #4  0x145d55390ad1 in ???
0: #5  0x2ab1d61 in _ZZZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE18compute_partitionsERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_ENKUlRKiE_clEST_ENKUlvE1_clEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:535
0: #6  0x2ac13a8 in _ZN6Kokkos6singleINS_4Impl20HostThreadTeamMemberINS_6SerialEEEZZNK5Homme5Remap3Ppm12PpmVertRemapINS7_16PpmLimitedExtrapEE18compute_partitionsERNS5_15KernelVariablesENS_4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSG_9VectorTagINSG_4SIMDIdS3_EELi16EEEEEJNS_11LayoutRightENS_9HostSpaceENS_12MemoryTraitsILj9EEEEEESV_ENKUlRKiE_clESX_EUlvE1_EENSt9enable_ifIXsrNS1_26is_host_thread_team_memberIT_EE5valueEvE4typeERKNS1_18VectorSingleStructIS12_EERKT0_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_HostThreadTeam.hpp:1072
0: #7  0x2ab15a9 in _ZZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE18compute_partitionsERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_ENKUlRKiE_clEST_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:524
0: #8  0x2ac1434 in _ZN6Kokkos12parallel_forIiZNK5Homme5Remap3Ppm12PpmVertRemapINS3_16PpmLimitedExtrapEE18compute_partitionsERNS1_15KernelVariablesENS_4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSC_9VectorTagINSC_4SIMDIdNS_6SerialEEELi16EEEEEJNS_11LayoutRightENS_9HostSpaceENS_12MemoryTraitsILj9EEEEEESS_EUlRKiE_NS_4Impl20HostThreadTeamMemberISG_EEEEvRKNSW_31TeamThreadRangeBoundariesStructIT_T1_EERKT0_PPKNSt9enable_ifIXsrNSW_26is_host_thread_team_memberIS11_EE5valueEvE4typeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_HostThreadTeam.hpp:842
0: #9  0x2ab24ee in _ZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE18compute_partitionsERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:490
0: #10  0x2aac32b in _ZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE19compute_grids_phaseERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:196
0: #11  0x2ada1c9 in _ZNK5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEEclENS6_15ComputeGridsTagERKN6Kokkos4Impl20HostThreadTeamMemberINS8_6SerialEEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:393
0: #12  0x2ad6aa2 in _ZNK6Kokkos4Impl11ParallelForIN5Homme5Remap12RemapFunctorILb1ENS3_3Ppm12PpmVertRemapINS5_16PpmLimitedExtrapEEEEENS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEENS9_15ComputeGridsTagEEEESB_E4execISE_EENSt9enable_ifIXntsrSt7is_sameIT_vE5valueEvE4typeERNS0_18HostThreadTeamDataE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Serial.hpp:960
0: #13  0x2ad1f33 in _ZNK6Kokkos4Impl11ParallelForIN5Homme5Remap12RemapFunctorILb1ENS3_3Ppm12PpmVertRemapINS5_16PpmLimitedExtrapEEEEENS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEENS9_15ComputeGridsTagEEEESB_E7executeEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Serial.hpp:979
0: #14  0x2acc08a in _ZN6Kokkos12parallel_forINS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEEN5Homme5Remap12RemapFunctorILb1ENS6_3Ppm12PpmVertRemapINS8_16PpmLimitedExtrapEEEE15ComputeGridsTagEEEESC_EEvRKT_RKT0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNSt9enable_ifIXsrNS_19is_execution_policyISF_EE5valueEvE4typeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Parallel.hpp:142
0: #15  0x2ac46d4 in _ZN6Kokkos12parallel_forINS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEEN5Homme5Remap12RemapFunctorILb1ENS6_3Ppm12PpmVertRemapINS8_16PpmLimitedExtrapEEEE15ComputeGridsTagEEEESC_EEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Parallel.hpp:173
0: #16  0x2ab77fc in _ZN5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEE11run_functorINS6_15ComputeGridsTagEEEvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:587
0: #17  0x2aadec9 in _ZN5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEE9run_remapEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:460
0: #18  0x2aa8d53 in _ZN5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEE9run_remapEiid
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:441
0: #19  0x2a9205c in _ZNK5Homme20VerticalRemapManager9run_remapEiid
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/VerticalRemapManager.cpp:128
0: #20  0x2a91c2a in _ZN5Homme14vertical_remapEd
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/vertical_remap.cpp:21
0: #21  0x2a8e1ed in prim_run_subcycle_c
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/prim_driver.cpp:157
0: #22  0x27d8b46 in __prim_driver_mod_MOD_prim_run_subcycle
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/theta-l_kokkos/prim_driver_mod.F90:434
0: #23  0x2751193 in prim_run_f90
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/interface/homme_driver_mod.F90:219
0: #24  0x26cf3c7 in _ZN6scream13HommeDynamics8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:410
0: #25  0x2f9d9f5 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #26  0x2fad087 in _ZN6scream22AtmosphereProcessGroup14run_sequentialEd
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:158
0: #27  0x2facf97 in _ZN6scream22AtmosphereProcessGroup8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:145
0: #28  0x2f9d9f5 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #29  0x25584ec in _ZN6scream7control16AtmosphereDriver3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/control/atmosphere_driver.cpp:802
0: #30  0x5c15eb in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:204
0: #31  0x5c1f15 in fpe_guard_wrapper<scream_run(const Real&)::<lambda()> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:50
0: #32  0x5c160e in scream_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:201
0: #33  0x5bd6a4 in __atm_comp_mct_MOD_atm_run_mct
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/atm_comp_mct.F90:209
0: #34  0x43ec7c in __component_mod_MOD_component_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/component_mod.F90:728
0: #35  0x423d90 in __cime_comp_mod_MOD_cime_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_comp_mod.F90:3082
0: #36  0x43c334 in cime_driver
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_driver.F90:153
0: #37  0x43c397 in main
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_driver.F90:23
ndkeen commented 2 years ago

After https://github.com/E3SM-Project/scream/pull/1500 (which is now in scream master), I confirmed that I can now run 2 steps WITH SPA (cpu-only) and that the output format is now cdf5.

ndkeen commented 2 years ago

More good news: using the March 23 master, which includes the upstream merge and the above-mentioned PR 1500, I am able to run 2 steps with ne30 using GPU's. I first tried 1 node with 4 MPI's/GPU's, which also fails with the pool error, but when I try using 4 PM nodes with 16 MPI's/GPU's, it completes. This test was built DEBUG and had output off, so it's possible that the GPU attempts above just needed more memory.

However, running beyond 2 steps fails (in fact, on the very next step):

10: Negative (or nan) layer thickness detected, aborting!

I also tried ne30 cpu-only with output and it seems OK for 2 steps.

With a CPU-only OPT build, I see a new error:

15: array_io_read failed with: /pscratch/sd/n/ndk/wacmy/s17-mar23/externals/ekat/src/ekat/util/ekat_file_utils.hpp:24: FAIL:
15: nread == sz
15: read: nread = 1023 sz = 3000
15: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation p3_iso_c::p3_init: One or more table files exists but gave a read error.
 0: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation p3_iso_c::p3_init: One or more table files exists but gave a read error.
 0: array_io_read failed with: /pscratch/sd/n/ndk/wacmy/s17-mar23/externals/ekat/src/ekat/util/ekat_file_utils.hpp:24: FAIL:
 0: nread == sz
 0: read: nread = 1023 sz = 3000
15: terminate called after throwing an instance of 'std::logic_error'
15:   what():  /pscratch/sd/n/ndk/wacmy/s17-mar23/components/scream/src/physics/p3/p3_f90.cpp:122: FAIL:
15: info == 0
15: p3_init_c returned info -1
15:

Note the actual error message looks odd because the SPA warning message is missing a newline (https://github.com/E3SM-Project/scream/issues/1434).

PeterCaldwell commented 2 years ago

Regarding the CPU-only OPT error: I'm surprised v1 is calling p3_f90.cpp at all - isn't that a bridge function just used for F90/C++ BFB testing? Do you have a stack trace for this, Noel? Does calling p3_f90.cpp make sense to you, @AaronDonahue or @jgfouca ? It looks to me like the fail in p3_init is in reading the table lookup file. It would be nice if we had a log message stating which file we are reading whenever we read a file (at least in debug mode).

Regarding the step 3 fail on GPUs: did you ever try running with a shorter dt, Noel?

ndkeen commented 2 years ago

The only fail in init right now is https://github.com/E3SM-Project/scream/issues/1505 which only happens sometimes.

I did try with se_tstep: 100 and don't see anything different.

I have also tried changing ATM_NCPL without seeing a difference, but since these settings are a bit confusing, it would help to spell out which combinations are worth trying, and I can run them.
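For context on the knobs being varied here (assuming the usual NCPL_BASE_PERIOD of one day, so the coupling timestep is dtime = 86400 / ATM_NCPL, while se_tstep sets the dycore step separately), the values tried in this thread work out as follows:

```python
# Coupling timestep implied by each ATM_NCPL value tried in this thread
# (dtime = seconds per day / couplings per day).
SECONDS_PER_DAY = 86400
for atm_ncpl in (288, 576, 720):   # 288 corresponds to the default dtime=300s
    print(f"ATM_NCPL={atm_ncpl} -> dtime={SECONDS_PER_DAY // atm_ncpl} s")
# ATM_NCPL=288 -> dtime=300 s
# ATM_NCPL=576 -> dtime=150 s
# ATM_NCPL=720 -> dtime=120 s
```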

Update: I just tried ATM_NCPL=720, which should be a 2-minute dtime, with se_tstep: 60, and was able to complete 4 steps (on PM with GPU). I tried again for longer on both PM/GPU and cori-knl, and it stops in the same way on both machines after the 5th step:

  0: Atmosphere step = 5; model time = 0001-01-01 00:10:00
450:  ERROR:
450:  component_mod:check_fields NaN found in ATM instance:    1 field Sa_z 1d global
450:   index:    32761
450: Image              PC                Routine            Line        Source
450: e3sm.exe           0000000008B1EABA  Unknown               Unknown  Unknown
450: e3sm.exe           000000000557480A  shr_abort_mod_mp_         114  shr_abort_mod.F90
450: e3sm.exe           0000000005574670  shr_abort_mod_mp_          61  shr_abort_mod.F90
450: e3sm.exe           000000000048332A  component_type_mo         257  component_type_mod.F90
450: e3sm.exe           000000000047B1A8  component_mod_mp_         754  component_mod.F90
450: e3sm.exe           000000000043A8A7  cime_comp_mod_mp_        3077  cime_comp_mod.F90
450: e3sm.exe           00000000004628D0  MAIN__                    153  cime_driver.F90

Peter wanted me to try a simple halving of the timestep, so that's ATM_NCPL=576 (dtime=150 seconds) and se_tstep: 150. With this it fails with NLT (negative layer thickness) after step 4.

  0: Atmosphere step = 4; model time = 0001-01-01 00:10:00
674: Negative (or nan) layer thickness detected, aborting!
674: Exiting...
PeterCaldwell commented 2 years ago

Great, thanks Noel. So it seems like we can run almost twice as far with a timestep that's ~twice as long. That is what we would expect if it was a real physical instability rather than something that always keys off the 3rd timestep or something. Are you saving output every timestep? If not, could you do so and then point me to the output? You can change the output frequency in run/data/scream_output.yaml: change "Frequency" near the bottom of that file. You probably also want to change "Max Snapshots per Field" to 1 so we are sure to get all the output up to the time it crashes (since netcdfs often don't get flushed until they are closed).

ndkeen commented 2 years ago

Here is a case where I think I'm writing every step.

/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s18-mar23/f30.F2000SCREAMv1.ne30_ne30.s18-mar23.gnugpu.12s.n003a4x8.DEBUG.Hremap512.K0def.WSM.Q10.nospa.nan.N576.ts150.os1
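For anyone wanting to reproduce the maps Peter shows next, here is a minimal sketch of opening an every-step history file like this one and plotting the step-1 fields discussed below. The file name, the SW variable name, and the dimension/coordinate names are assumptions rather than values read from that run directory (T_mid is the name used in the thread):

```python
# Hypothetical plotting sketch: surface SW down and model-top T_mid at step 1.
# File name, "SW_flux_dn", and the lev/ilev/lat/lon names are assumptions.
import matplotlib.pyplot as plt
import xarray as xr

ds = xr.open_dataset("output.scream.INSTANT.nsteps_x1.nc")

swdn_sfc = ds["SW_flux_dn"].isel(time=1, ilev=-1)   # bottom interface ~ surface
tmid_top = ds["T_mid"].isel(time=1, lev=0)          # top of the model

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, field, title in [(axes[0], swdn_sfc, "surface SW down, step 1"),
                         (axes[1], tmid_top, "T_mid at model top, step 1")]:
    pts = ax.scatter(ds["lon"], ds["lat"], c=field, s=2)
    ax.set_title(title)
    fig.colorbar(pts, ax=ax)
plt.show()
```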

PeterCaldwell commented 2 years ago

Wow, this is an interesting case. There's definitely a bug in SW radiation. Check out surface SW down at timestep 1 (all timesteps are similar):

[screenshot: Screen Shot 2022-03-28 at 5.09.30 PM]

which I think is responsible for the "psychedelic haircut" in T_mid (also from step 1). Note this is T_mid at the top of the model. The surface looks normal:

[screenshot: Screen Shot 2022-03-28 at 5.13.33 PM]

Note in particular the "mole" of bright yellow below the right side of the haircut. This mole is ultimately what grows to 132,042 K by step 5 and causes the model to crash. I'm not sure whether the mole is related to the radiation problem or is something different. Do we know whether the sponge layer is active in our CIME runs?

One other thing to note is that LWdn also displays the "haircut" geometry in step 1 and grows to -369,543 W/m2 over the mole by the end of the simulation. I think this is a natural reaction to the ridiculous T_mid in the mole, but just mentioning it.

Overall, I think the haircut is due to rad info not getting passed to MPIs correctly. @ndkeen - could you do the same run with the same output, but with 2x more MPIs? @brhillman - does the haircut ring any bells for you?

ndkeen commented 2 years ago

Can it be just a different number of MPI's? That was with 12 MPI's -- OK to try 8 MPI's or 16 MPI's?

PeterCaldwell commented 2 years ago

Yeah, any number of MPIs is fine. Knowing that it's 12 MPIs makes me think that's not the problem. The haircut looks like 4 MPIs misplaced... but we should still try a different MPI count just in case. Now I'm thinking it is something related to the zenith angle or the grid.

ndkeen commented 2 years ago

Yea, it just gets through the queue quicker with 4 or fewer nodes. This should be the same thing but on 8 MPI's instead of 12.

/global/cfs/cdirs/e3sm/ndk/f30.F2000SCREAMv1.ne30_ne30.s18-mar23.gnugpu.12s.n002a4x8.DEBUG.Hremap512.K00def.WSM.Q10.nospa.nan.N576.ts150.os1

PeterCaldwell commented 2 years ago

The 8 MPI run looks identical to the 12 MPI version (which is good!). So I guess MPIs aren't the problem. I think adding print statements to the code around where we're getting the divide by zero and adding zenith angle(?) to the output is the next step...

mt5555 commented 2 years ago

To check if the sponge layer is active in SCREAM v0: look for the "raytay0" and "nu_top" settings in the namelist. I don't think SCREAM v1 has a Rayleigh friction option, so hopefully SCREAM v0 was run with raytay0=0.
nu_top is resolution dependent, but should be 2.5e5 for 1 degree.

more details: https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/2967798203/EAM+Top+of+Model+Sponge+Layer

PeterCaldwell commented 2 years ago

@mt5555 - I just checked the run which I plotted above (which is a v1 run, not a v0 run) and found nu_top = 250000.0; raytay0 is not included. Is just setting nu_top sufficient for turning on the sponge? Is 250,000 the appropriate number for ne30? I'll try an ne30 v0 run, but probably won't have time until this afternoon...

jgfouca commented 2 years ago

@PeterCaldwell, in order to avoid duplication, we left some of the p3 init stuff in Fortran. Now that we are leaving the Fortran behind, I wonder whether it might make sense to move it all over to C++.

PeterCaldwell commented 2 years ago

Ok, thanks Jim. It would be good to be pure C++, but I don't think we need to do that now. I was just surprised that v1 was calling F90 in P3.

ndkeen commented 2 years ago

Using -DRRTMGP_EXPENSIVE_CHECKS in builds, I found an issue a little higher up the food chain, which led to Ben fixing an issue in the input file. With the new file /global/cfs/cdirs/e3sm/bhillma/scream/data/init/screami_ne30np4L72_20220329.nc, we've made more progress with ne30. I'm still trying to see which situations cause crashes, what's slow, and what's non-BFB.

ndkeen commented 2 years ago

Now that we can kinda say we are running ne30 as a cime case, maybe we could close this issue and open more specific ones.

So far, I have been able to run over a day with the default dtime=300s, SPA, and some output. However, I'm hitting some fails at seemingly random points (i.e., not at the same step). With OPT builds, there is no useful information about the fail. Trying again with DEBUG, I see the following error, which happened twice at 2 different steps (in this case, steps 62 and 72):

 3: FATAL ERROR:
 3: gas_optics(): array tsfc has values outside range

which we know is coming from code under the RRTMGP_EXPENSIVE_CHECKS macro.
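Since the OPT builds give no useful information, one cheap offline check is to scan the most recent history file for NaNs or wildly out-of-range values, i.e., the same kind of problem the RRTMGP tsfc range check and the coupler's check_fields NaN trap catch at runtime. A minimal sketch with a placeholder file name:

```python
# Report min/max (and any NaNs) for every floating-point variable in a history file.
import numpy as np
import xarray as xr

ds = xr.open_dataset("output.scream.INSTANT.nsteps_x1.nc")  # placeholder name
for name, var in ds.data_vars.items():
    if not np.issubdtype(var.dtype, np.floating):
        continue
    vals = var.values
    flag = "  <-- contains NaN" if np.isnan(vals).any() else ""
    print(f"{name}: min={np.nanmin(vals):.4g} max={np.nanmax(vals):.4g}{flag}")
```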

Also, as Conrad pointed out and I confirmed on PM, with GPU builds we are not BFB after the 2nd step between 2 otherwise identical runs. This may be OK, as there are known issues with rad not being BFB on GPU. We do expect the CPU-only cases to be BFB between runs, and I've verified that is true (at least by looking at the values written to e3sm.log). I tested with and without threads; within each case, 2 runs are BFB with each other.
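A sketch of how that run-to-run BFB check could be automated from the history files instead of eyeballing e3sm.log; the paths are placeholders:

```python
# Compare two otherwise-identical runs variable by variable; BFB means exact equality.
import numpy as np
import xarray as xr

a = xr.open_dataset("run1/output.scream.INSTANT.nsteps_x1.nc")  # placeholder paths
b = xr.open_dataset("run2/output.scream.INSTANT.nsteps_x1.nc")

for name in a.data_vars:
    va, vb = a[name].values, b[name].values
    if not np.issubdtype(va.dtype, np.number):
        continue
    if not np.array_equal(va, vb):
        print(f"{name}: NOT bit-for-bit (max |diff| = {np.max(np.abs(va - vb)):.3e})")
```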

ndkeen commented 2 years ago

Closing this issue as we are beyond this point and some of the steps here are no longer valid.