E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

First attempts to run ne30 with v1 cime case on Perlmutter #1479

Closed ndkeen closed 2 years ago

ndkeen commented 2 years ago

Using a launch script similar to SMS_D_Ln2_P4x1.ne30_ne30.F2000SCREAMv1.perlmutter_gnu, and following directions here:

https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3330506773/Getting+running+at+higher+resolution#Steps-for-running-at-ne30%3A

One thing I did differently is that I'm using a cdf5-format netCDF file. That is, I use:

/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_scream_cdf5.nc
instead of
/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_scream.nc

I tried a case on Perlmutter using a CPU-only build. This is using 1 node, but with 4 MPIs. I also see the same error message using 4 nodes and 4 MPIs (1 MPI per node). The following is from a DEBUG attempt; I will also try without DEBUG.

0: CalcWorkPerBlock: Total blocks:          1 Ice blocks:          1 IceFree blocks:          0 Land blocks:          0
0: Atmosphere step = 0; model time = 0001-01-01 00:00:00
0: e3sm.exe: /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:535: Homme::Remap::Ppm::PpmVertRemap<boundaries>::compute_partitions<Homme::Remap::Ppm::PpmLimitedExtrap>::<lambda(const int&)>::<lambda()>: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
0: 
0: Program received signal SIGABRT: Process abort signal.
0: 
0: Backtrace for this error:
0: #0  0x145d553983df in ???
0: #1  0x145d55398360 in ???
0: #2  0x145d55399940 in ???
0: #3  0x145d55390a59 in ???
0: #4  0x145d55390ad1 in ???
0: #5  0x2ab1d61 in _ZZZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE18compute_partitionsERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_ENKUlRKiE_clEST_ENKUlvE1_clEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:535
0: #6  0x2ac13a8 in _ZN6Kokkos6singleINS_4Impl20HostThreadTeamMemberINS_6SerialEEEZZNK5Homme5Remap3Ppm12PpmVertRemapINS7_16PpmLimitedExtrapEE18compute_partitionsERNS5_15KernelVariablesENS_4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSG_9VectorTagINSG_4SIMDIdS3_EELi16EEEEEJNS_11LayoutRightENS_9HostSpaceENS_12MemoryTraitsILj9EEEEEESV_ENKUlRKiE_clESX_EUlvE1_EENSt9enable_ifIXsrNS1_26is_host_thread_team_memberIT_EE5valueEvE4typeERKNS1_18VectorSingleStructIS12_EERKT0_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_HostThreadTeam.hpp:1072
0: #7  0x2ab15a9 in _ZZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE18compute_partitionsERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_ENKUlRKiE_clEST_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:524
0: #8  0x2ac1434 in _ZN6Kokkos12parallel_forIiZNK5Homme5Remap3Ppm12PpmVertRemapINS3_16PpmLimitedExtrapEE18compute_partitionsERNS1_15KernelVariablesENS_4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSC_9VectorTagINSC_4SIMDIdNS_6SerialEEELi16EEEEEJNS_11LayoutRightENS_9HostSpaceENS_12MemoryTraitsILj9EEEEEESS_EUlRKiE_NS_4Impl20HostThreadTeamMemberISG_EEEEvRKNSW_31TeamThreadRangeBoundariesStructIT_T1_EERKT0_PPKNSt9enable_ifIXsrNSW_26is_host_thread_team_memberIS11_EE5valueEvE4typeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_HostThreadTeam.hpp:842
0: #9  0x2ab24ee in _ZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE18compute_partitionsERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:490
0: #10  0x2aac32b in _ZNK5Homme5Remap3Ppm12PpmVertRemapINS1_16PpmLimitedExtrapEE19compute_grids_phaseERNS_15KernelVariablesEN6Kokkos4ViewIA4_A4_A5_KN13KokkosKernels7Batched12Experimental6VectorINSB_9VectorTagINSB_4SIMDIdNS7_6SerialEEELi16EEEEEJNS7_11LayoutRightENS7_9HostSpaceENS7_12MemoryTraitsILj9EEEEEESR_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:196
0: #11  0x2ada1c9 in _ZNK5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEEclENS6_15ComputeGridsTagERKN6Kokkos4Impl20HostThreadTeamMemberINS8_6SerialEEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:393
0: #12  0x2ad6aa2 in _ZNK6Kokkos4Impl11ParallelForIN5Homme5Remap12RemapFunctorILb1ENS3_3Ppm12PpmVertRemapINS5_16PpmLimitedExtrapEEEEENS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEENS9_15ComputeGridsTagEEEESB_E4execISE_EENSt9enable_ifIXntsrSt7is_sameIT_vE5valueEvE4typeERNS0_18HostThreadTeamDataE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Serial.hpp:960
0: #13  0x2ad1f33 in _ZNK6Kokkos4Impl11ParallelForIN5Homme5Remap12RemapFunctorILb1ENS3_3Ppm12PpmVertRemapINS5_16PpmLimitedExtrapEEEEENS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEENS9_15ComputeGridsTagEEEESB_E7executeEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Serial.hpp:979
0: #14  0x2acc08a in _ZN6Kokkos12parallel_forINS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEEN5Homme5Remap12RemapFunctorILb1ENS6_3Ppm12PpmVertRemapINS8_16PpmLimitedExtrapEEEE15ComputeGridsTagEEEESC_EEvRKT_RKT0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNSt9enable_ifIXsrNS_19is_execution_policyISF_EE5valueEvE4typeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Parallel.hpp:142
0: #15  0x2ac46d4 in _ZN6Kokkos12parallel_forINS_10TeamPolicyIJNS_6SerialENS_12LaunchBoundsILj512ELj1EEEN5Homme5Remap12RemapFunctorILb1ENS6_3Ppm12PpmVertRemapINS8_16PpmLimitedExtrapEEEE15ComputeGridsTagEEEESC_EEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Parallel.hpp:173
0: #16  0x2ab77fc in _ZN5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEE11run_functorINS6_15ComputeGridsTagEEEvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:587
0: #17  0x2aadec9 in _ZN5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEE9run_remapEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:460
0: #18  0x2aa8d53 in _ZN5Homme5Remap12RemapFunctorILb1ENS0_3Ppm12PpmVertRemapINS2_16PpmLimitedExtrapEEEE9run_remapEiid
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/RemapFunctor.hpp:441
0: #19  0x2a9205c in _ZNK5Homme20VerticalRemapManager9run_remapEiid
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/VerticalRemapManager.cpp:128
0: #20  0x2a91c2a in _ZN5Homme14vertical_remapEd
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/vertical_remap.cpp:21
0: #21  0x2a8e1ed in prim_run_subcycle_c
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/prim_driver.cpp:157
0: #22  0x27d8b46 in __prim_driver_mod_MOD_prim_run_subcycle
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/theta-l_kokkos/prim_driver_mod.F90:434
0: #23  0x2751193 in prim_run_f90
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/interface/homme_driver_mod.F90:219
0: #24  0x26cf3c7 in _ZN6scream13HommeDynamics8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:410
0: #25  0x2f9d9f5 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #26  0x2fad087 in _ZN6scream22AtmosphereProcessGroup14run_sequentialEd
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:158
0: #27  0x2facf97 in _ZN6scream22AtmosphereProcessGroup8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:145
0: #28  0x2f9d9f5 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #29  0x25584ec in _ZN6scream7control16AtmosphereDriver3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/control/atmosphere_driver.cpp:802
0: #30  0x5c15eb in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:204
0: #31  0x5c1f15 in fpe_guard_wrapper<scream_run(const Real&)::<lambda()> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:50
0: #32  0x5c160e in scream_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:201
0: #33  0x5bd6a4 in __atm_comp_mct_MOD_atm_run_mct
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/atm_comp_mct.F90:209
0: #34  0x43ec7c in __component_mod_MOD_component_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/component_mod.F90:728
0: #35  0x423d90 in __cime_comp_mod_MOD_cime_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_comp_mod.F90:3082
0: #36  0x43c334 in cime_driver
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_driver.F90:153
0: #37  0x43c397 in main
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_driver.F90:23
PeterCaldwell commented 2 years ago

huh. This fail is in homme remap. I think the fail you slacked me earlier today was in pio. Sanity check question: is “se_ne: 30” set correctly in scream_input.yaml?

ndkeen commented 2 years ago

When I try with only 1 MPI, I get a message more clearly indicating OOM:

0:  Initing prim data structures...
0: terminate called after throwing an instance of 'std::runtime_error'
0:   what():  Kokkos failed to allocate memory for label "Qdp_dyn".  Allocation using MemorySpace named "Host" failed with the following error:  Allocation of size 1.718e+10 G failed, likely due to insufficient memory.  (The allocation mechanism was standard malloc().)
0:
0: Program received signal SIGABRT: Process abort signal.
0:
0: Backtrace for this error:
0: #0  0x14b51f7723df in ???
0: #1  0x14b51f772360 in ???
0: #2  0x14b51f773940 in ???
0: #3  0x14b525adbf59 in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
0:      at ../../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libstdc++-v3/libsupc++/vterminate.cc:95
0: #4  0x14b525ae77b9 in _ZN10__cxxabiv111__terminateEPFvvE
0:      at ../../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libstdc++-v3/libsupc++/eh_terminate.cc:48
0: #5  0x14b525ae7824 in _ZSt9terminatev
0:      at ../../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libstdc++-v3/libsupc++/eh_terminate.cc:58
0: #6  0x14b525ae7ab7 in __cxa_throw
0:      at ../../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libstdc++-v3/libsupc++/eh_throw.cc:95
0: #7  0x32985b2 in _ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Error.cpp:72
0: #8  0x329bfbe in _ZN6Kokkos4Impl41safe_throw_allocation_with_header_failureERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKNS_12Experimental26RawMemoryAllocationFailureE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_MemorySpace.cpp:79
0: #9  0x32a730e in _ZN6Kokkos4Impl30checked_allocation_with_headerINS_9CudaSpaceEEEPNS0_22SharedAllocationHeaderERKT_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEm
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_MemorySpace.hpp:76
0: #10  0x32a4f03 in _ZN6Kokkos4Impl22SharedAllocationRecordINS_9CudaSpaceEvEC2ERKS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmPFvPNS1_IvvEEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:531
0: #11  0x2798c1a in _ZN6Kokkos4Impl22SharedAllocationRecordINS_9CudaSpaceENS0_16ViewValueFunctorINS_6DeviceINS_4CudaES2_EEcLb1EEEEC2ERKS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEm
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:343
0: #12  0x27944f5 in _ZN6Kokkos4Impl22SharedAllocationRecordINS_9CudaSpaceENS0_16ViewValueFunctorINS_6DeviceINS_4CudaES2_EEcLb1EEEE8allocateERKS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEm
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:364
0: #13  0x31bdb80 in _ZN6Kokkos4Impl11ViewMappingINS_10ViewTraitsIPcJNS_11LayoutRightENS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_12MemoryTraitsILj0EEEEEEJvEE15allocate_sharedIJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt17integral_constantIjLj0EES7_S6_EEEPNS0_22Share\
dAllocationRecordIvvEERKNS0_12ViewCtorPropIJDpT_EEERKS4_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:3362
0: #14  0x31bcfbc in _ZN6Kokkos4ViewIPcJNS_11LayoutRightENS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_12MemoryTraitsILj0EEEEEC2IJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEERKNS_4Impl12ViewCtorPropIJDpT_EEERKNSt9enable_ifIXntsrSL_11has_pointerES2_E4typeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_View.hpp:1551
0: #15  0x31bbd5b in _ZN6Kokkos4ViewIPcJNS_11LayoutRightENS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_12MemoryTraitsILj0EEEEEC2INSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEERKT_NSt9enable_ifIXsrNS_4Impl13is_view_labelISH_EE5valueEKmE4typeEmmmmmmm
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_View.hpp:1694
0: #16  0x31b665b in _ZN6scream5Field13allocate_viewEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/field/field.cpp:130
0: #17  0x2719b5d in _ZN6scream13HommeDynamics19create_helper_fieldERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_8FieldTagESaISA_EERKS9_IiSaIiEES8_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:700
0: #18  0x271129c in _ZN6scream13HommeDynamics9set_gridsESt10shared_ptrIKNS_12GridsManagerEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:173
0: #19  0x319ba1b in _ZN6scream22AtmosphereProcessGroup9set_gridsESt10shared_ptrIKNS_12GridsManagerEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:95
0: #20  0x255f5c2 in _ZN6scream7control16AtmosphereDriver12create_gridsEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/control/atmosphere_driver.cpp:146
0: #21  0x5c1eca in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:120

If I can trust what's written in e3sm.log as being up-to-date, it looks like the 1-node 1-MPI case that hits OOM is failing earlier than the 1-node 4-MPI case, which I realize doesn't make sense.

FWIW, I see this same error when running 1 MPI with the CPU-only build and with a 1-MPI case using the GPU build.

PeterCaldwell commented 2 years ago

@bartgol, @mt5555, @oksanaguba - could one of you check out Noel's errors above or assign the task to someone else involved in the dycore? Both fails are showing up in the dycore: on 1 MPI, trying to allocate an eye-watering 1.718e+10 GB for Qdp_dyn (scroll right on the error message), and for 4 MPIs within 1 node, in components/homme/src/share/cxx/PpmRemap.hpp:535. There's a good chance the problem is how we set things up rather than anything in HOMME, but I suspect you can tell us what's going on quicker than we can figure it out ourselves...

ndkeen commented 2 years ago

When I make the slurm gpu change noted in https://github.com/E3SM-Project/scream/issues/1443 and try again with 4 nodes, 1 MPI on each, I now get a different error:

0:    rearth: 6376000.000000
0:
0: **********************************************************
2: :0: : block: [4891,0,0], thread: [0,32,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,33,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,34,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,35,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,36,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,37,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,38,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,39,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,40,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,41,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,42,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,43,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,44,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,45,0] Assertion `View bounds error of view Workspace.m_data` failed.
2: :0: : block: [4891,0,0], thread: [0,46,0] Assertion `View bounds error of view Workspace.m_data` failed.

(I see the same error with 1 MPI as in previous comments -- which is expected)

mt5555 commented 2 years ago

The 1-node allocation error: can Noel add some print statements before the malloc()? Either the variables have been corrupted due to some other bug, or the error message is just wrong.

Regarding the error in the remap: that code is suggesting the levels have gone bad. But the fact that it went away with the GPU change suggests we don't need to worry about that. Instead, we need to fix the "View bounds error ...". For that, I think @bartgol has to look at it.

ndkeen commented 2 years ago

For the 1-MPI run that hits OOM, here are a few debug prints:

0: ndk HommeDynamics::create_helper_field name=                                             v_dyn pack_size=1
0: ndk Field::allocate_view view_dim=   298598400 id.name()=v_dyn
0: ndk HommeDynamics::create_helper_field name=                                     vtheta_dp_dyn pack_size=1
0: ndk Field::allocate_view view_dim=   149299200 id.name()=vtheta_dp_dyn
0: ndk HommeDynamics::create_helper_field name=                                          dp3d_dyn pack_size=1
0: ndk Field::allocate_view view_dim=   149299200 id.name()=dp3d_dyn
0: ndk HommeDynamics::create_helper_field name=                                         w_int_dyn pack_size=1
0: ndk Field::allocate_view view_dim=   151372800 id.name()=w_int_dyn
0: ndk HommeDynamics::create_helper_field name=                                       phi_int_dyn pack_size=1
0: ndk Field::allocate_view view_dim=   151372800 id.name()=phi_int_dyn
0: ndk HommeDynamics::create_helper_field name=                                            ps_dyn pack_size=1
0: ndk Field::allocate_view view_dim=     2073600 id.name()=ps_dyn
0: ndk HommeDynamics::create_helper_field name=                                          phis_dyn pack_size=1
0: ndk Field::allocate_view view_dim=      691200 id.name()=phis_dyn
0: ndk HommeDynamics::set_grids Qdp_dyn nelem=      5400 QTL=         2 HOMMEXX_QSIZE_D=        35 NP=         4 nlev_mid=        72 NGP=         4 NTL=         3 N=         1
0: terminate called after throwing an instance of 'std::runtime_error'
0:   what():  Kokkos failed to allocate memory for label "Qdp_dyn".  Allocation using MemorySpace named "Cuda" failed with the following error:  Allocation of size 1.718e+10 G failed, likely due to insufficient memory.  (The allocation mechanism was cudaMalloc().  The Cuda allocation returned the error code ""cudaErrorMemory\
Allocation".)
0:
0: Traceback functionality not available
0:
0:
0: Program received signal SIGABRT: Process abort signal.
0:
0: Backtrace for this error:
0: ndk HommeDynamics::create_helper_field name=                                           Qdp_dyn pack_size=1
0: ndk Field::allocate_view view_dim=  -811319296 id.name()=Qdp_dyn

The last prints are written after the error message in the log file, but clearly view_dim is out of whack -- I'm not sure yet how to print out the way it's computed.

bartgol commented 2 years ago

The view sizes seem correct. Take v_dyn, whose size is reported to be 298598400 bytes, that is, 37324800 doubles. The layout for v is (nelem,3,2,np,np,nlevs), so with ne=30 (=>nelem=5400), 72 levels, np=4, you get precisely 37324800 doubles. So v_dyn takes about 300MB. Things get quite expensive for Qdp, which is roughly 10 times larger than v_dyn.

However, IIRC, the A100s on PM have 40GB of memory. Counting just the dyn memory consumption, we have about 3.5 GB for Qdp, while the other states and the fields needed for backing up tendencies probably add up to another 4-5 GB (might be more, but not 2x more, I think). So that's <9GB for dyn. Even if phys consumes as much (which is not likely, since dyn has time levels, so uses more memory), we should not get close to filling the GPU memory just with standard FM variables.

I should point out that for the SC paper runs, on a single V100 we were able to fit hommexx-nh up to ne21 and, when turning on WorkspaceManager support, even ne30. I am not sure if WSM is turned on by default in Homme within a SCREAM build, but even if it's not, the large memory of PM's A100 GPUs should accommodate ne30, I suspect.

I will do a more careful count of our memory consumption in the morning.

P.S.: Perhaps we should add a method to the FM to quickly retrieve the currently allocated size.
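
As a quick cross-check of those numbers, here is a minimal standalone sketch (not SCREAM code; it just redoes the arithmetic, assuming the v layout quoted above and a (nelem,QTL,qsize_d,np,np,nlev) layout for Qdp, with the dimensions from Noel's debug prints):

#include <cstdint>
#include <cstdio>

int main() {
  // Dimensions taken from the debug prints above: ne30 => nelem=5400,
  // NTL=3 time levels, 2 velocity components, QTL=2, qsize_d=35, np=4, 72 levels.
  const std::int64_t nelem = 5400, NTL = 3, QTL = 2, qsize_d = 35, NP = 4, nlev = 72;

  // v_dyn, assuming layout (nelem, NTL, 2, np, np, nlev) of doubles.
  const std::int64_t v_doubles   = nelem * NTL * 2 * NP * NP * nlev;        // 37,324,800
  // Qdp_dyn, assuming layout (nelem, QTL, qsize_d, np, np, nlev) of doubles.
  const std::int64_t qdp_doubles = nelem * QTL * qsize_d * NP * NP * nlev;  // 435,456,000

  std::printf("v_dyn  : %lld doubles = %.1f MB\n",
              static_cast<long long>(v_doubles), v_doubles * 8.0 / 1e6);      // ~298.6 MB
  std::printf("Qdp_dyn: %lld doubles = %.2f GB\n",
              static_cast<long long>(qdp_doubles), qdp_doubles * 8.0 / 1e9);  // ~3.48 GB
  return 0;
}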

bartgol commented 2 years ago

The view bounds error might be caused by a WorkspaceManager set up incorrectly, causing OOB access to the m_data scratch pad.

PeterCaldwell commented 2 years ago

Thanks for the print statements, Noel, and the memory analysis, Luca. This makes me feel better. I don't think we necessarily need ne30 to fit on a single node right now; we just want to be able to run at all using a reasonable number of nodes. I think the important thing here is to figure out the view bounds error...

Why isn't the "slurm gpu change noted in #1443" on master yet? To be clear - I'm not complaining, just curious.

ndkeen commented 2 years ago

Perhaps I should have highlighted the view_dim= -811319296 printed above. I'm guessing this is integer overflow. This would be what then causes:

Kokkos failed to allocate memory for label "Qdp_dyn". Allocation using MemorySpace named "Cuda" failed with the following error: Allocation of size 1.718e+10 G failed, likely due to insufficient memory.

ndkeen commented 2 years ago

I just started experimenting with those slurm settings yesterday evening. Our current settings are the way NERSC suggests, were working fine for all other uses, and I'm not even sure why the change would make a difference. I think we should understand what's happening first. Also, this is a change that would need to go through E3SM master first.

I need to test, but we might be able to pass these args through case.submit as a temporary workaround.

I noted this on the other issue as well, but as a workaround I tested that this works: ./case.submit -a="--gpus-per-task=0 --gpu-bind=none --gpus=$np", where $np is the number of MPI tasks.

bartgol commented 2 years ago

Perhaps I should have highlighted the view_dim= -811319296 printed above. I'm guessing this is integer overflow. This would be what then causes:

Kokkos failed to allocate memory for label "Qdp_dyn". Allocation using MemorySpace named "Cuda" failed with the following error: Allocation of size 1.718e+10 G failed, likely due to insufficient memory.

Ah, yes, I didn't catch that. I suppose we should switch to 64-bit ints (or even size_t) for alloc sizes. This was a bit of an oversight on my end. I'll open a separate issue so I don't forget to fix it.
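
For the record, here is a minimal standalone sketch (not the actual Field/Kokkos allocation code) showing how a 32-bit byte count for Qdp_dyn, computed from the dimensions printed above, wraps to exactly the view_dim=-811319296 in Noel's log, while a 64-bit count does not:

#include <cstdint>
#include <cstdio>

int main() {
  // Dimensions from the HommeDynamics::set_grids print above:
  // nelem=5400, QTL=2, HOMMEXX_QSIZE_D=35, NP=4, nlev_mid=72, 8 bytes per double.
  const std::int64_t bytes64 =
      std::int64_t(5400) * 2 * 35 * 4 * 4 * 72 * 8;  // 3,483,648,000 bytes (~3.48 GB)

  // The same count squeezed into a 32-bit int wraps around
  // (on the usual two's-complement platforms):
  const std::int32_t bytes32 = static_cast<std::int32_t>(bytes64);

  std::printf("64-bit byte count: %lld\n", static_cast<long long>(bytes64)); // 3483648000
  std::printf("32-bit byte count: %d\n", bytes32);                           // -811319296
  return 0;
}

Presumably the "1.718e+10 G" in the Kokkos message is that negative value reinterpreted as an unsigned 64-bit byte count (roughly 1.7e10 GiB); keeping the size arithmetic in std::int64_t or size_t keeps the request at the intended ~3.48 GB.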

mt5555 commented 2 years ago

Regarding the memory error: from the message above, we can see that qsize_d=35, so the model is allocating space for 35 tracers (the actual number of tracers is a runtime argument and is probably 10). qsize_d could be lowered to 10. The Qdp_dyn array at NE30 with 35 tracers is about 6.5e8 numbers - so it is not overflowing the int32 index space. But the total number of bytes is overflowing int32.

So my conclusion:

PeterCaldwell commented 2 years ago

Ok, so we have the todo items of

  1. make qsize_d=10 the default for EAMxx since we always use SPA. This isn't a bugfix in itself, just using memory more efficiently... which might help us fit our runs onto fewer nodes.
  2. replace the int32 indexing with int64 or size_t for alloc sizes. This is actually a bug because the code shouldn't overflow for all reasonable problem sizes.

@bartgol - are you working on item 2? Does someone know how to do item 1? Seems trivial but I don't know where in the code this is set...

ndkeen commented 2 years ago

Regarding the errors with multiple MPIs:

:0: : block: [4891,0,0], thread: [0,32,0] Assertion `View bounds error of view Workspace.m_data` failed.

Does that ring a bell for anyone? Just not sure best way to debug.

Here is a little more of the output, with one print just before the parallel_for:

1: ndk HommeDynamics::initialize_homme_state NGP=         4 nelem=      2700 nlevs=  72 qsize=        10 npacks_mid=        72 npacks_int=        73 n0=1 n0_qdp=         0
0: ndk HommeDynamics::initialize_homme_state NGP=         4 nelem=      2700 nlevs=  72 qsize=        10 npacks_mid=        72 npacks_int=        73 n0=1 n0_qdp=         0
0: :0: : block: [2359,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [2360,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [3026,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [3027,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [3025,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [4021,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [5186,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [5758,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [14540,0,0], thread: [0,0,0] Assertion `` failed.
0: :0: : block: [16849,0,0], thread: [0,32,0] Assertion `View bounds error of view Workspace.m_data` failed.
0: :0: : block: [16849,0,0], thread: [0,33,0] Assertion `View bounds error of view Workspace.m_data` failed.
0: :0: : block: [16849,0,0], thread: [0,34,0] Assertion `View bounds error of view Workspace.m_data` failed.
0: :0: : block: [16849,0,0], thread: [0,35,0] Assertion `View bounds error of view Workspace.m_data` failed.
0: :0: : block: [16849,0,0], thread: [0,36,0] Assertion `View bounds error of view Workspace.m_data` failed.
...
0: KERNEL CHECK FAILED:
0:    !m_parent.m_active(m_ws_idx, slot)
0:
...
0: #8  0x32a7cf3 in _ZN6Kokkos4Impl25cuda_internal_error_throwE9cudaErrorPKcS3_i
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:183
0: #9  0x5c9003 in _ZN6Kokkos4Impl23cuda_internal_safe_callE9cudaErrorPKcS3_i
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Error.hpp:73
0: #10  0x32a7b31 in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:161
0: #11  0x32aaa46 in profile_fence_event<Kokkos::Cuda, Kokkos::Impl::cuda_stream_synchronize(cudaStream_t, const Kokkos::Impl::CudaInternal*, const string&)::<lambda()> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Profiling.hpp:180
0: #12  0x32a7b71 in _ZN6Kokkos4Impl23cuda_stream_synchronizeEP11CUstream_stPKNS0_12CudaInternalERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:156
0: #13  0x32a83b1 in _ZNK6Kokkos4Impl12CudaInternal5fenceERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:345
0: #14  0x277daef in ???
0: #15  0x276eb03 in ???
0: #16  0x275f1b7 in ???
0: #17  0x274d9cd in _ZN6Kokkos12parallel_forINS_10TeamPolicyIJNS_4CudaEEEE18__nv_hdl_wrapper_tILb0ELb0E11__nv_dl_tagIMN6scream13HommeDynamicsEFvvEXadL_ZNS7_22initialize_homme_stateEvEELj1EEFvRKNS_4Impl14CudaTeamMemberEEJKiN4ekat16WorkspaceManagerINSH_4PackIdLi1EEENS_6DeviceIS2_NS_9CudaSpaceEEEEEKNS_4ViewIPPPPPSK_JNS_11Layou\
tRightESN_NS_12MemoryTraitsILj0EEEEEESG_SG_KdSZ_SZ_SG_SZ_KNSP_IPPPS10_JSV_SN_SX_EEEEEEEvRKT_RKT0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNSt9enable_ifIXsrNS_19is_execution_p^@^@^@^@^@^@^@^@^@^Q^@^@^@^@^@^@^@\315\331t^B^@^@^@^@!\264\223        at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/c\
ore/src/Kokkos_Parallel.hpp:142
0: #18  0x271ecec in _ZN6scream13HommeDynamics22initialize_homme_stateEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:997
0: #19  0x27146e7 in _ZN6scream13HommeDynamics15initialize_implENS_7RunTypeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:360

I think we know it stops here, in components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:

  // Need two temporaries, for pi_mid and pi_int                                                                                                                                                                       
  ekat::WorkspaceManager<Pack,DefaultDevice> wsm(npacks_int,2,policy);
  printf("ndk HommeDynamics::initialize_homme_state NGP=%10d nelem=%10d nlevs=%4d qsize=%10d npacks_mid=%10d npacks_int=%10d n0=%d n0_qdp=%10d\n",
         NGP, nelem, nlevs, qsize, npacks_mid, npacks_int, n0, n0_qdp);
  Kokkos::parallel_for(policy, KOKKOS_LAMBDA (const KT::MemberType& team) {
    const int ie  =  team.league_rank() / (NGP*NGP);
    const int igp = (team.league_rank() / NGP) % NGP;
    const int jgp =  team.league_rank() % NGP;

    // Compute p_mid                                                                                                                                                                                                   
    auto ws = wsm.get_workspace(team);
    auto p_int = ws.take("p_int");
    auto p_mid = ws.take("p_mid");

    auto dp = ekat::subview(dp_view,ie,n0,igp,jgp);

    ColOps::column_scan<true>(team,nlevs,dp,p_int,ps0);
    team.team_barrier();
    ColOps::compute_midpoint_values(team,nlevs,p_int,p_mid);
    team.team_barrier();

    // Convert T->Theta->VTheta->VTheta*dp in place                                                                                                                                                                    
    auto T      = ekat::subview(vth_view,ie,n0,igp,jgp);
    auto vTh_dp = ekat::subview(vth_view,ie,n0,igp,jgp);
    auto qv     = ekat::subview(Q_view,ie,0,igp,jgp);
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team,npacks_mid),
                         [&](const int ilev) {
      const auto th = PF::calculate_theta_from_T(T(ilev),p_mid(ilev));
      vTh_dp(ilev) = PF::calculate_virtual_temperature(th,qv(ilev))*dp(ilev);
    });
    team.team_barrier();

    // Init geopotential                                                                                                                                                                                               
    auto dphi   = [&](const int ilev)->Pack {
      return EOS::compute_dphi(vTh_dp(ilev), p_mid(ilev));
    };
    auto phi_int = ekat::subview(phi_int_view,ie,n0,igp,jgp);
    ColOps::column_scan<false>(team,nlevs,dphi,phi_int,phis_dyn_view(ie,igp,jgp));

    // Release the scratch mem                                                                                                                                                                                         
    ws.release(p_int);
    ws.release(p_mid);
  });
mt5555 commented 2 years ago

I was thinking of Fortran, where allocate takes the number of doubles - in which case the int32 indexing with 10 tracers won't overflow until we get to 40,000 elements per GPU. But I think malloc() in C++ code allocates bytes, which overflows int32 around 5400 elements.

So I think my conclusions above were wrong: we are not running out of memory on 1 node, it really is just an integer overflow.

bartgol commented 2 years ago

Ok, so we have the todo items of

  1. make qsize_d=10 the default for EAMxx since we always use SPA. This isn't a bugfix in itself, just using memory more efficiently... which might help us fit our runs onto fewer nodes.
  2. replace the int32 indexing with int64 or size_t for alloc sizes. This is actually a bug because the code shouldn't overflow for all reasonable problem sizes.

@bartgol - are you working on item 2? Does someone know how to do item 1? Seems trivial but I don't know where in the code this is set...

  1. It's a simple mod to config_component.xml. But we have to be sure we are not using too few, or Homme will crap out. Does anyone know the exact (maybe +/-2) default number of tracers when p3/shoc/spa/rrtmgp are all present? Once I have that, I can do the 1-line change.
  2. Yeah, I'm on it.
mt5555 commented 2 years ago

the number of tracers is determined by CIME and passed to the build script. We can hard code it for now of course, but it would be good to put that on the todo list.

PeterCaldwell commented 2 years ago

So Mark - you're saying that we should add code to CIME which figures out based on our compset how many tracers we have and stuffs that info into config_component.xml? @jgfouca - do you know how this is done in EAMf90 (and can you add implementing this in EAMxx as a low-priority task)?

I think the advected constituents are

  1. qv
  2. qc
  3. nc
  4. qr
  5. nr
  6. qi
  7. ni
  8. qm
  9. bm
  10. tke

Am I missing any? @bogensch , do you know?

mt5555 commented 2 years ago

what I wrote is probably incorrect: it's EAM's "configure" that determines the number of tracers based on the features that are turned on/off and command line options to add test tracers. So not part of CIME, and completely up to SCREAM's equivalent of "configure".

bartgol commented 2 years ago

what I wrote is probably incorrect: it's EAM's "configure" that determines the number of tracers based on the features that are turned on/off and command line options to add test tracers. So not part of CIME, and completely up to SCREAM's equivalent of "configure".

Yeah, that's simple to do then.

@PeterCaldwell aren't there any rrtmgp tracers? I thought there were some, but maybe I'm confusing them with static (optical) air properties...

bogensch commented 2 years ago

@PeterCaldwell I just verified in an atm.log* that those are in fact the only 10 advected constituents. (Screenshot of the atm.log tracer list attached.)

PeterCaldwell commented 2 years ago

FYI, #1483 fixed Noel's bounds indexing problem, but I still get the same OOM errors on quartz that I've always gotten (with no stack trace or other indication of where or what is going wrong)...

ndkeen commented 2 years ago

Peter: to judge where you are running out of memory, what is the last line of e3sm.log before error messages? For me, it's:

0:    rearth: 6376000.000000
0:
0: **********************************************************

Well, I guess I know a little more -- it makes it to the function HommeDynamics::initialize_homme_state, as I have a print statement there.

PeterCaldwell commented 2 years ago

Good point. I'm getting slurmstepd: error: Detected 118 oom-kill event(s) in StepId=8886583.0 cgroup just after printing the physics state after step 0. In particular, the last lines before error messages are

     TBOT=   0.217437398823339E+03  0.309197140715885E+03
      ps=   0.549327413032868E+05  0.103836960556827E+06  0.851954916819729E+10
      M =   0.100468080169703E+05 kg/m^2  0.985206069036937E+05mb    

I'm unclear whether this is before or after the "rearth" line that Noel ends on. I think rearth should be part of the "CXX Simulation Parameters" dump that I do get through, but rearth isn't printed for me for some reason.

Also, does anyone know what StepID means? This is failing on the first timestep... I wish we could run 8886583 steps!

bartgol commented 2 years ago

@PeterCaldwell do you know how much memory quartz nodes have?

PeterCaldwell commented 2 years ago

128 GB/node. I get the same OOM failure when I use 30 nodes (5 elements/node), so I don't think this is a legit lack-of-memory problem (though I don't know what memory high-water mark per element to expect). If Noel doesn't get OOMs running ne30 on a single PM node, I would think we can't use that much memory...

bartgol commented 2 years ago

Oh, gosh, no, we don't use that much memory. We would be in serious trouble otherwise.

It could be related to uninitialized memory though. Some uninitialized integer ends up causing a ridiculous size to be used in an allocation. We do have some valgrind fails reported in the nightlies, so there may be a connection. I'll check.

PeterCaldwell commented 2 years ago

Update: Running the full model at ne30 in standalone mode rather than the CIME build (without any parallelism at all!) gets further than my OOM point noted above. The last few lines of e3sm.log are:

TBOT=   0.217437398823339E+03  0.309197140715885E+03
      ps=   0.549327413032868E+05  0.103836960556827E+06  0.851954916819729E+10
      M =   0.100468080169703E+05 kg/m^2  0.985206069036937E+05mb     
WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolationStart time stepping loop...       [  0%]
Atmosphere step = 0; model time = 2021-10-12 12:30:00
Negative (or nan) layer thickness detected, aborting!
Exiting...
PeterCaldwell commented 2 years ago

After changing the timestep from 30 min to 5 min in my standalone full-model ne30 test on quartz, I'm able to run 1 step before hitting a segfault in some sort of radiation interpolation. X11 itself segfaulted before I could analyze the stack trace!

ndkeen commented 2 years ago

With a "black magic" change Luca sent me to WSM in components/scream/src/dynamics/homme/atmosphere_dynamics.cpp, the run now goes beyond where it was assert-failing above. I now see the following with GPU DEBUG and 1 MPI:

0:       M =   0.566597937394141E+04 kg/m^2  0.555615002975693E+05mb
0: PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Start+count exceeds dimension bound (/pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pio_darray_int.c: 1504)
0: Obtained 10 stack frames.
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x24e10e1]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x24e12ac]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x24e163d]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x25274cc]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x2521c6c]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x24c0668]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x24c4b69]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x26d06cc]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x26dafce]
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30.F2000SCREAMv1.ne30_ne30.s14-mar12.gnugpu.2s.n001a1x1.DEBUG.Hremap512.K0def.G4.WSM/bld/e3sm.exe() [0x26dab26]
0: MPICH Notice [Rank 0] [job id 1568580.0] [Thu Mar 17 16:13:38 2022] [nid001241] - Abort(-1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
0:
0: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
0:
0: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0:
0: Backtrace for this error:
0: #0  0x14cd42cad3df in ???
0: #1  0x276480f in _ZNKSt14default_deleteIN5Homme4Impl11holder_baseEEclEPS2_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/unique_ptr.h:85
0: #2  0x275135f in _ZNSt10unique_ptrIN5Homme4Impl11holder_baseESt14default_deleteIS2_EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/unique_ptr.h:361
0: #3  0x2740e49 in _ZN5Homme3anyD2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/utilities/StdMeta.hpp:68
0: #4  0x2790de3 in _ZNSt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5Homme3anyEED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_pair.h:211
0: #5  0x2790e0f in _ZN9__gnu_cxx13new_allocatorISt13_Rb_tree_nodeISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5Homme3anyEEEE7destroyISC_EEvPT_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/ext/new_allocator.h:156
0: #6  0x2788ec7 in _ZNSt16allocator_traitsISaISt13_Rb_tree_nodeISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5Homme3anyEEEEE7destroyISB_EEvRSD_PT_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/alloc_traits.h:531
0: #7  0x27805da in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE15_M_destroy_nodeEPSt13_Rb_tree_nodeISA_E
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_tree.h:646
0: #8  0x27726e6 in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE12_M_drop_nodeEPSt13_Rb_tree_nodeISA_E
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_tree.h:654
0: #9  0x29d4dff in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE8_M_eraseEPSt13_Rb_tree_nodeISA_E
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_tree.h:1921
0: #10  0x29d4d6b in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE5clearEv
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_tree.h:1261
0: #11  0x29d4c21 in _ZNSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5Homme3anyESt4lessIS5_ESaISt4pairIKS5_S7_EEE5clearEv
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_map.h:1134
0: #12  0x29d4697 in _ZN5Homme7Context5clearEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/Context.cpp:12
0: #13  0x29d470e in _ZN5Homme7Context18finalize_singletonEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/Context.cpp:21
0: #14  0x27ae294 in _ZN6scream26DynamicsDrivenGridsManagerD2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/dynamics/homme/dynamics_driven_grids_manager.cpp:74
0: #15  0x5f0d68 in ???
0: #16  0x5f0688 in ???
0: #17  0x5f0296 in ???
0: #18  0x5d4d46 in _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:158
0: #19  0x5cee94 in _ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:733
0: #20  0x5cbcc1 in _ZNSt12__shared_ptrIN6scream12GridsManagerELN9__gnu_cxx12_Lock_policyE2EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:1183
0: #21  0x5cbcdd in _ZNSt10shared_ptrIN6scream12GridsManagerEED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr.h:121
0: #22  0x5f081d in ???
0: #23  0x5f0859 in ???
0: #24  0x5f0448 in ???
0: #25  0x5ef4be in ???
0: #26  0x5d4d46 in _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:158
0: #27  0x5cee94 in _ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:733
0: #28  0x5da6db in ???
0: #29  0x5da713 in ???
0: #30  0x5e31c3 in ???
0: #31  0x5e31eb in ???
0: #32  0x5ef733 in ???
0: #33  0x5d4d46 in _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:158
0: #34  0x5cee94 in _ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:733
0: #35  0x5da5f7 in _ZNSt12__shared_ptrIN4ekat3any11holder_baseELN9__gnu_cxx12_Lock_policyE2EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:1183
0: #36  0x5e1c3f in _ZNSt10shared_ptrIN4ekat3any11holder_baseEED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr.h:121
0: #37  0x5e1c5b in _ZN4ekat3anyD2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/src/ekat/std_meta/ekat_std_any.hpp:25
0: #38  0x5e2d3d in ???
0: #39  0x5e2d5d in ???
0: #40  0x5dfd5c in ???
0: #41  0x5dd500 in ???
0: #42  0x5d9f84 in ???
0: #43  0x5d5fdb in ???
0: #44  0x5d5fb8 in ???
0: #45  0x5cfe7f in ???
0: #46  0x5cddc7 in ???
0: #47  0x5cddff in ???
0: #48  0x14cd42caff77 in ???
0: #49  0x14cd42caffc9 in ???
0: #50  0x14cd46468e4c in ???
0: #51  0x14cd45f277a5 in ???
0: #52  0x14cd448b2f27 in ???
0: #53  0x24e12ba in piodie
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pioc_support.c:426
0: #54  0x24e163c in check_netcdf
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pioc_support.c:547
0: #55  0x25274cb in pio_read_darray_nc_serial
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pio_darray_int.c:1504
0: #56  0x2521c6b in PIOc_read_darray
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pio_darray.c:2187
0: #57  0x24c0667 in read_darray_internal_double
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/flib/piodarray.F90.in:415
0: #58  0x24c4b68 in __piodarray_MOD_read_darray_1d_double
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/flib/piodarray.F90.in:390
0: #59  0x26d06cb in __scream_scorpio_interface_MOD_grid_read_darray_1d
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scream_scorpio_interface.F90:1366
0: #60  0x26dafcd in grid_read_data_array_c2f
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scream_scorpio_interface_iso_c2f.F90:311
0: #61  0x26dab25 in _ZN6scream7scorpio20grid_read_data_arrayERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_iPv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scream_scorpio_interface.cpp:133
0: #62  0x26e7074 in _ZN6scream15AtmosphereInput14read_variablesEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scorpio_input.cpp:235
0: #63  0x3147428 in _ZN6scream3spa12SPAFunctionsIdN6Kokkos6DeviceINS2_4CudaENS2_9CudaSpaceEEEE25update_spa_data_from_fileERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiiiRNS7_14SPAHorizInterpERNS7_7SPADataE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/spa/spa_functions_impl.hpp:576
0: #64  0x3143795 in _ZN6scream3spa12SPAFunctionsIdN6Kokkos6DeviceINS2_4CudaENS2_9CudaSpaceEEEE20update_spa_timestateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiiRKNS_4util9TimeStampERNS7_14SPAHorizInterpERNS7_12SPATimeSta\
teERNS7_7SPADataESP_
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/spa/spa_functions_impl.hpp:697
0: #65  0x313a212 in _ZN6scream3SPA15initialize_implENS_7RunTypeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp:165
0: #66  0x318c598 in _ZN6scream17AtmosphereProcess10initializeERKNS_4util9TimeStampENS_7RunTypeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:20
0: #67  0x319c706 in _ZN6scream22AtmosphereProcessGroup15initialize_implENS_7RunTypeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:139
0: #68  0x318c598 in _ZN6scream17AtmosphereProcess10initializeERKNS_4util9TimeStampENS_7RunTypeE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:20
0: #69  0x2565a52 in _ZN6scream7control16AtmosphereDriver20initialize_atm_procsEv
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/control/atmosphere_driver.cpp:760
0: #70  0x5c2762 in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:194
0: #71  0x5c7408 in fpe_guard_wrapper<scream_init_atm(int const&, int const&)::<lambda()> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:50
0: #72  0x5c27bc in scream_init_atm
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:170
0: #73  0x5bf30b in __atm_comp_mct_MOD_atm_init_mct
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/atm_comp_mct.F90:152
0: #74  0x444279 in __component_mod_MOD_component_init_cc
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/component_mod.F90:248
0: #75  0x42c03f in __cime_comp_mod_MOD_cime_init
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_comp_mod.F90:1425
0: #76  0x43d44b in cime_driver
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_driver.F90:122
0: #77  0x43d587 in main
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/driver-mct/main/cime_driver.F90:23
0: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation
srun: error: nid001241: task 0: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=1568580.0

Trying again with CPU-only, I get the same error. And when I try with AQUA planet (using the correct netCDF files), I also see the same error with the CPU build.

bartgol commented 2 years ago

That portion of the stack is just coming from the stack unwinding (with everything being destroyed). It might be helpful to see the whole stack at this point.

However, given that frames <=14 are during the stack unwind, I would assume 15 and up are where the error happens. But 15 is ???, which is odd and not helpful. Maybe higher stack levels show more? If they don't, this might be the stack from an exception caught while unwinding the stack of another exception, which makes debugging harder. :/

PeterCaldwell commented 2 years ago

Cool! And if you're "trying again with CPU-only", does this mean you've gotten this far with GPU and ne30?

My OOMs are at the same point - just after the "M = ..." line. If you look above this chunk of output, it says timestep = 0, right?

ndkeen commented 2 years ago

I edited the above comment to a) show more of the stack from the GPU DEBUG run, as Luca requested, and b) show that I actually get the same error with a CPU DEBUG attempt (previously I had made a mistake).

Peter: there is a line in the output, 0: nstep= 0 time= 0.0000000000000000 [s], but I think where this is stopping is still in init. Could it be the SPA file? Can we try to run without SPA?

bartgol commented 2 years ago

Recording this here. I ran

./scripts/create_test  SMS_D_Ld1_P180x1.ne30_ne30.F2000SCREAMv1 --no-build --compiler gnu9

on mappy. I did the input mods suggested in this confluence page, and grabbed the ne30 IC files from Perlmutter (they are not on the ANL server, so case.submit could not fetch them). I had to xmlchange the PE layout, since it was defaulting to 180 tasks; I changed it to 64. I also got

e3sm.exe: /some/path/PpmRemap.hpp:535: Homme::Remap::Ppm::PpmVertRemap<boundaries>::compute_partitions(...<very long stuff>): Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
PeterCaldwell commented 2 years ago

Luca - maybe you already know this, but your error is the same one Noel got at the very top of this thread.

PeterCaldwell commented 2 years ago

Just as an FYI, when I run the full model standalone case in a debugger, I see that it is failing at the start of timestep 1 in interpolation within compute_gas_taus in rrtmgp_sw.

When I run the full standalone model test without spa, I also get fails but I'm not sure they're at the same place.

ndkeen commented 2 years ago

If I try 1 MPI on CPU or GPU, I see the PIO Start+count exceeds error. When I run with 2 or more MPIs on CPU or GPU (still 1 node), then I can repeat the error that I originally found, which is the same as what Luca sees. I'm not sure why I see a different error with 1 MPI.

With 1 MPI, I know that it is failing while reading: /global/cfs/cdirs/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_scream_cdf5.nc

and I see that it reads the vars hyam, hybm, and PS, then fails trying to read CCN3 with the Start+count exceeds dimension bound error.

With 2 MPIs, it reads the same file and completes all of the vars: hyam, hybm, PS, CCN3, AER_G_SW, AER_SSA_SW, AER_TAU_SW, AER_TAU_LW (in fact rank 0 reads these vars twice) before hitting the assert in compute_partitions(). So it clearly gets further with 2 MPIs, which doesn't make sense to me.

I wonder if, in serial, there is the largest chance of integer overflow on any sort of size; with 2 MPIs, that might be enough to avoid it.

For the 2-MPI case, the assert fail is here:

        // This is here to allow an entire block of k                                                                                                                                                 
        // threads to run in the remapping phase. It makes                                                                                                                                            
        // sure there's an old interface value below the                                                                                                                                              
        // domain that is larger.                                                                                                                                                                     
        assert(fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) -
                    m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0);

but I'm not sure how to debug further.

ndkeen commented 2 years ago

I tried to run some cases without SPA by modifying the YAML input/output files. With 2 MPIs I still see the same assert failure as above. And with 1 MPI, I now see the same assert failure. So my issue with reading in the SPA file with 1 MPI can be avoided by turning off SPA, but clearly the assert error is not SPA-related.

0: Atmosphere step = 0; model time = 0001-01-01 00:00:00
0: e3sm.exe: /pscratch/sd/n/ndk/wacmy/s14-mar12/components/homme/src/share/cxx/PpmRemap.hpp:539: 
Homme::Remap::Ppm::PpmVertRemap<boundaries>::compute_partitions<Homme::Remap::Ppm::PpmLimitedExtrap>::<lambda(const int&)>::<lambda()>: 
Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.

Note: to run without SPA, I edited run/data/scream_input.yaml.

Change this:

Atmosphere Driver:
  Atmosphere Processes:
    Number of Entries: 6
    Schedule Type: Sequential
    Process 0:
      Process Name: Homme
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Dynamics
      Vertical Coordinate Filename: foo.nc
      Moisture: moist
    Process 1:
      Process Name: SHOC
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Physics GLL
    Process 2:
      Process Name: CldFraction
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Physics GLL
    Process 3:
      Process Name: SPA
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Physics GLL
      SPA Remap File: none
      SPA Data File: /global/cfs/cdirs/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_scream_cdf5.nc
    Process 4:
      Process Name: P3
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Physics GLL
    Process 5:

to this:

Atmosphere Driver:
  Atmosphere Processes:
    Number of Entries: 5
    Schedule Type: Sequential
    Process 0:
      Process Name: Homme
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Dynamics
      Vertical Coordinate Filename: foo.nc
      Moisture: moist
    Process 1:
      Process Name: SHOC
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Physics GLL
    Process 2:
      Process Name: CldFraction
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Physics GLL
    Process 3:
      Process Name: P3
      Enable Output Fields Checks: false
      Enable Input Fields Checks: false
      Grid: Physics GLL
    Process 4:

and also initialize some variables to zero; in the same input file, change this:

  Initial Conditions:
    Physics GLL:
      Filename: blah.nc
      T_mid_prev: T_mid
      horiz_winds_prev: horiz_winds
      w_int: 0.0
      w_int_prev: 0.0

to this:

  Initial Conditions:
    Physics GLL:
      Filename: blah.nc
      T_mid_prev: T_mid
      horiz_winds_prev: horiz_winds
      w_int: 0.0
      w_int_prev: 0.0
      aero_g_sw: 0.0
      aero_ssa_sw: 0.0
      aero_tau_sw: 0.0
      aero_tau_lw: 0.0

And you may need to remove output vars in run/data/scream_output.yaml:

Remove this:

      - aero_g_sw
      - aero_ssa_sw
      - aero_tau_lw
      - aero_tau_sw
ndkeen commented 2 years ago

It turns out these ne30 attempts default to 72 vertical levels, while some of the input files we were trying to read have 128 vertical levels.

login31% grep SCREAM_NUM_VERTICAL_LEV env_build.xml 
    <entry id="SCREAM_CMAKE_OPTIONS" value="SCREAM_NP 4 SCREAM_NUM_VERTICAL_LEV 72 SCREAM_NUM_TRACERS 35">

A temporary location for a 72-level file is here (temporary, as Ben may need to make some changes): /global/cfs/cdirs/e3sm/bhillma/scream/data/init/screami_ne30np4L72_20220317.nc

It should only be used with the non-aqua-planet compset, i.e. --compset F2000SCREAMv1.

This file would replace the two entries in scream_input.yaml that point to: /global/cfs/cdirs/e3sm/inputdata/atm/scream/init/homme_shoc_cld_spa_p3_rrtmgp_init_ne30np4.nc
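
For anyone else hitting this, a quick way to sanity-check the level count of an input file against SCREAM_NUM_VERTICAL_LEV is a few lines of the NetCDF C API (a sketch only; it assumes the vertical dimension in the file is named "lev"):

#include <netcdf.h>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc < 2) { std::printf("usage: %s <file.nc>\n", argv[0]); return 1; }
  int ncid, dimid;
  size_t nlev = 0;
  // Open read-only and look up the vertical dimension (assumed to be "lev").
  if (nc_open(argv[1], NC_NOWRITE, &ncid) != NC_NOERR) {
    std::printf("cannot open %s\n", argv[1]);
    return 1;
  }
  if (nc_inq_dimid(ncid, "lev", &dimid) == NC_NOERR) {
    nc_inq_dimlen(ncid, dimid, &nlev);
    std::printf("%s: lev = %zu (this build expects SCREAM_NUM_VERTICAL_LEV = 72)\n", argv[1], nlev);
  } else {
    std::printf("%s: no dimension named 'lev'\n", argv[1]);
  }
  nc_close(ncid);
  return 0;
}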

ndkeen commented 2 years ago

With the new 72-level file, my CPU-only, DEBUG build, non-aqua compset, 2-MPI test now fails here:

0:       M =   0.100468080169703E+05 kg/m^2  0.985206069036937E+05mb     
0: 
0: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0: 
0: Backtrace for this error:
0: #0  0x146ce589b3df in ???
0: #1  0x2f00939 in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp:476
0: #2  0x2f067b9 in parallel_for<int, scream::RRTMGPRadiation::run_impl(int)::<lambda(const MemberType&)>::<lambda(int const&)>, Kokkos::Impl::HostThreadTeamMember<Kokkos::Serial> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/impl/Kokkos_HostThreadTeam.hpp:842
0: #3  0x2f01596 in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp:473
0: #4  0x2f08d8e in exec<void>
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Serial.hpp:951
0: #5  0x2f079ed in execute
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Serial.hpp:979
0: #6  0x2f068e4 in parallel_for<Kokkos::TeamPolicy<Kokkos::Serial>, scream::RRTMGPRadiation::run_impl(int)::<lambda(const MemberType&)> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/ekat/extern/kokkos/core/src/Kokkos_Parallel.hpp:142
0: #7  0x2f04968 in _ZN6scream15RRTMGPRadiation8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp:417
0: #8  0x2f9f225 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #9  0x2fae8b7 in _ZN6scream22AtmosphereProcessGroup14run_sequentialEd
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:158
0: #10  0x2fae7c7 in _ZN6scream22AtmosphereProcessGroup8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:145
0: #11  0x2f9f225 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #12  0x25591c6 in _ZN6scream7control16AtmosphereDriver3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/control/atmosphere_driver.cpp:802
0: #13  0x5c15eb in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:204
0: #14  0x5c1f15 in fpe_guard_wrapper<scream_run(const Real&)::<lambda()> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:50
0: #15  0x5c160e in scream_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:201
0: #16  0x5bd6a4 in __atm_comp_mct_MOD_atm_run_mct
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/atm_comp_mct.F90:209
0: #17  0x43ec7c in __component_mod_MOD_component_run
PeterCaldwell commented 2 years ago

Ok, this is at least approximately the same place I'm seeing failures. I'm envious of your stack traces - I have to run a debugger to get that level of detail!

PeterCaldwell commented 2 years ago

And this is occurring in the first timestep, right?

ndkeen commented 2 years ago

I see Atmosphere step = 0; model time = 0001-01-01 00:00:00 printed, so I'm assuming I'm in the first step.

The 1-MPI test without SPA ran out of time (after 30 min), so I resubmitted it for longer; it was also still at Atmosphere step = 0; model time = 0001-01-01 00:00:00. The resubmitted job needed only 31.5 minutes to hit the following error, which is similar to the stack trace above but may be slightly different.

0:       M =   0.100468080169703E+05 kg/m^2  0.985206069036937E+05mb
0:
0: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0:
0: Backtrace for this error:
0: #0  0x150452b8d3df in ???
0: #1  0x2f21233 in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp:348
0: #2  0x2f24f7d in parallel_for_cpu_serial<scream::rrtmgp::rrtmgp_sw(int, int, GasOpticsRRTMGP&, real2d&, real2d&, real2d&, real2d&, GasConcs&, real2d&, real2d&, real1d&, OpticalPr\
ops2str&, OpticalProps2str&, FluxesByband&, bool)::<lambda(int, int, int)> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/YAKL/YAKL_parallel_for_fortran.h:662
0: #3  0x2f24984 in parallel_for<scream::rrtmgp::rrtmgp_sw(int, int, GasOpticsRRTMGP&, real2d&, real2d&, real2d&, real2d&, GasConcs&, real2d&, real2d&, real1d&, OpticalProps2str&, O\
pticalProps2str&, FluxesByband&, bool)::<lambda(int, int, int)>, 3, false>
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/YAKL/YAKL_parallel_for_fortran.h:880
0: #4  0x2f21f0d in _ZN6scream6rrtmgp9rrtmgp_swEiiR15GasOpticsRRTMGPRN4yakl5ArrayIdLi2ELi1ELi2EEES6_S6_S6_R8GasConcsS6_S6_RNS4_IdLi1ELi1ELi2EEER16OpticalProps2strSC_R12FluxesBybandb
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp:346
0: #5  0x2f20904 in _ZN6scream6rrtmgp11rrtmgp_mainEiiRN4yakl5ArrayIdLi2ELi1ELi2EEES4_S4_S4_R8GasConcsS4_S4_RNS2_IdLi1ELi1ELi2EEES4_S4_S4_S4_RNS2_IdLi3ELi1ELi2EEESA_SA_SA_S4_S4_S4_S4\
_S4_SA_SA_SA_SA_SA_b
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp:246
0: #6  0x2f0504d in _ZN6scream15RRTMGPRadiation8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp:537
0: #7  0x2f9f2e5 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #8  0x2fae977 in _ZN6scream22AtmosphereProcessGroup14run_sequentialEd
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:158
0: #9  0x2fae887 in _ZN6scream22AtmosphereProcessGroup8run_implEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process_group.cpp:145
0: #10  0x2f9f2e5 in _ZN6scream17AtmosphereProcess3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/atm_process/atmosphere_process.cpp:46
0: #11  0x2559286 in _ZN6scream7control16AtmosphereDriver3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/control/atmosphere_driver.cpp:802
0: #12  0x5c15eb in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:204
0: #13  0x5c1f15 in fpe_guard_wrapper<scream_run(const Real&)::<lambda()> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:50
0: #14  0x5c160e in scream_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:201
0: #15  0x5bd6a4 in __atm_comp_mct_MOD_atm_run_mct
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/atm_comp_mct.F90:209

In components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp

            printf("ndk rrtmgp_sw() nbnd=%d nlay=%d ncol=%d ngpt=%d ngas=%d\n", nbnd, nlay, ncol, ngpt, ngas);
            printf("ndk rrtmgp_sw() totElems() flux_up=%d flux_dn=%d flux_dn_dir=%d\n", flux_up.totElems(), flux_dn.totElems(), flux_dn_dir.totElems());
            printf("ndk rrtmgp_sw() totElems() bnd_flux_up=%d bnd_flux_dn=%d bnd_flux_dn_dir=%d\n", bnd_flux_up.totElems(), bnd_flux_dn.totElems(), bnd_flux_dn_dir.totElems());

            parallel_for(Bounds<3>(nbnd,nlay+1,ncol), YAKL_LAMBDA(int ibnd, int ilev, int icol) {
                bnd_flux_up    (icol,ilev,ibnd) = 0;
                bnd_flux_dn    (icol,ilev,ibnd) = 0; // ndk fails here                                                                                                                                                 
                bnd_flux_dn_dir(icol,ilev,ibnd) = 0;
            });

on rank 0 (out of 4 MPI's, in this case where I added the prints):

0: ndk rrtmgp_sw() nbnd=14 nlay=72 ncol=12367 ngpt=224 ngas=8
0: ndk rrtmgp_sw() totElems() flux_up=902791 flux_dn=902791 flux_dn_dir=902791
0: ndk rrtmgp_sw() totElems() bnd_flux_up=12639074 bnd_flux_dn=12639074 bnd_flux_dn_dir=12639074

The number of elements in those containers seems OK: 12367 x 73 x 14 = 12,639,074 and 12367 x 73 = 902,791, which match the totElems values printed above.

One thing I noticed is that we need a larger type to handle the memory sizes in RRTMGPRadiation::init_buffers, though it may only be used for an assert:

0: ndk RRTMGPRadiation::init_buffers mymem=   395814688 after 2d
0: ndk RRTMGPRadiation::init_buffers mymem=   594499664 after 3d arrays
0: ndk RRTMGPRadiation::init_buffers mymem= -2110987824 after 3d nswbands
0: ndk RRTMGPRadiation::init_buffers mymem= -1202713648 after 3d nlwbands
0: ndk RRTMGPRadiation::init_buffers mymem= -1191826800 after 3d surf alb
0: ndk RRTMGPRadiation::init_buffers mymem=   431868816 after 3d aero -- all
0: ndk RRTMGPRadiation::init_buffers used_mem=431868816

But maybe not, since this overflowed used_mem value is still equal to requested_buffer_size_in_bytes(). The assert was not tripped because both values use int and overflow in the same way.

  int used_mem = (reinterpret_cast<Real*>(mem) - buffer_manager.get_memory())*sizeof(Real);
  EKAT_REQUIRE_MSG(used_mem==requested_buffer_size_in_bytes(), "Error! Used memory != requested memory for RRTMGPRadiation.");
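
For reference, a minimal sketch of a wider-type version (assuming the same names as the snippet above; requested_buffer_size_in_bytes() would need the same widening, otherwise both sides still wrap identically and the check stays silent):

// Accumulate the byte count in a 64-bit type so it cannot wrap past 2 GiB.
const long long used_mem =
    (reinterpret_cast<Real*>(mem) - buffer_manager.get_memory()) *
    static_cast<long long>(sizeof(Real));
EKAT_REQUIRE_MSG(used_mem == requested_buffer_size_in_bytes(),
                 "Error! Used memory != requested memory for RRTMGPRadiation.");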
ndkeen commented 2 years ago

When I try on the GPU, again with 1 MPI and no SPA:

0: Create Pool
0: Create Pool
0: Create Pool
0: Create Pool
0: Create Pool
0: Create Pool
0: ERROR: Trying to allocate 140725987852512 bytes, but the current pool is too small, and growSize is only 1073741824 bytes. Thus, the allocation will never fit in pool memory.
0: You need to increase GATOR_GROW_MB and probably GATOR_INITIAL_MB as well

Looks like it's in about the same location as the non-GPU case. (140725987852512 bytes is roughly 128 TiB, so that request looks like a garbage or overflowed size rather than a real allocation.)

ndkeen commented 2 years ago

If I make changes to use a larger integer type for all memory sizes, I am able to run 2 steps of ne30! This is still without SPA and CPU-only, but I tested DEBUG and optimized builds with 1 and 4 MPI's. With the GPU, I still get an error regarding memory size and the gator pool, like in the comment above.

0: ERROR: Trying to allocate 1823588352 bytes, but the current pool is too small, and growSize is only 1073741824 bytes. Thus, the allocation will never fit in pool memory.
0: You need to increase GATOR_GROW_MB and probably GATOR_INITIAL_MB as well

I created an issue about the integer overflow of size variables here: https://github.com/E3SM-Project/scream/issues/1492

PeterCaldwell commented 2 years ago

Wow @ndkeen , you're my hero! "2 steps at ne30" is music to my ears... Thanks so much for your effort on this...

Are you saying that running with SPA still doesn't work, or that you haven't tried it yet?

ndkeen commented 2 years ago

Right, I just hadn't tried it. Trying now...

With SPA I see:

0: IMEX max iterations, error:  0  0.000000000000000E+00  0.000000000000000E+00
0: PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: One or more variable sizes violate format constraints (/pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pioc_support.c: 3373)

Adding some prints, I see it has trouble in the PIO_enddef call:

0: ndk check_netcdf() file=SCREAMv1_output_hydrostatic.INSTANT.Steps_x2.np8.0001-01-01.001000.nc status=0
0:  ndk calling PIO_enddef with file=SCREAMv1_output_hydrostatic.INSTANT.Steps_x2.np8.0001-01-01.001000.nc or=SCREAMv1_output_hydrostatic.INSTANT.Steps_x2.np8.0001-01-01.001000.nc
0: ndk check_netcdf() file=SCREAMv1_output_hydrostatic.INSTANT.Steps_x2.np8.0001-01-01.001000.nc status=-62
0: PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: One or more variable sizes violate format constraints (/pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pioc_support.c: 3376)
0: Obtained 10 stack frames.
0: /pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s14-mar12/f30cpu.F2000SCREAMv1.ne30_ne30.s14-mar12.gnu.2s.n002a4x8.DEBUG.Hremap512.K0def.WSM.L72ic/bld/e3sm.exe() [0x24e7841]

...
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pioc_support.c:426
0: #58  0x24e7dd5 in check_netcdf
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pioc_support.c:550
0: #59  0x24ef9ac in pioc_change_def
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pioc_support.c:3376
0: #60  0x250d483 in PIOc_enddef
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/clib/pio_nc.c:2601
0: #61  0x243e232 in __pio_nf_MOD_enddef_id
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/flib/pio_nf.F90:654
0: #62  0x243e252 in __pio_nf_MOD_enddef_desc
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/externals/scorpio/src/flib/pio_nf.F90:638
0: #63  0x26b2724 in __scream_scorpio_interface_MOD_eam_pio_enddef
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scream_scorpio_interface.F90:262
0: #64  0x26b315c in eam_pio_enddef_c2f
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scream_scorpio_interface_iso_c2f.F90:258
0: #65  0x26b2dd7 in _ZN6scream7scorpio14eam_pio_enddefERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scream_scorpio_interface.cpp:123
0: #66  0x26b63ff in _ZN6scream13OutputManager3runERKNS_4util9TimeStampE
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/share/io/scream_output_manager.cpp:229
0: #67  0x256c86d in _ZN6scream7control16AtmosphereDriver3runEi
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/control/atmosphere_driver.cpp:809
0: #68  0x5c1a4b in operator()
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:204
0: #69  0x5c2375 in fpe_guard_wrapper<scream_run(const Real&)::<lambda()> >
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:50
0: #70  0x5c1a6e in scream_run
0:      at /pscratch/sd/n/ndk/wacmy/s14-mar12/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:201
0: #71  0x5bdb04 in __atm_comp_mct_MOD_atm_run_mct
...
ndkeen commented 2 years ago

I noticed that the output files written by scream are in classic format, which might cause an issue. I'm not sure exactly what the above error is about, but if I comment out a section of the scream input so it does NOT write output, I can run 2 steps. I created https://github.com/E3SM-Project/scream/issues/1494

PeterCaldwell commented 2 years ago

So it sounds like if we don't write output we can run ne30 (at least on CPUs), and we know SCREAMv1 is somehow defaulting to 32-bit NetCDF "classic", which could very well be the problem (though it seems like ne30 shouldn't produce datasets of overly large size???). But we don't know how to convince v1 to write 64-bit data to check this. @AaronDonahue made a GitHub issue about this (#1495), but it would be nice to get this fixed asap...