E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Problems running without SPA in recent scream master #1625

Closed ndkeen closed 2 years ago

ndkeen commented 2 years ago

With ne120, I was hitting an error running without SPA that looked like the same issue we see when we do not set the aero_ vars to 0. However, when I try to make a simple reproducer with ne4, I get a different issue. I may be doing something else wrong, so I'm just trying to document it a little.

I'm using master as of May 9th and trying SMS_D_Ld1_P4x1.ne4_ne4.F2000SCREAMv1.perlmutter_gnugpu (the test will hit a runtime error because the team size is too large, but that's a different issue).

I then modify namelist_scream.xml like so:

login20% diff namelist_scream.xml-orig namelist_scream.xml-nospa 
13a14,17
>     <aero_g_sw>0.0</aero_g_sw>
>     <aero_ssa_sw>0.0</aero_ssa_sw>
>     <aero_tau_sw>0.0</aero_tau_sw>
>     <aero_tau_lw>0.0</aero_tau_lw>
87c91
<       <atm_procs_list type="string">(shoc,cldFraction,spa,p3)</atm_procs_list>
---
>       <atm_procs_list type="string">(shoc,cldFraction,p3)</atm_procs_list>

I also need to remove these vars from run/data/scream_output.yaml:

      - aero_g_sw
      - aero_ssa_sw
      - aero_tau_lw
      - aero_tau_sw
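As a rough sketch of that last step, the four entries could be filtered out with a small line-based helper. The function name `drop_list_items` and the sample YAML layout (the `fields:` key and the `field_a`/`field_b` entries) are placeholders, not the actual structure of scream_output.yaml.

```python
# Names to drop, taken from the list above.
DROP = {"aero_g_sw", "aero_ssa_sw", "aero_tau_lw", "aero_tau_sw"}


def drop_list_items(yaml_text, names=DROP):
    """Remove '- name' list lines whose item is in `names`; keep all other lines."""
    kept = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # A YAML block-sequence entry looks like "- item".
        if stripped.startswith("- ") and stripped[2:].strip() in names:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical input: only the four aero_* names come from this thread.
sample_yaml = """fields:
  - field_a
  - aero_g_sw
  - aero_ssa_sw
  - field_b
  - aero_tau_lw
  - aero_tau_sw
"""

filtered = drop_list_items(sample_yaml)
print(filtered)
```

A line filter like this avoids a YAML library dependency, but it assumes each field sits on its own `- name` line.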

Error:

0:   [EAMXX] Run  start time stamp: 0001-01-01.000000
0:   [EAMXX] Case start time stamp: 0001-01-01.000000
0:   [EAMXX] set_initial_conditions ...
0: MPICH ERROR [Rank 0] [job id 2142769.0] [Mon May  9 11:43:49 2022] [nid001925] - Abort(-46) (rank 0 in comm 432): application called MPI_Abort(comm=0xC4000006, -46) - process 0
0: 
0: terminate called after throwing an instance of 'std::runtime_error'
0:   what():  Kokkos allocation "F_Phi" is being deallocated after Kokkos::finalize was called
...
0: #8  0x1488dc95885d in _Unwind_Resume
0:      at ../../../cpe-gcc-11.2.0-202108140355.9bf1fd589a5c1/libgcc/unwind.inc:241
0: #9  0x3319199 in _ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/externals/ekat/extern/kokkos/core/src/impl/Kokkos_SharedAlloc.cpp:240
0: #10  0x27a4287 in _ZN6Kokkos4Impl23SharedAllocationTrackerD4Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/externals/ekat/extern/kokkos/core/src/impl/Kokkos_SharedAlloc.hpp:494
0: #11  0x27a4287 in _ZN6Kokkos4Impl11ViewTrackerINS_4ViewIA4_A4_A73_PN13KokkosKernels7Batched12Experimental6VectorINS5_9VectorTagINS5_4SIMDIdNS_4CudaEEELi1EEEEEJNS_11LayoutRightENS_9CudaSpaceENS_12MemoryTraitsILj8EEEEEEED2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/externals/ekat/extern/kokkos/core/src/impl/Kokkos_ViewTracker.hpp:67
0: #12  0x27a42cf in _ZN6Kokkos4ViewIA4_A4_A73_PN13KokkosKernels7Batched12Experimental6VectorINS3_9VectorTagINS3_4SIMDIdNS_4CudaEEELi1EEEEEJNS_11LayoutRightENS_9CudaSpaceENS_12MemoryTraitsILj8EEEEED2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/externals/ekat/extern/kokkos/core/src/Kokkos_View.hpp:1405
0: #13  0x27a4a19 in _ZN5Homme15ElementsForcingD2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/components/homme/src/theta-l_kokkos/cxx/ElementsForcing.hpp:8
0: #14  0x29e025d in _ZN5Homme8ElementsD2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/components/homme/src/share/cxx/Elements.hpp:32
0: #15  0x2a1fe49 in _ZN9__gnu_cxx13new_allocatorIN5Homme8ElementsEE7destroyIS2_EEvPT_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/ext/new_allocator.h:156
0: #16  0x2a1fc1a in _ZNSt16allocator_traitsISaIN5Homme8ElementsEEE7destroyIS1_EEvRS2_PT_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/alloc_traits.h:531
0: #17  0x2a1f2b4 in _ZNSt23_Sp_counted_ptr_inplaceIN5Homme8ElementsESaIS1_ELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:560
0: #18  0x5d8668 in _ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:158
0: #19  0x5d24b8 in _ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:733
0: #20  0x27ad747 in _ZNSt12__shared_ptrIN5Homme8ElementsELN9__gnu_cxx12_Lock_policyE2EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr_base.h:1183
0: #21  0x27ad763 in _ZNSt10shared_ptrIN5Homme8ElementsEED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/shared_ptr.h:121
0: #22  0x2a1e491 in _ZN5Homme4Impl6holderINS_8ElementsEED2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/components/homme/src/share/cxx/utilities/StdMeta.hpp:32
0: #23  0x2a1e4b9 in _ZN5Homme4Impl6holderINS_8ElementsEED0Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/components/homme/src/share/cxx/utilities/StdMeta.hpp:32
0: #24  0x27cc2e9 in _ZNKSt14default_deleteIN5Homme4Impl11holder_baseEEclEPS2_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/unique_ptr.h:85
0: #25  0x27bbf2f in _ZNSt10unique_ptrIN5Homme4Impl11holder_baseESt14default_deleteIS2_EED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/unique_ptr.h:361
0: #26  0x27aca91 in _ZN5Homme3anyD2Ev
0:      at /pscratch/sd/n/ndk/wacmy/s19-may9/components/homme/src/share/cxx/utilities/StdMeta.hpp:68
0: #27  0x27f01a1 in _ZNSt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5Homme3anyEED2Ev
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_pair.h:211
0: #28  0x27f01cd in _ZN9__gnu_cxx13new_allocatorISt13_Rb_tree_nodeISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5Homme3anyEEEE7destroyISC_EEvPT_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/ext/new_allocator.h:156
0: #29  0x27e98a1 in _ZNSt16allocator_traitsISaISt13_Rb_tree_nodeISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5Homme3anyEEEEE7destroyISB_EEvRSD_PT_
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/alloc_traits.h:531
0: #30  0x27e327a in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE15_M_destroy_nodeEPSt13_Rb_tree_nodeISA_E
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_tree.h:646
0: #31  0x27d82a6 in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE12_M_drop_nodeEPSt13_Rb_tree_nodeISA_E
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_tree.h:654
0: #32  0x2a32da7 in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE8_M_eraseEPSt13_Rb_tree_nodeISA_E
0:      at /opt/cray/pe/gcc/10.3.0/snos/include/g++/bits/stl_tree.h:1921
0: #33  0x2a32d84 in _ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N5Homme3anyEESt10_Select1stISA_ESt4lessIS5_ESaISA_EE8_M_eraseEPSt13_Rb_tree_nodeISA_E
...
bartgol commented 2 years ago

The error

0: terminate called after throwing an instance of 'std::runtime_error'
0:   what():  Kokkos allocation "F_Phi" is being deallocated after Kokkos::finalize was called

is never the real error. It appears whenever execution terminates without properly deallocating all views. A "healthy" execution ends by 1) deallocating views, then 2) finalizing Kokkos. When an error happens, we finalize Kokkos immediately and then (indirectly) call exit. The call to exit cleans up variables with static storage, including the Homme context, which stores a bunch of Views. When those views are destroyed, Kokkos complains that Kokkos::finalize was already called.

Given the exception handling mechanism, even looking at the stack might give a wrong impression about where the error happened. The actual error happened when

MPICH ERROR [Rank 0] [job id 2142769.0] [Mon May  9 11:43:49 2022] [nid001925] - Abort(-46) (rank 0 in comm 432): application called MPI_Abort(comm=0xC4000006, -46) 

was printed. Unfortunately, the only way to find out where that happened is by running through gdb/cuda-gdb.
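The shutdown ordering described in this comment can be illustrated with a toy model. This is not the Kokkos or Homme API: `ToyRuntime`, `ToyView`, and both shutdown functions are made-up names, and the "real error" message is a placeholder. It only shows why the deallocation complaint appears after the actual failure and hides it.

```python
class ToyRuntime:
    """Stand-in for a runtime with an explicit finalize step (not Kokkos)."""

    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True


class ToyView:
    """Stand-in for an allocation that must be released before finalize."""

    def __init__(self, runtime, label):
        self.runtime = runtime
        self.label = label

    def release(self):
        if self.runtime.finalized:
            raise RuntimeError(
                f'allocation "{self.label}" is being deallocated after finalize was called'
            )


def healthy_shutdown(runtime, views):
    """Normal path: 1) release views, 2) finalize the runtime."""
    for view in views:
        view.release()
    runtime.finalize()


def error_shutdown(runtime, views):
    """Error path: finalize right away, then long-lived views are cleaned up."""
    messages = []
    try:
        raise ValueError("the real error (placeholder)")
    except ValueError as exc:
        messages.append(str(exc))
        runtime.finalize()  # finalize happens before the views are released
    for view in views:  # models exit-time cleanup of static storage
        try:
            view.release()
        except RuntimeError as exc:
            messages.append(str(exc))
    return messages


ok_runtime = ToyRuntime()
healthy_shutdown(ok_runtime, [ToyView(ok_runtime, "F_Phi")])

bad_runtime = ToyRuntime()
messages = error_shutdown(bad_runtime, [ToyView(bad_runtime, "F_Phi")])
for message in messages:
    print(message)
```

In the toy model the first message is the real failure and the second is only a side effect of the cleanup order, which matches the point made above about the "F_Phi" message.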

ndkeen commented 2 years ago

Can you reproduce?

bartgol commented 2 years ago

Sorry, I haven't had time to tackle this. I will try to run it tomorrow.

bartgol commented 2 years ago

Yes, I can reproduce on mappy:

[EAMXX] create_fields ... done!
[EAMXX] initialize_fields ... 
  [EAMXX] Run  start time stamp: 0001-01-01.000000
  [EAMXX] Case start time stamp: 0001-01-01.000000
  [EAMXX] set_initial_conditions ... 
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 7 CREATE FROM 4
with errorcode -46.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

I will debug a bit.

ndkeen commented 2 years ago

OK thanks, that makes me feel better that I wasn't doing something wrong. I think it would be good to be able to run without SPA, but I'm not sure of the overall priority. It may be a situation where we use it now to debug memory issues and then don't need it again...

bartgol commented 2 years ago

It seems it errors out while trying to read aero_g_sw. I forgot to add the IC for those fields, so I'm atmchange-ing and trying again.

bartgol commented 2 years ago

After modifying the XML file (*), the test now runs correctly for me (it's at the 2nd step, going slowly because mappy is busy with AT runs).

(*) atmchange does not allow adding entries. Since the hard-coding of fields to a value was not present in the initial XML file, atmchange won't work (I want to add that feature, but it's maybe not too urgent). So I had to modify the XML file manually.

PeterCaldwell commented 2 years ago

Great sleuthing, Luca!!!

Noel - can you confirm that you can run without SPA by directly modifying the namelist_scream.xml file to add the 4 variable initializations?

ndkeen commented 2 years ago

Now wait, do you mean how I noted in the first comment?

PeterCaldwell commented 2 years ago

Argh, yeah. Luca's "atmchange won't work because these vars aren't in scream_defaults.xml" explanation made so much sense that I assumed it must be right. But how is it that he is able to initialize these vars and you aren't? Maybe a subtle difference here?

ndkeen commented 2 years ago

So then Luca cannot reproduce my issue?

bartgol commented 2 years ago

To be clear, this is where I added those lines in the XML:

  <Initial__Conditions>
    <Filename type="string">/sems-data-store/ACME/inputdata/atm/scream/init/init_ne4np4.nc</Filename>
    <Restart__Casename type="string">SMS_D_Ld1_P4x1.ne4_ne4.F2000SCREAMv1.mappy_gnu9.20220510_154748_qjy99f</Restart__Casename>
    <Physics__GLL>
      <aero_g_sw>0.0</aero_g_sw>
      <aero_ssa_sw>0.0</aero_ssa_sw>
      <aero_tau_sw>0.0</aero_tau_sw>
      <aero_tau_lw>0.0</aero_tau_lw>
    </Physics__GLL>
  </Initial__Conditions>

Notice that they are indented inside the Physics__GLL section. I'm not sure if that's what you had (can't tell from the snippet you posted).

ndkeen commented 2 years ago

  <Initial__Conditions>
    <Filename type="string">/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/init_ne4np4.nc</Filename>
    <Restart__Casename type="string">SMS_D_Ld1_P4x1.ne4_ne4.F2000SCREAMv1.perlmutter_gnugpu.20220509_103948_f7cpfn</Restart__Casename>
    <aero_g_sw>0.0</aero_g_sw>
    <aero_ssa_sw>0.0</aero_ssa_sw>
    <aero_tau_sw>0.0</aero_tau_sw>
    <aero_tau_lw>0.0</aero_tau_lw>
  </Initial__Conditions>

bartgol commented 2 years ago

Ah! Yeah, the individual fields need to be nested within a section corresponding to the grid where they are defined.
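Since atmchange cannot add new entries here, one way to apply this nesting without hand-editing is a short script built on Python's standard `xml.etree.ElementTree`. This is only a sketch: the helper name `nest_constant_fields` and the `<config>` root element in the sample are assumptions, while the section and field names come from the snippets in this thread.

```python
import xml.etree.ElementTree as ET

AERO_FIELDS = ("aero_g_sw", "aero_ssa_sw", "aero_tau_sw", "aero_tau_lw")


def nest_constant_fields(root, grid_tag="Physics__GLL",
                         fields=AERO_FIELDS, value="0.0"):
    """Set each field to `value` inside <Initial__Conditions>/<grid_tag>."""
    if root.tag == "Initial__Conditions":
        ic = root
    else:
        ic = root.find(".//Initial__Conditions")
    if ic is None:
        raise ValueError("no <Initial__Conditions> section found")

    # Create the grid section if it is not there yet.
    grid = ic.find(grid_tag)
    if grid is None:
        grid = ET.SubElement(ic, grid_tag)

    for name in fields:
        # Remove entries placed directly under Initial__Conditions.
        for stray in ic.findall(name):
            ic.remove(stray)
        node = grid.find(name)
        if node is None:
            node = ET.SubElement(grid, name)
        node.text = value
    return root


# Hypothetical input: <config> is a placeholder root element.
sample = """<config>
  <Initial__Conditions>
    <Filename type="string">init_ne4np4.nc</Filename>
    <aero_g_sw>0.0</aero_g_sw>
    <aero_ssa_sw>0.0</aero_ssa_sw>
  </Initial__Conditions>
</config>"""

root = nest_constant_fields(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))
```

For a real case one would presumably parse namelist_scream.xml with `ET.parse(...)`, call the helper on `tree.getroot()`, and write the tree back with `tree.write(...)`. Keeping a copy of the original file is a sensible precaution.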

ndkeen commented 2 years ago

Of course, I should have known these are "Physics_GLL" variables I'm setting to zero.

That seems to be working, thanks.

I can run ne30/ne120/ne512 without SPA.

bartgol commented 2 years ago

Yeah, sorry about that. I thought I mentioned it before, but I probably didn't. :/