Closed — ndkeen closed this issue 2 years ago
The error
0: terminate called after throwing an instance of 'std::runtime_error'
0: what(): Kokkos allocation "F_Phi" is being deallocated after Kokkos::finalize was called
is never the real error. It appears whenever execution terminates without properly deallocating all views. A "healthy" execution ends by 1) deallocating views, then 2) finalizing Kokkos. When an error happens, we finalize Kokkos immediately and then (indirectly) call exit. The call to exit cleans up variables with static storage, which include the Homme context, which stores a bunch of Views. When those Views are destroyed, Kokkos complains that Kokkos::finalize was already called.
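The destruction-order trap described above can be sketched in plain C++. This is a minimal illustration, not the actual EAMXX/Homme code: `Library`, `View`, and `Context` are invented stand-ins for Kokkos, Kokkos::View, and the Homme context.

```cpp
#include <cassert>
#include <cstdio>

// Minimal sketch of the ordering hazard: an object with static storage
// duration that owns Views is destroyed by exit()'s cleanup AFTER the
// library has been finalized. All names here are illustrative.

namespace Library {            // stands in for Kokkos
  bool finalized = false;
  void finalize() { finalized = true; }
}

int complaints = 0;            // counts late-deallocation complaints

struct View {                  // stands in for a Kokkos::View
  ~View() {
    // Real Kokkos throws std::runtime_error("... is being deallocated
    // after Kokkos::finalize was called") at this point.
    if (Library::finalized) {
      std::puts("View deallocated after finalize");
      ++complaints;
    }
  }
};

// Like the Homme context: a static-storage object owning Views. On the
// error path the code calls Library::finalize() and then exit(); exit()
// destroys this object, so ~View runs with finalized == true.
struct Context { View v; };
```

On the healthy path the Views are destroyed first and `finalize` comes last, so the destructor never fires after finalization and no complaint is printed.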
Because of the exception-handling mechanism, even looking at the stack can give a wrong impression of where the error happened. The actual error occurred when
MPICH ERROR [Rank 0] [job id 2142769.0] [Mon May 9 11:43:49 2022] [nid001925] - Abort(-46) (rank 0 in comm 432): application called MPI_Abort(comm=0xC4000006, -46)
was printed. Unfortunately, the only way to find out where that happened is by running it through gdb/cuda-gdb.
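The point about misleading stacks can be illustrated with a hedged C++ sketch (invented names, not the real EAMXX call chain): the throw happens deep in the run, but the handler that ends up calling MPI_Abort runs only after the stack has unwound, so a backtrace taken at termination no longer shows the original error site.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Illustrative sketch: by the time the error is handled and the abort is
// issued, the frames of the real error site have already been unwound.
// fake_mpi_abort / read_field / run_model are hypothetical names.

std::string last_abort_reason;            // records what the "abort" saw

void fake_mpi_abort(const std::string& why) {
  // Stand-in for MPI_Abort(comm, -46). Breaking here in gdb/cuda-gdb
  // tells you which handler fired, but the throwing frames are gone.
  last_abort_reason = why;
}

void read_field() {
  // The real error site (e.g. a field missing from the IC file).
  throw std::runtime_error("field not found in IC file");
}

void run_model() {
  try {
    read_field();
  } catch (const std::exception& e) {
    // A backtrace taken here no longer contains read_field(): the stack
    // has already unwound, which is why the terminating trace misleads.
    fake_mpi_abort(e.what());
  }
}
```

Only the exception's message survives the unwinding, which is why stepping through under a debugger is needed to locate the throw site itself.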
Can you reproduce?
Sorry, I haven't had time to tackle this. I will try to run it tomorrow.
Yes, I can reproduce on mappy:
[EAMXX] create_fields ... done!
[EAMXX] initialize_fields ...
[EAMXX] Run start time stamp: 0001-01-01.000000
[EAMXX] Case start time stamp: 0001-01-01.000000
[EAMXX] set_initial_conditions ...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 7 CREATE FROM 4
with errorcode -46.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
I will debug a bit.
OK thanks, it makes me feel better that it wasn't just me doing something wrong. I think it would be good to be able to run without SPA, but I'm not sure of the overall priority -- it may be a situation where we could use it now to debug memory issues and then not need it again...
It seems it errors out while trying to read aero_g_sw. I forgot to add the IC for those fields, so I'm atmchange-ing and trying again.
Upon modification of the XML file (*), the test now runs correctly for me (it's at the 2nd step, going slowly since mappy is busy with AT runs).
(*) atmchange does not allow adding new entries. Since the hard-coded field values were not present in the initial XML file, atmchange won't work (I want to add that feature, but it's maybe not too urgent). So I had to manually modify the XML file.
Great sleuthing, Luca!!!
Noel - can you confirm that you can run without SPA by directly modifying the namelist_scream.xml file to add the 4 variable initializations?
Now wait, do you mean as I noted in the first comment?
Argh, yeah. Luca's "atmchange won't work because these vars aren't in scream_defaults.xml" explanation made so much sense that I assumed it must be right. But how is it that he is able to initialize these vars and you aren't? Maybe a subtle difference here?
So then Luca cannot reproduce my issue?
To be clear, this is where I added those lines in the XML:
<Initial__Conditions>
<Filename type="string">/sems-data-store/ACME/inputdata/atm/scream/init/init_ne4np4.nc</Filename>
<Restart__Casename type="string">SMS_D_Ld1_P4x1.ne4_ne4.F2000SCREAMv1.mappy_gnu9.20220510_154748_qjy99f</Restart__Casename>
<Physics__GLL>
<aero_g_sw>0.0</aero_g_sw>
<aero_ssa_sw>0.0</aero_ssa_sw>
<aero_tau_sw>0.0</aero_tau_sw>
<aero_tau_lw>0.0</aero_tau_lw>
</Physics__GLL>
</Initial__Conditions>
Notice that they are indented inside the Physics__GLL section. I'm not sure if that's what you had (I can't tell from the snippet you posted).
<Initial__Conditions>
<Filename type="string">/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/init_ne4np4.nc</Filename>
<Restart__Casename type="string">SMS_D_Ld1_P4x1.ne4_ne4.F2000SCREAMv1.perlmutter_gnugpu.20220509_103948_f7cpfn</Restart__Casename>
<aero_g_sw>0.0</aero_g_sw>
<aero_ssa_sw>0.0</aero_ssa_sw>
<aero_tau_sw>0.0</aero_tau_sw>
<aero_tau_lw>0.0</aero_tau_lw>
</Initial__Conditions>
Ah! Yeah, the individual fields need to be nested within a section corresponding to the grid where they are defined.
Of course, I should have known these are "Physics__GLL" variables I'm setting to zero.
That seems to be working, thanks.
I can run ne30/ne120/ne512 without SPA.
Yeah, sorry about that. I thought I mentioned it before, but I probably didn't. :/
With ne120, I was hitting an error running without SPA that looked like the same issue we see when we do not set the aero_ vars to 0. However, when I try to make a simple reproducer with ne4, I'm getting a different issue, and may be doing something else wrong, so I'm just trying to document a little. Using master of May 9th, I'm trying
SMS_D_Ld1_P4x1.ne4_ne4.F2000SCREAMv1.perlmutter_gnugpu
(the test will hit a runtime error due to the team size being too large, but that's a different issue). I then modify namelist_scream.xml like so:
I also need to remove these vars from run/data/scream_output.yaml.
Error: