Running with INFO_DBUG=3 fails due to atm fields

jonbob commented 1 month ago

Running E3SM with INFO_DBUG set to 3 fails for all cases with active atm. This was confirmed to be true for:

multiple platforms (at least pm-cpu and chrysalis)
multiple atm meshes (ne4pg2 and ne30pg2)
both debug and non-debug
F-cases and B-cases

The error message and traceback point to a bad value being passed to reprosum:

19: SHR_REPROSUM_CALC: Input contains  0.00000E+00 NaNs and  0.12000E+02 INFs on MPI task      19
19:  ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
19: Image              PC                Routine            Line        Source             
19: libpnetcdf.so.3.0  000015554B93C12A  tracebackqq_          Unknown  Unknown
19: e3sm.exe           0000000010D6A918  shr_abort_mod_mp_         114  shr_abort_mod.F90
19: e3sm.exe           0000000010D6A776  shr_abort_mod_mp_          61  shr_abort_mod.F90
19: e3sm.exe           0000000010E920D6  shr_reprosum_mod_         644  shr_reprosum_mod.F90
19: e3sm.exe           000000000075610E  seq_diag_mct_mp_s        2329  seq_diag_mct.F90
19: e3sm.exe           00000000004A8EBF  component_mod_mp_         973  component_mod.F90
19: e3sm.exe           0000000000445D1D  cime_comp_mod_mp_        2103  cime_comp_mod.F90
19: e3sm.exe           0000000000489F04  MAIN__                    122  cime_driver.F90
19: e3sm.exe           0000000000427E22  Unknown               Unknown  Unknown
19: libc-2.28.so       000015554503D493  __libc_start_main     Unknown  Unknown
19: e3sm.exe           0000000000427D2E  Unknown               Unknown  Unknown

during cpl calculations summing fields from the atm to the cpl. If the BFBFLAG is flipped the reprosum call is replaced by local sums and it shows:

comm_diag xxx sorr   1 0.0000000000000000000E+00 recv IC atm Sa_z
comm_diag xxx sorr   2 0.0000000000000000000E+00 recv IC atm Sa_topo
comm_diag xxx sorr   3 0.0000000000000000000E+00 recv IC atm Sa_u
comm_diag xxx sorr   4 0.0000000000000000000E+00 recv IC atm Sa_v
comm_diag xxx sorr   5 0.0000000000000000000E+00 recv IC atm Sa_ugust
comm_diag xxx sorr   6                  Infinity recv IC atm Sa_tbot
comm_diag xxx sorr   7 0.0000000000000000000E+00 recv IC atm Sa_ptem
comm_diag xxx sorr   8 0.0000000000000000000E+00 recv IC atm Sa_shum
comm_diag xxx sorr   9 0.0000000000000000000E+00 recv IC atm Sa_pbot
comm_diag xxx sorr  10 0.0000000000000000000E+00 recv IC atm Sa_dens
comm_diag xxx sorr  11 0.0000000000000000000E+00 recv IC atm Sa_uovern
comm_diag xxx sorr  12 0.0000000000000000000E+00 recv IC atm Sa_pslv
comm_diag xxx sorr  13 0.0000000000000000000E+00 recv IC atm Sa_co2prog
comm_diag xxx sorr  14 0.0000000000000000000E+00 recv IC atm Sa_co2diag
comm_diag xxx sorr  15 0.0000000000000000000E+00 recv IC atm Faxa_rainc
comm_diag xxx sorr  16 0.0000000000000000000E+00 recv IC atm Faxa_rainl
comm_diag xxx sorr  17 0.0000000000000000000E+00 recv IC atm Faxa_snowc
comm_diag xxx sorr  18 0.0000000000000000000E+00 recv IC atm Faxa_snowl
comm_diag xxx sorr  19 0.0000000000000000000E+00 recv IC atm Faxa_lwdn
comm_diag xxx sorr  20 0.0000000000000000000E+00 recv IC atm Faxa_swndr
comm_diag xxx sorr  21 0.0000000000000000000E+00 recv IC atm Faxa_swvdr
comm_diag xxx sorr  22 0.0000000000000000000E+00 recv IC atm Faxa_swndf
comm_diag xxx sorr  23 0.0000000000000000000E+00 recv IC atm Faxa_swvdf
comm_diag xxx sorr  24 0.0000000000000000000E+00 recv IC atm Faxa_swnet

So somehow INFs are being passed from EAM to the cpl for field Sa_tbot

crterai commented 1 month ago

Hi @jonbob, I'm going to try to reproduce this crash at ne4pg2. Do these crashes occur quite quickly into a simulation?

jonbob commented 1 month ago

Sorry not to be more clear -- the failures are right after or maybe during initialization while processing the first atm-to-cpl fields

crterai commented 1 month ago

Just a quick update: I have been able to reproduce this error by running an F2010 ne4 case on pm-cpu with SMS.ne4pg2_oQU480.F2010.pm-cpu_intel and setting INFO_DBUG=2 or higher with ./xmlchange INFO_DBUG=2. Tests with INFO_DBUG=1 or lower don't produce this crash and this occurs for both regular and debug (_D) simulations. Because it's happening during initialization, the flags that I typically use (e.g., state_debug_checks) don't help with diagnosing the issue.

And as @jonbob noted previously, despite Sa_tbot being the variable that gets reported to have an infinity value, the IC file doesn't show any bad temperature-related values (all within reasonable bounds).

One difference that I can see in the current IC file (eami_mam4_Linoz_ne4np4_L80_c20231010.nc) compared to what was used previously (cami_mam3_Linoz_ne4np4_L72_c160909.nc) is that the following variables are now on the pg2, rather than the np4 grid: ICEFRAC, SICTHK, SNOWHICE, TS1, TS2, TS3, TS4, and TSICE. I'm guessing that this is not the reason since the coupling with the other components occur on the pg2 grid, but noting it here in case it rings any bells for people.

Curious if @wlin7 or @singhbalwinder have seen a similar type of error in the past.

singhbalwinder commented 1 month ago

This is an interesting find! I have never tried running the model with a non-default INFO_DBUG. From the name of the flag, it sounds like it is for printing out additional debug info. I don't know why it would change the code in a way that crashes the model. From a simple grep, it is not obvious where this flag is being used in the F90 codes (all I get are the namelist and XML files).

rljacob commented 1 month ago

INFO_DBUG in the xml becomes info_debug in the drv_in namelist.

rljacob commented 1 month ago

All the extra info is triggered by "info_debug > 1" so there should be no difference between 2 and 3. IIRC this was used in early cpl7 development.

jonbob commented 1 week ago

Any update, @crterai ?

crterai commented 1 week ago

Hi @jonbob, Sorry, I haven't made progress on this. I'd tried a couple other tests but came up empty before I was out of the office for vacation. I have one thing lead related to the IC file, and will give it a try this week. If that doesn't work, I'll try to bisect the failure. Related to this, do you have any info on when this INFO_DBUG=3 worked most recently in the code?

jonbob commented 1 week ago

Thanks @crterai. I have to apologize -- I don't know when this last worked. If you end up having to bisect, I'll check and see if I can find a starting point

wlin7 commented 1 week ago

@jonbob and all, any hint at which stage of initialization the following were computed and sent to cpl

comm_diag xxx sorr   1 0.0000000000000000000E+00 recv IC atm Sa_z
comm_diag xxx sorr   2 0.0000000000000000000E+00 recv IC atm Sa_topo
comm_diag xxx sorr   3 0.0000000000000000000E+00 recv IC atm Sa_u
comm_diag xxx sorr   4 0.0000000000000000000E+00 recv IC atm Sa_v
comm_diag xxx sorr   5 0.0000000000000000000E+00 recv IC atm Sa_ugust
comm_diag xxx sorr   6                  Infinity recv IC atm Sa_tbot
comm_diag xxx sorr   7 0.0000000000000000000E+00 recv IC atm Sa_ptem

As all the fields except for Tbot are 0s, it seems T was still not initialized but the others are. In that case, as initialization proceeds, and the full T array get initialized, the ensuing calculations should be fine, as long as tbot values at that stage are not used in real calculation in any component, which is likely so just like the other variables (that are summed to 0s).

rljacob commented 1 week ago

This is during the driver/coupler init which is also when each model does its init.

EAM has 2 init phases and the above "Infinity" appears in the first one. The diags from the second init phase have ok values for Sa_tbot.

component_init_cc:mct) : Initialize component atm
comm_diag xxx sorr   1 6.4214335112857160000E+15 recv IC2 atm Sa_z
comm_diag xxx sorr   2 0.0000000000000000000E+00 recv IC2 atm Sa_topo
comm_diag xxx sorr   3-1.6982918828672656250E+13 recv IC2 atm Sa_u
comm_diag xxx sorr   4-2.4347597627818921875E+14 recv IC2 atm Sa_v
comm_diag xxx sorr   5 1.8817479641314200000E+14 recv IC2 atm Sa_ugust
comm_diag xxx sorr   6 1.4535153334464899200E+17 recv IC2 atm Sa_tbot
comm_diag xxx sorr   7 1.4541381644267897600E+17 recv IC2 atm Sa_ptem
comm_diag xxx sorr   8 4.7200746125261816406E+12 recv IC2 atm Sa_shum
comm_diag xxx sorr   9 5.0194152656744964096E+19 recv IC2 atm Sa_pbot

And so does the rest of the run.

It is curious why Sa_tbot and only Sa_tbot has "Inf" in it but that value is never used and the field has good values sent to the coupler by the time the init phase finishes and the model is running.

rljacob commented 1 week ago

EAM loads the data to send to the coupler in atm_export in eam/src/cpl/atm_import_export.F90 and this is the line that copies from EAM data structure to coupler: a2x(index_a2x_Sa_tbot ,ig) = cam_out(c)%tbot(i) The first call to atm_export occurs in cam_init_mct in the "if (first_time) then" block. https://github.com/E3SM-Project/E3SM/blob/0e0df9e3d59e287953d7cde945582ecbe239bf36/components/eam/src/cpl/atm_comp_mct.F90#L404

A few lines above that is the call to cam_init.

wlin7 commented 1 week ago

Thanks @rljacob . Your explanation suggests the issue is due to the code flow, independent of atm IC and case and run types (F or B), but has no consequence. Likely atm_export has not been called when the first sums are produced. And not unexpected, the problem can be re-produced using v3.LR.historical configuration.

rljacob commented 1 week ago

atm_export has been called. The sequence is:

cpl:component_init_cc (atm)
    eam:atm_init_mct (first_time)
         eam: timmgr_init
         eam: cam_init
         eam: atm_export
cpl: component_diag "recv IC atm"  <-- bad value for Sa_tbot
cpl:  compute albedos
cpl: component_exch(atm) with comm_diag "send atm"
cpl:component_init_cc(atm)
   eam:atm_init_mct(not first_time)
        eam: atm_import
        eam: cam_run1
        eam: atm_export
cpl:component_exch(atm) with comm_diag "recv IC2 atm"   <--good value for Sa_tbot

wlin7 commented 1 week ago

Ok, I see. During my check, I didn't realize component_init_cc (atm) would proceed to call atm_init_mct.

ndkeen commented 1 week ago

With INFO_DBUG=2: I tried with BFBFLAG=FALSE and indeed it continues to run (no fail).

And then back to BFBFLAG=TRUE: Note I also tried with/without DEBUG as well as with intel/gnu -- same error. Finally, to rule out MPI issue, I tried with P1x1 and see same issue. And with 1 MPI, I ran thru gdb and see this path:

#26 0x00007ffff3132478 in PMPI_Abort () from /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12
#27 0x00007ffff564469d in pmpi_abort__ () from /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpifort_intel.so.12
#28 0x000000000cd0ae60 in shr_mpi_mod::shr_mpi_abort (string=..., rcode=<error reading variable: Cannot access memory at address 0x0>, .tmp.STRING.len_V$1646=46) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/share/util/shr_mpi_mod.F90:2127
#29 0x000000000cc0e117 in shr_abort_mod::shr_abort_abort (string=..., rc=<error reading variable: Cannot access memory at address 0x0>, .tmp.STRING.len_V$22=46) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/share/util/shr_abort_mod.F90:69
#30 0x000000000cd648db in shr_reprosum_mod::shr_reprosum_calc (arr=<error reading variable: value requires 116736 bytes, which is more than max-value-size>, arr_gsum=..., nsummands=384, dsummands=384, nflds=38, allow_infnan=<error reading variable: Cannot access memory at address 0x0>, 
    ddpdd_sum=<error reading variable: Cannot access memory at address 0x0>, arr_gbl_max=<error reading variable: value requires 1125899906247040 bytes, which is more than max-value-size>, arr_gbl_max_out=<error reading variable: Cannot access memory at address 0x0>, arr_max_levels=..., arr_max_levels_out=..., 
    gbl_max_nsummands=<error reading variable: Cannot access memory at address 0x0>, gbl_max_nsummands_out=<error reading variable: Cannot access memory at address 0x0>, gbl_count=<error reading variable: Cannot access memory at address 0x0>, repro_sum_validate=<error reading variable: Cannot access memory at addre
ss 0x0>, 
    repro_sum_stats=<error reading variable: Cannot access memory at address 0x0>, rel_diff=..., commid=-2080374781) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/share/util/shr_reprosum_mod.F90:644
#31 0x0000000000758079 in seq_diag_mct::seq_diag_avect_mct (infodata=..., id=2, av=..., dom=0x1cf0a8c0, gsmap=0x1cf0aea0, comment=..., .tmp.COMMENT.len_V$44bd=267) at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/driver-mct/main/seq_diag_mct.F90:2329
#32 0x00000000004ba82c in component_mod::component_diag (infodata=..., comp=..., flow=..., comment=..., info_debug=2, timer_diag=<error reading variable: value requires 226387296 bytes, which is more than max-value-size>, .tmp.FLOW.len_V$79f4=3, .tmp.COMMENT.len_V$79f6=11, .tmp.TIMER_DIAG.len_V$79fa=0)
    at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/driver-mct/main/component_mod.F90:973
#33 0x0000000000457942 in cime_comp_mod::cime_init () at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/driver-mct/main/cime_comp_mod.F90:2145
#34 0x000000000049ceaa in cime_driver () at /global/cfs/cdirs/e3sm/ndk/repos/me25-apr15/driver-mct/main/cime_driver.F90:122

so that we see who is calling that reprosum if that helps.

Does it make sense to set some vars to 0 somewhere just to get around this?

wlin7 commented 1 week ago

Tracking the code flow following the calling sequence that @rljacob outlined above, focusing on the 1st call to atm_init_mct, it is comfirmed cam_out%tbot values are Infinity during the 1st call of atm_export. Further tracing the call down stream

eam:atm_init_mct (first_time)
         eam: timmgr_init
         eam: cam_init (cam_out, ...)
                   atm2hub_alloc (allocate cam_out, including cam_out%tbot and properly initialize all to 0s)
                   ...
                   phys_init (cam_out, ...)
                            phys_inidat (cam_out,...)
                                     cam_out(lchnk)%tbot(:) = posinf
         .......
         eam: atm_export

Clearly, after tbot are initialized to 0s together with other variables inside atm2hub_alloc, tbot alone are re-assigned with posinf (infinity) inside phys_inidat. Don't know if anyone can recall the exact reason. We can assume it is purposely assigned unphysical values, anticipating to be updated during actual calculations. Only that it would cause problem when doing global sum diagnostic for IC data (when the atm comp has actually only completed stage 1 init).

The solution is to either not call component_diag(infodata, atm, flow='c2x', comment='recv IC atm' , or not to assign unphysical value poinf to tbot in physpkg.F90::phys_init::phys_inidat. The latter is preferred. Zero values for tbot are already unphysical, after all.

A fix PR will be put it after having further feedback from the team. Given how it plays out, it is clear this has no consequence on simulation results.

E3SM-Project / E3SM

Running with INFO_DBUG=3 fails due to atm fields #6307