geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

[BUG/ISSUE] Wetdep: Error in area resuspension in middle levels #262

Closed kilicomu closed 1 year ago

kilicomu commented 1 year ago

What institution are you from?

Wolfson Atmospheric Chemistry Laboratories

Description of the problem

For a while now I've not been able to run GCHP on my cluster. With MERRA2, I get the following error when running in release mode:

===============================================================================
WETDEP: ERROR at   12  12  71 for species   59 in area RESUSPENSION in middle levels
 LS          :  T 
 PDOWN       :   0.000000000000000E+000
 QQ          :   0.000000000000000E+000
 ALPHA       :   0.000000000000000E+000
 ALPHA2      :   0.000000000000000E+000
 RAINFRAC    :   0.000000000000000E+000
 WASHFRAC    :   0.000000000000000E+000
 MASS_WASH   :   0.000000000000000E+000
 MASS_NOWASH :   0.000000000000000E+000
 WETLOSS     :   0.000000000000000E+000
 GAINED      :   0.000000000000000E+000
 LOST        :   0.000000000000000E+000
 DSpc(NW,:)  :   0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000
 Spc(I,J,:N) :                      NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN                     NaN                     NaN  
                     NaN  
===============================================================================

GEOS-Chem ERROR [0005]: Error encountered in wet deposition!
 --> LOCATION:  -> at SAFETY (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0005]: Error encountered in "Safety"!
 --> LOCATION:  -> at Do_Complete_Reevap (in module GeosCore/wetscav_mod.F90)
     - DO_LINEAR_CHEM: Linearized chemistry at 2019/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
     - LINOZ_CHEM3: Doing LINOZ

GEOS-Chem ERROR [0005]:
 --> LOCATION:  -> at WetDep (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0005]: Error encountered in "Wetdep"!
 --> LOCATION:  -> at Do_WetDep (in module GeosCore/wetscav_mod.F90)

which looks very similar to this open GC Classic error (and some closed errors).

I know whereabouts in the code it's dying, but that's not helping me figure out what's causing the problem. I've attached four archived runs, two with a debug build and two with a default build (one each for MERRA2 and GEOS-FP). All four archives have all DEBUG level logging turned on and have all the individual PET log files in the Logs directory alongside the GCHP log file.

The MERRA2 run with the debug build fails differently:

 HEMCO: array pointer vertically flipped relative to MAPL Import GMI_LOSS_RIPD
 HEMCO: array pointer vertically flipped relative to MAPL Import GMI_PROD_RIPD
 HEMCO: array pointer vertically flipped relative to MAPL Import GMI_LOSS_RP
 HEMCO: array pointer vertically flipped relative to MAPL Import GMI_PROD_RP
TIMEZONES (i.e. OFFSETS FROM UTC) WERE READ FROM A FILE 
  --- Do convection now
  --- Convection done!
  --- Do drydep now
      Use FULL PBL:  F
[login1:81896:0:81896] Caught signal 8 (Floating point exception: floating-point invalid operation)
  --- Drydep done!
  --- Do emissions now
HEMCO (VOLCANO): Opening /mnt/lustre/groups/chem-acm-2018/earth0_data/GEOS/ExtData/HEMCO/VOLCANO/v2021-09/2019/07/so2_volcanic_emissions_Carns.20190701.rc
HEMCO (VOLCANO): Opening /mnt/lustre/groups/chem-acm-2018/earth0_data/GEOS/ExtData/HEMCO/VOLCANO/v2021-09/2019/07/so2_volcanic_emissions_Carns.20190701.rc

/users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/HEMCO_GridComp/HEMCO/src/Core/hco_calc_mod.F90: [ hco_calc_mod_mp_get_current_emissions_() ]
      ...  
     1040 #if defined( ESMF_ )
     1041           ! SDE 2017-01-07: Temporary kludge. MAPL ExtData sets missing
     1042           ! data to 1e15, but HEMCO uses a different value!
==>  1043           IF ( ( TMPVAL == HCO_MISSVAL ) .or. ( TMPVAL > 1.0e+14 ) ) THEN 
     1044 #else
     1045           IF ( TMPVAL == HCO_MISSVAL ) THEN 
     1046 #endif

==== backtrace (tid:  81896) ==== 
 0 0x0000000007240833 hco_calc_mod_mp_get_current_emissions_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/HEMCO_GridComp/HEMCO/src/Core/hco_calc_mod.F90:1043

which looks relevant to this open HEMCO issue.

With GEOS-FP, the release build of the model dies as follows in the first timestep:

 Setting history variable pointers to GC and Export States
 AGCM Date: 2019/07/01  Time: 00:10:00  Throughput(days/day)[Avg Tot Run]:      2.1      2.1     11.6  TimeRemaining(Est) 361:40:47   68.7% :  54.3% Mem Comm:Used
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.822E+05  3.764E+03
 GEOS-Chem phase           -1 :
 DoConv   :  T
 DoDryDep :  F
 DoEmis   :  F
 DoTend   :  F
 DoTurb   :  T
 DoChem   :  F
 DoWetDep :  T

  --- Do convection now
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K):    5 169           NaN        N2O
                     NaN                     NaN                     NaN
  8.351165364691387E-009 -1.261049537314628E-009                     NaN
   300.000000000000                          NaN
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K):    4 169           NaN        N2O
                     NaN                     NaN                     NaN
  4.156234141583976E-011 -5.585751344131670E-012                     NaN
   300.000000000000                          NaN
...
GEOS-Chem ERROR [0005]: Error encountered in "Do_Cloud_Convection"!
 --> LOCATION:  -> at Do_Convection (in module GeosCore/convection_mod.F)
pe=00005 FAIL at line=01146    gchp_chunk_mod.F90                       <Error calling DO_CONVECTION>
pe=00005 FAIL at line=03881    Chem_GridCompMod.F90                     <status=1>
pe=00005 FAIL at line=02975    Chem_GridCompMod.F90                     <status=1>
pe=00005 FAIL at line=01799    MAPL_Generic.F90                         <status=1>
pe=00005 FAIL at line=00556    GCHP_GridCompMod.F90                     <status=1>
pe=00005 FAIL at line=01799    MAPL_Generic.F90                         <status=1>
pe=00005 FAIL at line=01280    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=01232    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00808    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00948    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00261    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00218    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00152    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00031    GCHPctm.F90                              <status=1>

And the same GEOS-FP run, with a debug build:

###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
     - LINOZ_CHEM3: Doing LINOZ
  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!
  --- Do diagnostics now
  --- Diagnostics done!

 Setting history variable pointers to GC and Export States

/users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/KPP/fullchem/fullchem_HetStateFuncs.F90: [ fullchem_hetstatefuncs_mp_halide_conc_() ]
      ...  
      443     !=======================================================================
      444     ! Fraction of SALACL in total fine sea salt 
      445     !=======================================================================
==>   446     H%frac_SALACL = C(ind_SALACL) / ( C(ind_SALACL) + C(ind_NIT) + C(ind_SO4) )
      447  
      448   END SUBROUTINE Halide_Conc
      449 !EOC 

==== backtrace (tid: 173618) ==== 
 0 0x00000000000b719c __libm_pow_e7()  ???:0
 1 0x00000000010a683c ucx_mod_mp_calc_strat_aer_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/GeosCore/ucx_mod.F90:1896
 2 0x0000000000741ba9 chemistry_mod_mp_do_chemistry_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/GeosCore/chemistry_mod.F90:241
 3 0x000000000057424e gchp_chunk_mod_mp_gchp_chunk_run_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/Interfaces/GCHP/gchp_chunk_mod.F90:1335
 4 0x00000000004f3e61 chem_gridcompmod_mp_run__()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/Interfaces/GCHP/Chem_GridCompMod.F90:3855
 5 0x000000000048f511 chem_gridcompmod_mp_run2_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/Interfaces/GCHP/Chem_GridCompMod.F90:2975
 6 0x00000000095f7014 ESMCI::FTable::callVFuncPtr()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:2167
 7 0x00000000095fab9f ESMCI_FTableCallEntryPointVMHop()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:824
 8 0x0000000009b5d62a ESMCI::VMK::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
 9 0x0000000009b2c425 ESMCI::VM::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VM.C:1216
10 0x00000000095f848a c_esmc_ftablecallentrypointvm_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:981
11 0x00000000096c051d esmf_compmod_mp_esmf_compexecute_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_Comp.F90:1222
12 0x00000000095b7dc1 esmf_gridcompmod_mp_esmf_gridcomprun_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
13 0x00000000076a3073 mapl_genericmod_mp_mapl_genericwrapper_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/generic/MAPL_Generic.F90:1794
14 0x00000000095f7014 ESMCI::FTable::callVFuncPtr()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:2167
15 0x00000000095fab9f ESMCI_FTableCallEntryPointVMHop()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:824
16 0x0000000009b5d62a ESMCI::VMK::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
17 0x0000000009b2c425 ESMCI::VM::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VM.C:1216
18 0x00000000095f848a c_esmc_ftablecallentrypointvm_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:981
19 0x00000000096c051d esmf_compmod_mp_esmf_compexecute_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_Comp.F90:1222
20 0x00000000095b7dc1 esmf_gridcompmod_mp_esmf_gridcomprun_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
21 0x000000000043932d gchp_gridcompmod_mp_run_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHP_GridComp/GCHP_GridCompMod.F90:551
22 0x00000000095f7014 ESMCI::FTable::callVFuncPtr()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:2167
23 0x00000000095fab9f ESMCI_FTableCallEntryPointVMHop()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:824
24 0x0000000009b5d62a ESMCI::VMK::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
25 0x0000000009b2c425 ESMCI::VM::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VM.C:1216
26 0x00000000095f848a c_esmc_ftablecallentrypointvm_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:981
27 0x00000000096c051d esmf_compmod_mp_esmf_compexecute_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_Comp.F90:1222
28 0x00000000095b7dc1 esmf_gridcompmod_mp_esmf_gridcomprun_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
29 0x00000000076a3073 mapl_genericmod_mp_mapl_genericwrapper_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/generic/MAPL_Generic.F90:1794
30 0x00000000095f7014 ESMCI::FTable::callVFuncPtr()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:2167
31 0x00000000095fab9f ESMCI_FTableCallEntryPointVMHop()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:824
32 0x0000000009b5d62a ESMCI::VMK::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
33 0x0000000009b2c425 ESMCI::VM::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VM.C:1216
34 0x00000000095f848a c_esmc_ftablecallentrypointvm_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:981
35 0x00000000096c051d esmf_compmod_mp_esmf_compexecute_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_Comp.F90:1222
36 0x00000000095b7dc1 esmf_gridcompmod_mp_esmf_gridcomprun_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
37 0x00000000049e660b mapl_capgridcompmod_mp_step_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1277
38 0x00000000049e518a mapl_capgridcompmod_mp_run_mapl_gridcomp_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1225
39 0x00000000049d9cee mapl_capgridcompmod_mp_run_gc_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:807
40 0x00000000095f7014 ESMCI::FTable::callVFuncPtr()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:2167
41 0x00000000095fab9f ESMCI_FTableCallEntryPointVMHop()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:824
42 0x0000000009b5d62a ESMCI::VMK::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
43 0x0000000009b2c425 ESMCI::VM::enter()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Infrastructure/VM/src/ESMCI_VM.C:1216
44 0x00000000095f848a c_esmc_ftablecallentrypointvm_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMCI_FTable.C:981
45 0x00000000096c051d esmf_compmod_mp_esmf_compexecute_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_Comp.F90:1222
46 0x00000000095b7dc1 esmf_gridcompmod_mp_esmf_gridcomprun_()  /mnt/lustre/users/klcm500/GCHP/AMY_TESTING/ESMF/src/esmf-8.3.1/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
47 0x00000000049dea4d mapl_capgridcompmod_mp_run_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:946
48 0x00000000049ab436 mapl_capmod_mp_run_model_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:260
49 0x00000000049a6e11 mapl_capmod_mp_run_member_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:218
50 0x00000000049a15e3 mapl_capmod_mp_run_ensemble_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:152
51 0x00000000049a067e mapl_capmod_mp_run_()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:129
52 0x000000000042b3a3 MAIN__()  /users/klcm500/scratch/GCHP/AMY_TESTING/RUNDIRS/TESTING.14.0.0/CodeDir/src/GCHPctm.F90:31
53 0x0000000000426ad2 main()  ???:0
54 0x0000000000022555 __libc_start_main()  ???:0
55 0x00000000004269e9 _start()  ???:0
=================================
forrtl: error (75): floating point exception

I have tried:

I'd appreciate another set of eyes on the issues. I guess the error is propagating along from something that I'm missing (hopefully nothing too obvious...), but I'm not sure where that might be!
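When an error may be propagating from an earlier step, one way to narrow it down is to find the first suspicious line in each PET log rather than the last. The following is a minimal Python sketch under stated assumptions: the helper name `first_anomaly` and the pattern list are hypothetical, and the sample lines are copied from the output quoted in this issue rather than read from a real log file.

```python
import re

# Patterns chosen from messages quoted in this thread; this list is an
# assumption and is not exhaustive.
ANOMALY = re.compile(r"NaN|Infinity|Floating point exception|ERROR")


def first_anomaly(lines):
    """Return (line_number, text) for the first suspicious line, or None."""
    for number, text in enumerate(lines, start=1):
        if ANOMALY.search(text):
            return number, text.strip()
    return None


# Sample lines copied from the GEOS-FP release-build output above.
sample_log = [
    "  --- Do convection now",
    "Infinity in DO_CLOUD_CONVECTION!",
    "K, IC, Q(K):    5 169           NaN        N2O",
    'GEOS-Chem ERROR [0005]: Error encountered in "Do_Cloud_Convection"!',
]

print(first_anomaly(sample_log))
```

Running the same scan over every PET log and comparing the earliest hit from each may show which component first produced a NaN, though it cannot by itself explain why.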

GEOS-Chem version

Versions greater than 13.3.x (log files generated with v14.0.0-rc.2 c85903e0)

Description of code modifications

None.

Log files

I've attached four archived runs - a debug and release build run for both MERRA2 and GEOS-FP. I'm much more invested in MERRA2 than GEOS-FP, but I've posted both just cause.

MERRA2_DEBUG_FAILURE.tar.gz MERRA2_NON_DEBUG_FAILURE.tar.gz GEOSFP_DEBUG_FAILURE.tar.gz GEOSFP_NON_DEBUG_FAILURE.tar.gz

Software versions

yantosca commented 1 year ago

Also tagging @Jourdan-He and @SaptSinha

yantosca commented 1 year ago

Hi @kilicomu, thanks for bringing this up. This looks like a good old-fashioned div-by-zero. We could put an error trap on that:

H%frac_SALACL = 0.0_dp
IF ( C(ind_SALACL) + C(ind_NIT) + C(ind_SO4) > 0.0_dp ) THEN
   H%frac_SALACL = C(ind_SALACL) / ( C(ind_SALACL) + C(ind_NIT) + C(ind_SO4) )
ENDIF

Now, as to why GCHP is causing a div-by-zero here, that's another matter. Maybe the scavenging is too vigorous. Or the initial concentration of e.g. SALACL is too low to start with.
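For readers less familiar with Fortran, the logic of the trap can be sketched in Python. This only illustrates the guard and is not GEOS-Chem code: the function name `salacl_fraction` and its plain float arguments are hypothetical stand-ins for `C(ind_SALACL)`, `C(ind_NIT)` and `C(ind_SO4)`.

```python
def salacl_fraction(salacl, nit, so4):
    """Fraction of SALACL in the total, or 0.0 when the total is not positive."""
    total = salacl + nit + so4
    if total > 0.0:
        return salacl / total
    # Guard: all three concentrations are zero (or the sum is negative),
    # so skip the division instead of producing a NaN or an exception.
    return 0.0


print(salacl_fraction(1.0, 1.0, 2.0))  # ordinary case
print(salacl_fraction(0.0, 0.0, 0.0))  # guarded case
```

Note that a guard like this only hides the symptom. If the concentrations are already NaN on entry, the comparison is false and the function returns 0.0, but the NaN remains in the species array.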

kilicomu commented 1 year ago

@yantosca Thanks for looking! Yep, it's a division by 0 - I didn't want to put a trap in without understanding the implications! I'll patch my code and see where that gets me.

kilicomu commented 1 year ago

The other thing to note is that I can run fine on another machine, which adds to the mystery (at least for me...).

sdeastham commented 1 year ago

Wow, this is weird. I was going to suggest double-checking that the restarts are OK but it sounds like you've already done that. One thing which springs to mind though - can you provide your run script please, and the log of output from running it? I didn't see it in your zipped run directories, and there were a couple of big changes in the recent GCHP run directory structures which necessitated some reworking of the run script.

lizziel commented 1 year ago

I also wonder about your successful run with another machine. Did you use different library versions for that run? Do you have the logs, config files, and build info for that run for comparison?
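A comparison like this can be done with a plain text diff of the build or config summaries from the two machines. Below is a small Python sketch using the standard `difflib` module. The helper `config_diff` and the key/value lines are hypothetical placeholders, not taken from the attached archives.

```python
import difflib


def config_diff(working, failing):
    """Return only the lines that differ between two lists of config lines."""
    diff = difflib.unified_diff(
        working, failing, fromfile="working", tofile="failing", lineterm="", n=0
    )
    # Keep added/removed lines; drop the '---'/'+++' headers and '@@' hunks.
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical summaries of the two environments (illustrative values only).
working = ["ESMF=8.3.1", "MPI=openmpi-4.1.1", "BUILD_TYPE=Release"]
failing = ["ESMF=8.3.1", "MPI=intelmpi-2021.4", "BUILD_TYPE=Release"]

for line in config_diff(working, failing):
    print(line)
```

In practice the two lists would be read from files such as the build logs or run-directory config files of each machine.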

kilicomu commented 1 year ago

@yantosca I tried the trap with no luck, however...

For some unknown reason v14 rc2 has now started working on the cluster. The only extra thing I can think of that I have done is update the various OFFLINE_* emissions (we had a few older versions about). I don't know enough to say whether that is likely to have helped with this problem. I tried reverting each of them to its previous version (in isolation, not in combination) but wasn't able to reproduce the crash.

I'll do some more testing to make sure that it's running ok. Very confused at the moment.

kilicomu commented 1 year ago

Okay, v14 is working on our cluster now, so I'll close this off.

Not sure what was causing the issue - if it reappears, I'll reopen / submit a new issue.

yantosca commented 1 year ago

Thanks, and keep us posted @kilicomu!

gopikrishnangs44 commented 1 year ago

Dear @yantosca

I am also facing the same issue in GCHP 13.4.1

===============================================================================
WETDEP: ERROR at    3  24  71 for species  138 in area RESUSPENSION in middle levels
 LS          :  T
 PDOWN       :    0.0000000000000000
 QQ          :    0.0000000000000000
 ALPHA       :    0.0000000000000000
 ALPHA2      :    0.0000000000000000
 RAINFRAC    :    0.0000000000000000
 WASHFRAC    :    0.0000000000000000
 MASS_WASH   :    0.0000000000000000
 MASS_NOWASH :    0.0000000000000000
 WETLOSS     :   -0.0000000000000000
 GAINED      :    0.0000000000000000
 LOST        :    0.0000000000000000
 DSpc(NW,:)  :    0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000
 Spc(I,J,:N) :   -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -0.0000000000000000       -9.8013765304884227E-014  -4.0817861128689080E-013  -1.0324533279007332E-012  -2.0711193567688042E-012  -3.6405626033690480E-013  -2.6209416130405968E-013  -2.3673551955976509E-013  -2.3740312524962716E-013  -1.4234488710346750E-012  -1.4509473118661018E-011  -2.3968404449659108E-012  -1.0287849195419953E-012  -4.5864945968005623E-013  -8.7890911707160104E-015  -4.2659806717846077E-014  -4.1756321310777058E-014  -4.0220692421135937E-014  -1.3451420263145457E-013  -3.0000055059266572E-013  -1.9191988990041727E-013  -1.7040181084133550E-014  -6.3353635697744400E-015  -4.1581902659834436E-016  -3.0150800557263397E-017  -2.1140404894559353E-018  -1.3207868296327852E-019  -1.9370730337060598E-019  -7.4754814450147021E-020  -4.6970747174663798E-021  -1.9126806928154054E-022  -3.4985176380119531E-023  -1.6744150964025935E-024  -1.7268226344822395E-025  -7.0195597929206797E-027  -4.3447839439029528E-028  -6.8643451151811298E-030  -7.9654884283347433E-031  -1.3097758150016814E-032  -5.1546556238383985E-034  -6.2718834206183521E-036  -5.2068049614229462E-037  -5.5637034090873920E-037  -8.7537125081611168E-038  -1.4970268187844388E-039  -7.1066636296705220E-040  -6.2930450442075586E-039  -6.1786462623629168E-038  -1.1944436317234292E-036  -6.2755035758343011E-035  -2.3866020461974976E-033  -2.8723629669265353E-033  -2.6752945237784374E-033  -2.6472153664589243E-033  -2.2622804766059453E-033  -1.9050968892677856E-033  -1.4967468261472745E-033  -1.0793752081898327E-033  -9.4102115545591866E-034  -9.3673571409477690E-034  -8.9047746392126991E-034  -8.1099428567308586E-034  -7.1127574568423450E-034
===============================================================================

GEOS-Chem ERROR [0009]: Error encountered in wet deposition!
 --> LOCATION:  -> at SAFETY (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0009]: Error encountered in "Safety"!
 --> LOCATION:  -> at Do_Complete_Reevap (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0009]:
 --> LOCATION:  -> at WetDep (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0009]: Error encountered in "Wetdep"!
 --> LOCATION:  -> at Do_WetDep (in module GeosCore/wetscav_mod.F90)
pe=00009 FAIL at line=01358    gchp_chunk_mod.F90                       <Error calling DO_WETDEP>
pe=00009 FAIL at line=03680    Chem_GridCompMod.F90                     <status=1>
pe=00009 FAIL at line=02734    Chem_GridCompMod.F90                     <status=1>
pe=00009 FAIL at line=01844    MAPL_Generic.F90                         <Error during the 'Run' stage of the gridded component 'GCHPchem'>
pe=00009 FAIL at line=00556    GCHP_GridCompMod.F90                     <status=1>
pe=00009 FAIL at line=01844    MAPL_Generic.F90                         <Error during the 'Run' stage of the gridded component 'GCHP'>
pe=00009 FAIL at line=01257    MAPL_CapGridComp.F90                     <status=1>
pe=00009 FAIL at line=01181    MAPL_CapGridComp.F90                     <status=1>
pe=00009 FAIL at line=00804    MAPL_CapGridComp.F90                     <status=1>
pe=00009 FAIL at line=00934    MAPL_CapGridComp.F90                     <status=1>
pe=00009 FAIL at line=00247    MAPL_Cap.F90                             <status=1>
pe=00009 FAIL at line=00211    MAPL_Cap.F90                             <status=1>
pe=00009 FAIL at line=00154    MAPL_Cap.F90                             <status=1>
pe=00009 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00009 FAIL at line=00031    GCHPctm.F90                              <status=1>
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 9 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

gopikrishnangs44 commented 1 year ago

This is an issue with the time step used for the simulation, as suggested in the previous comments.

Earlier I was using a 3600/1800 time step. Changing it to 1800/900 solved the issue.