geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io
Other
23 stars 25 forks source link

[BUG/ISSUE] 1-month benchmark with GCHP 13.3.0-rc.0 crashes with a floating-point exception error #170

Closed msulprizio closed 2 years ago

msulprizio commented 3 years ago

The GCClassic 13.3.0 1-month benchmark finished, but GCHP crashed 10 days in with an FPE error. Both simulations use gcc 10.2.0 as the compiler.

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2abc63ac43ff in ???
#1  0x744696 in __gckpp_linearalgebra_MOD_kppdecomp
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/KPP/fullchem/gckpp_LinearAlgebra.F90:68
#2  0x6bd7af in ros_decomp
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/KPP/fullchem/gckpp_Integrator.F90:855
#3  0x6bd7af in ros_preparematrix
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/KPP/fullchem/gckpp_Integrator.F90:794
#4  0x6bd7af in ros_integrator
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/KPP/fullchem/gckpp_Integrator.F90:576
#5  0x6bf4db in __gckpp_integrator_MOD_rosenbrock
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/KPP/fullchem/gckpp_Integrator.F90:417
#6  0x6c0454 in __gckpp_integrator_MOD_integrate
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChe
m_GridComp/geos-chem/KPP/fullchem/gckpp_Integrator.F90:105
#7  0x4c77d5 in __fullchem_mod_MOD_do_fullchem
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/GeosCore/fullchem_mod.F90:1172
#8  0x47deb9 in __chemistry_mod_MOD_do_chemistry
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/GeosCore/chemistry_mod.F90:310
#9  0x456415 in __gchp_chunk_mod_MOD_gchp_chunk_run
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/Interfaces/GCHP/gchp_chunk_mod.F90:1329
#10  0x446c0d in run_
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/Interfaces/GCHP/Chem_GridCompMod.F90:3529
#11  0x44c1db in run2
        at /n/home05/msulprizio/BM/1mon/GCHP_13.3.0-rc.0/CodeDir/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/Interfaces/GCHP/Chem_GridCompMod.F90:2678

Looking back at the internal benchmarks for 13.3.0, this error may have originated in 13.3.0-alpha.9 with the updates for IONO2 hydrolysis. That run failed after 5 simulation days with an integrator error.

yantosca commented 3 years ago

Some more backstory: I ran a 1-month benchmark (compiled with GNU Compiler collection 9.3.0) with GCHP 13.3.0-alpha.9, which died at 20190705 with the above-mentioned "Integrate failed twice" error. I then re-ran starting from 0 GMT on 20190705 and the run kept on going.

I then ran again from 20190701 having compiled GCHP with all debug flags, and the run did not stop with error for several days. I ended up cancelling the run before it failed.

msulprizio commented 3 years ago

I also tried running with ifort19 to see if this is an issue with the GNU compiler only, but that simulation crashed on day 8 with an integrate error:

 Forced exit from Rosenbrock due to the following error:
 --> Step size too small: T + 10*H = T or H < Roundoff
 T=   245.778383327478      and H=  1.417091866022493E-013
 ### INTEGRATE RETURNED ERROR AT:           11           8          14
 Forced exit from Rosenbrock due to the following error:
 --> Step size too small: T + 10*H = T or H < Roundoff
 T=   175.018104784995      and H=  6.323346512749509E-014
 ## INTEGRATE FAILED TWICE !!! 
===============================================================================
GEOS-CHEM ERROR: Integrator error code : -7
STOP at INTEGRATE_KPP
===============================================================================
lizziel commented 3 years ago

Internal benchmark plots for 13.3.0-alpha.8 are now available here. @sdeastham, do these look okay to you? I know you were interested in seeing a GCC vs GCHP comparison for 13.3. Since GCHP 13.3 is failing this is the only comparison available. The error we are encountering may be from the next alpha (alpha.9) (https://github.com/geoschem/geos-chem/pull/923).

I was able to reproduce the integration error for alpha.9 with an out-of-the-box benchmark run. I also hit it in a benchmark c24 run and a standard c24 run as well. Integration errors crept in several days into the run but it did not crash for several days after that when it finally hit an "Integrate failed twice" error which stopped the model.

I added debug code from Bob to print species conc and reaction rates if integrate failed twice error is hit. I also added the debug code to where there is a single integration error and the model keeps going. I will post the log when the run stops.

lizziel commented 3 years ago

Following up, here is the printout for the species concentration and reaction rates at the first integration error encountered in GCHP 13.3.0-alpha.9. error.txt

sdeastham commented 3 years ago

Thanks for this @lizziel ! It looks like this alpha does include the fix that Viral found - and worryingly it looks like GCHP and GCC still differ significantly with regards to Bry concentrations. I'll hold off drawing any real conclusions until we have that GCHP benchmark to compare to GCC for v13.3 proper.

With regards to the error in alpha 9, I'm afraid I don't see any really obvious red flags in the debug printout. None of the species concentrations or reaction rates look totally bonkers - even the really high rates seem to be for reactions which I would expect to be extremely fast. The best thoughts I have are to:

  1. Perhaps look at whether this is a cloudy grid box (if I read the output correctly and we're showing box I,J,K, then this is level 20 at ~4 km, so this could well be a cloudy cell); and
  2. Start regressing by one commit at a time to figure out which commit caused the issue (not hugely helpful I know!)
msulprizio commented 3 years ago

Hi all. I'm also attaching my GCHP log file (gchp_13.3.0-rc.0_ifort19.log.txt) from the 13.3.0-rc.0 run with ifort19. As mentioned above, this run crashed after 8 days with the integrate failed twice error. The log files includes the debug printouts of species concentrations and reaction rates at the problematic grid box (in this case 11, 8, 14).

The updated IONO2 hydrolysis in PR https://github.com/geoschem/geos-chem/pull/923 seems to be causing this error. That update modified the following reactions in fullchem.eqn.

Removed:

HOI   = ISALA : IuptkByAlkSALA1stOrd( SR_MW(ind_HOI),   0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
HOI   = ISALC : IuptkByAlkSALC1stOrd( SR_MW(ind_HOI),   0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO  = ISALA : IuptkByAlkSALA1stOrd( SR_MW(ind_IONO),  0.02_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO  = ISALC : IuptkByAlkSALC1stOrd( SR_MW(ind_IONO),  0.02_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO2 = ISALA : IuptkByAlkSALA1stOrd( SR_MW(ind_IONO2), 0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO2 = ISALC : IuptkByAlkSALC1stOrd( SR_MW(ind_IONO2), 0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}

Added:

IONO2 + H2O = HOI + HNO3 : IONO2uptkByH2O( State_Het ); {2021/09/16 XW, TSherwen}

In focusing on the debug printout for the above species and reactions in both my log file and the one posted above by Lizzie, I noticed very low reaction rates for this new reaction.

Lizzie's error.txt: 5.8516488151640404E-022 IONO2 + H2O --> HOI + HNO3

My log file: 1.596795261397990E-021 IONO2 + H2O --> HOI + HNO3

Do we have a sense for a reasonable value here? Would it make sense to shunt this reaction if values are low? Tagging @tsherwen @yantosca here for input since they are most familiar with the code updates.

lizziel commented 3 years ago

@yantosca reported the issue may be in heterogeneous chemistry reaction rates blowing up for IONO2/IONO + BrSALC. My log file that I posted earlier indeed shows that:

>    1.1950987868203623E-005 IONO + BrSALA --> LOx + IBr + HNO2
>    **6801472.0895964401      IONO + BrSALC --> LOx + IBr + HNO2**
>    5.1150400834920699E-015 IONO + SALACL --> LOx + ICl + HNO2
>    3.5366517464898399E-010 IONO + SALCCL --> LOx + ICl + HNO2
>    5.7602301289131510E-006 IONO2 + BrSALA --> LOx + IBr + HNO3
>    **111562428.34602603      IONO2 + BrSALC --> LOx + IBr + HNO3**
>    2.4653868219479857E-015 IONO2 + SALACL --> LOx + ICl + HNO3
>    1.7830621336654444E-010 IONO2 + SALCCL --> LOx + ICl + HNO3

@sdeastham, I looked into whether it was running into integration errors in cloud boxes and indeed it was, at least for the first several integration errors that cropped up over a few days. Cloud was around 0.18 in the first integration error box.

yantosca commented 3 years ago

BTW, it looks like the GCHP benchmark is blowing up because IONO + BrSALC and IONO2 + BrSALC het rates are blowing up:

   4.9678489882643267E-004 HBr --> BrSALA
   3.2607058417964628E-006 HBr --> BrSALC
   1.4499203489447897E-003 HI --> AERI
   2.5899073459324751E-005 HI --> ISALA
   6.7283590710720089E-006 HI --> ISALC
   2.0386692041413055E-004 I2O2 --> 2 AERI + 2 LOx
   3.7554878016218852E-006 I2O2 --> 2 ISALA + 2 LOx
   1.4145774628374170E-006 I2O2 --> 2 ISALC + 2 LOx
   1.9842995393414253E-004 I2O3 --> 2 AERI + 3 LOx
   3.6558160416080298E-006 I2O3 --> 2 ISALA + 3 LOx
   1.3797437686898082E-006 I2O3 --> 2 ISALC + 3 LOx
   1.9340720969139444E-004 I2O4 --> 2 AERI + 4 LOx
   3.5637175014994857E-006 I2O4 --> 2 ISALA + 4 LOx
   1.3474520299088180E-006 I2O4 --> 2 ISALC + 4 LOx
   1.3372563189859506E-021 IONO2 + H2O --> HOI + HNO3
   1.2861851491670647E-016 IONO + BrSALA --> LOx + IBr + HNO2
   145321105.36913359      IONO + BrSALC --> LOx + IBr + HNO2
   3.3944431861508362E-012 IONO + SALACL --> LOx + ICl + HNO2
   7.1816909520611817E-016 IONO + SALCCL --> LOx + ICl + HNO2
   6.2077319451027596E-017 IONO2 + BrSALA --> LOx + IBr + HNO3
   476876756.45259035      IONO2 + BrSALC --> LOx + IBr + HNO3
   1.6383172684081340E-012 IONO2 + SALACL --> LOx + ICl + HNO3
   3.6525271819296414E-016 IONO2 + SALCCL --> LOx + ICl + HNO3

Most hetchem rates should be small (definitely less than 1, maybe in the range 1e-5 to 1e-20). The rates that are very much bigger than 1 are too high.

These rates were blowing up because BrSALC had a very low concentration (1e-16 molec/cm3), which caused a division in rate law routine kIIR1Ltd to result in a very large reaction rate. The division in question is here:

    ! Prevent div by zero.   NOTE: Now use 1.0e-20 as the error trap for
    ! concEduct.  100 (the previous criterion) was too large.  We should
    ! never have have a concentration as low as 1e-20 molec/cm3, as that
    ! will definitely blow up the division. (bmy, 9/20/21)
    IF ( concEduct < 1.0e-20_dp                          ) RETURN

I think it comes down to this error trap. It was letting BrSALC = 1e-16 molec/cm3 through. I will try again with this error criterion changed to 1e-8 to see if the run gets through.

yantosca commented 3 years ago

I should also mention, the original error criterion in gckpp_HetRates.F90 was IF ( concEduct < 100 ) RETURN. But this was too high and some reactions that should have gotten done were flushed to zero. I had tried to make this error criterion more generous but now I think it needs to be tightened up, as mentioned above.

sdeastham commented 3 years ago

Aha! I vaguely remember that concEduct setting - I think I'm responsible for setting it to 100. I'm surprised that 100 molec/cm3 was too high - even at 50 km the air density is 7.7e16 molec/cm3, so a concentration of 100 molec/cm3 is around 1 part per quadrillion.

yantosca commented 3 years ago

This has been addressed with commits: https://github.com/geoschem/geos-chem/pull/985/commits/d19c493304e909ea3985c11ecd1cfec592b39190 and https://github.com/geoschem/geos-chem/pull/985/commits/7e2be0db4cb484bf174f3730c6c8d7c346cba67b. These have now been merged into our 13.3.0 release development stream. We will run benchmarks for GCC_13.3.0-rc.1 and GCHP_13.3.0-rc.1.