Closed msulprizio closed 2 years ago
Some more backstory: I ran a 1-month benchmark (compiled with GNU Compiler collection 9.3.0) with GCHP 13.3.0-alpha.9, which died at 20190705 with the above-mentioned "Integrate failed twice" error. I then re-ran starting from 0 GMT on 20190705 and the run kept on going.
I then ran again from 20190701 having compiled GCHP with all debug flags, and the run did not stop with error for several days. I ended up cancelling the run before it failed.
I also tried running with ifort19 to see if this is an issue with the GNU compiler only, but that simulation crashed on day 8 with an integrate error:
Forced exit from Rosenbrock due to the following error:
--> Step size too small: T + 10*H = T or H < Roundoff
T= 245.778383327478 and H= 1.417091866022493E-013
### INTEGRATE RETURNED ERROR AT: 11 8 14
Forced exit from Rosenbrock due to the following error:
--> Step size too small: T + 10*H = T or H < Roundoff
T= 175.018104784995 and H= 6.323346512749509E-014
## INTEGRATE FAILED TWICE !!!
===============================================================================
GEOS-CHEM ERROR: Integrator error code : -7
STOP at INTEGRATE_KPP
===============================================================================
Internal benchmark plots for 13.3.0-alpha.8 are now available here. @sdeastham, do these look okay to you? I know you were interested in seeing a GCC vs GCHP comparison for 13.3. Since GCHP 13.3 is failing this is the only comparison available. The error we are encountering may be from the next alpha (alpha.9) (https://github.com/geoschem/geos-chem/pull/923).
I was able to reproduce the integration error for alpha.9 with an out-of-the-box benchmark run. I also hit it in a benchmark c24 run and a standard c24 run as well. Integration errors crept in several days into the run but it did not crash for several days after that when it finally hit an "Integrate failed twice" error which stopped the model.
I added debug code from Bob to print species conc and reaction rates if integrate failed twice error is hit. I also added the debug code to where there is a single integration error and the model keeps going. I will post the log when the run stops.
Following up, here is the printout for the species concentration and reaction rates at the first integration error encountered in GCHP 13.3.0-alpha.9. error.txt
Thanks for this @lizziel ! It looks like this alpha does include the fix that Viral found - and worryingly it looks like GCHP and GCC still differ significantly with regards to Bry concentrations. I'll hold off drawing any real conclusions until we have that GCHP benchmark to compare to GCC for v13.3 proper.
With regards to the error in alpha 9, I'm afraid I don't see any really obvious red flags in the debug printout. None of the species concentrations or reaction rates look totally bonkers - even the really high rates seem to be for reactions which I would expect to be extremely fast. The best thoughts I have are to:
Hi all. I'm also attaching my GCHP log file (gchp_13.3.0-rc.0_ifort19.log.txt) from the 13.3.0-rc.0 run with ifort19. As mentioned above, this run crashed after 8 days with the integrate failed twice error. The log files includes the debug printouts of species concentrations and reaction rates at the problematic grid box (in this case 11, 8, 14).
The updated IONO2 hydrolysis in PR https://github.com/geoschem/geos-chem/pull/923 seems to be causing this error. That update modified the following reactions in fullchem.eqn.
Removed:
HOI = ISALA : IuptkByAlkSALA1stOrd( SR_MW(ind_HOI), 0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
HOI = ISALC : IuptkByAlkSALC1stOrd( SR_MW(ind_HOI), 0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO = ISALA : IuptkByAlkSALA1stOrd( SR_MW(ind_IONO), 0.02_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO = ISALC : IuptkByAlkSALC1stOrd( SR_MW(ind_IONO), 0.02_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO2 = ISALA : IuptkByAlkSALA1stOrd( SR_MW(ind_IONO2), 0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
IONO2 = ISALC : IuptkByAlkSALC1stOrd( SR_MW(ind_IONO2), 0.01_dp, State_Het ); {2017/09/22; Sherwen2016b;TS,JAS,SDE}
Added:
IONO2 + H2O = HOI + HNO3 : IONO2uptkByH2O( State_Het ); {2021/09/16 XW, TSherwen}
In focusing on the debug printout for the above species and reactions in both my log file and the one posted above by Lizzie, I noticed very low reaction rates for this new reaction.
Lizzie's error.txt:
5.8516488151640404E-022 IONO2 + H2O --> HOI + HNO3
My log file:
1.596795261397990E-021 IONO2 + H2O --> HOI + HNO3
Do we have a sense for a reasonable value here? Would it make sense to shunt this reaction if values are low? Tagging @tsherwen @yantosca here for input since they are most familiar with the code updates.
@yantosca reported the issue may be in heterogeneous chemistry reaction rates blowing up for IONO2/IONO + BrSALC. My log file that I posted earlier indeed shows that:
> 1.1950987868203623E-005 IONO + BrSALA --> LOx + IBr + HNO2
> **6801472.0895964401 IONO + BrSALC --> LOx + IBr + HNO2**
> 5.1150400834920699E-015 IONO + SALACL --> LOx + ICl + HNO2
> 3.5366517464898399E-010 IONO + SALCCL --> LOx + ICl + HNO2
> 5.7602301289131510E-006 IONO2 + BrSALA --> LOx + IBr + HNO3
> **111562428.34602603 IONO2 + BrSALC --> LOx + IBr + HNO3**
> 2.4653868219479857E-015 IONO2 + SALACL --> LOx + ICl + HNO3
> 1.7830621336654444E-010 IONO2 + SALCCL --> LOx + ICl + HNO3
@sdeastham, I looked into whether it was running into integration errors in cloud boxes and indeed it was, at least for the first several integration errors that cropped up over a few days. Cloud was around 0.18 in the first integration error box.
BTW, it looks like the GCHP benchmark is blowing up because IONO + BrSALC and IONO2 + BrSALC het rates are blowing up:
4.9678489882643267E-004 HBr --> BrSALA
3.2607058417964628E-006 HBr --> BrSALC
1.4499203489447897E-003 HI --> AERI
2.5899073459324751E-005 HI --> ISALA
6.7283590710720089E-006 HI --> ISALC
2.0386692041413055E-004 I2O2 --> 2 AERI + 2 LOx
3.7554878016218852E-006 I2O2 --> 2 ISALA + 2 LOx
1.4145774628374170E-006 I2O2 --> 2 ISALC + 2 LOx
1.9842995393414253E-004 I2O3 --> 2 AERI + 3 LOx
3.6558160416080298E-006 I2O3 --> 2 ISALA + 3 LOx
1.3797437686898082E-006 I2O3 --> 2 ISALC + 3 LOx
1.9340720969139444E-004 I2O4 --> 2 AERI + 4 LOx
3.5637175014994857E-006 I2O4 --> 2 ISALA + 4 LOx
1.3474520299088180E-006 I2O4 --> 2 ISALC + 4 LOx
1.3372563189859506E-021 IONO2 + H2O --> HOI + HNO3
1.2861851491670647E-016 IONO + BrSALA --> LOx + IBr + HNO2
145321105.36913359 IONO + BrSALC --> LOx + IBr + HNO2
3.3944431861508362E-012 IONO + SALACL --> LOx + ICl + HNO2
7.1816909520611817E-016 IONO + SALCCL --> LOx + ICl + HNO2
6.2077319451027596E-017 IONO2 + BrSALA --> LOx + IBr + HNO3
476876756.45259035 IONO2 + BrSALC --> LOx + IBr + HNO3
1.6383172684081340E-012 IONO2 + SALACL --> LOx + ICl + HNO3
3.6525271819296414E-016 IONO2 + SALCCL --> LOx + ICl + HNO3
Most hetchem rates should be small (definitely less than 1, maybe in the range 1e-5 to 1e-20). The rates that are very much bigger than 1 are too high.
These rates were blowing up because BrSALC had a very low concentration (1e-16 molec/cm3), which caused a division in rate law routine kIIR1Ltd
to result in a very large reaction rate. The division in question is here:
! Prevent div by zero. NOTE: Now use 1.0e-20 as the error trap for
! concEduct. 100 (the previous criterion) was too large. We should
! never have have a concentration as low as 1e-20 molec/cm3, as that
! will definitely blow up the division. (bmy, 9/20/21)
IF ( concEduct < 1.0e-20_dp ) RETURN
I think it comes down to this error trap. It was letting BrSALC = 1e-16 molec/cm3 through. I will try again with this error criterion changed to 1e-8 to see if the run gets through.
I should also mention, the original error criterion in gckpp_HetRates.F90
was IF ( concEduct < 100 ) RETURN
. But this was too high and some reactions that should have gotten done were flushed to zero. I had tried to make this error criterion more generous but now I think it needs to be tightened up, as mentioned above.
Aha! I vaguely remember that concEduct
setting - I think I'm responsible for setting it to 100. I'm surprised that 100 molec/cm3 was too high - even at 50 km the air density is 7.7e16 molec/cm3, so a concentration of 100 molec/cm3 is around 1 part per quadrillion.
This has been addressed with commits: https://github.com/geoschem/geos-chem/pull/985/commits/d19c493304e909ea3985c11ecd1cfec592b39190 and https://github.com/geoschem/geos-chem/pull/985/commits/7e2be0db4cb484bf174f3730c6c8d7c346cba67b. These have now been merged into our 13.3.0 release development stream. We will run benchmarks for GCC_13.3.0-rc.1 and GCHP_13.3.0-rc.1.
The GCClassic 13.3.0 1-month benchmark finished, but GCHP crashed 10 days in with an FPE error. Both simulations use gcc 10.2.0 as the compiler.
Looking back at the internal benchmarks for 13.3.0, this error may have originated in 13.3.0-alpha.9 with the updates for IONO2 hydrolysis. That run failed after 5 simulation days with an integrator error.