Closed: kilicomu closed this issue 1 year ago
Also tagging @Jourdan-He and @SaptSinha
Hi @kilicomu, thanks for bringing this up. This looks like a good old-fashioned div-by-zero. We could put an error trap on that:
H%frac_SALACL = 0.0_dp
IF ( C(ind_SALACL) + C(ind_NIT) + C(ind_SO4) > 0.0_dp ) THEN
H%frac_SALACL = C(ind_SALACL) / ( C(ind_SALACL) + C(ind_NIT) + C(ind_SO4) )
ENDIF
Now, as to why GCHP is causing a div-by-zero here, that's another matter. Maybe the scavenging is too vigorous. Or the initial concentration of a species such as SALACL is too low to start with.
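For anyone who wants to try the trap in isolation, here is a minimal standalone sketch of the same guard. The program name, the `safe_fraction` helper, and the test values are made up for illustration and are not part of GEOS-Chem; only the guarded-division pattern matches the snippet above.

```fortran
PROGRAM guard_demo
  IMPLICIT NONE

  ! Double-precision kind, standing in for the model's "dp" parameter
  INTEGER, PARAMETER :: dp = KIND( 1.0d0 )
  REAL(dp)           :: f

  ! All three concentrations are zero: the guard returns 0 instead of NaN
  f = safe_fraction( 0.0_dp, 0.0_dp, 0.0_dp )
  PRINT *, 'all zero : ', f

  ! Nonzero concentrations: the ordinary ratio SALACL / (SALACL+NIT+SO4)
  f = safe_fraction( 1.0_dp, 2.0_dp, 1.0_dp )
  PRINT *, 'nonzero  : ', f

CONTAINS

  ! Hypothetical helper: fraction of SALACL in the SALACL+NIT+SO4 total,
  ! defaulting to zero when the denominator is not positive.
  FUNCTION safe_fraction( salacl, nit, so4 ) RESULT( frac )
    REAL(dp), INTENT(IN) :: salacl, nit, so4
    REAL(dp)             :: frac
    REAL(dp)             :: total

    total = salacl + nit + so4
    frac  = 0.0_dp
    IF ( total > 0.0_dp ) frac = salacl / total
  END FUNCTION safe_fraction

END PROGRAM guard_demo
```

Note that a guard like this only hides the symptom. If the denominator is zero because the concentrations upstream are already wrong, the real problem is still there.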
@yantosca Thanks for looking! Yep, it's a division by 0 - I didn't want to put a trap in without understanding the implications! I'll patch my code and see where that gets me.
The other thing to note is that I can run fine on another machine, which adds to the mystery (at least for me...).
Wow, this is weird. I was going to suggest double-checking that the restarts are OK but it sounds like you've already done that. One thing which springs to mind though - can you provide your run script please, and the log of output from running it? I didn't see it in your zipped run directories, and there were a couple of big changes in the recent GCHP run directory structures which necessitated some reworking of the run script.
I also wonder about your successful run with another machine. Did you use different library versions for that run? Do you have the logs, config files, and build info for that run for comparison?
@yantosca I tried the trap with no luck, however...
For some unknown reason v14 rc2 has now started working on the cluster. The only extra thing I can think of that I have done is update the various OFFLINE_* emissions (we had a few older versions about). I don't know enough to know whether or not that is likely to have helped me with this problem, and I tried reverting each of them back to their previous version (in isolation, not in combination) but wasn't able to reproduce the crash.
I'll do some more testing to make sure that it's running ok. Very confused at the moment.
Okay, v14 is working on our cluster now, so I'll close this off.
Not sure what was causing the issue - if it reappears, I'll reopen / submit a new issue.
Thanks, and keep us posted @kilicomu!
Dear @yantosca
I am also facing the same issue in GCHP 13.4.1
===============================================================================
WETDEP: ERROR at 3 24 71 for species 138 in area RESUSPENSION in middle levels
LS : T
PDOWN : 0.0000000000000000
QQ : 0.0000000000000000
ALPHA : 0.0000000000000000
ALPHA2 : 0.0000000000000000
RAINFRAC : 0.0000000000000000
WASHFRAC : 0.0000000000000000
MASS_WASH : 0.0000000000000000
MASS_NOWASH : 0.0000000000000000
WETLOSS : -0.0000000000000000
GAINED : 0.0000000000000000
LOST : 0.0000000000000000
DSpc(NW,:) : 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000
Spc(I,J,:N) : -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -0.0000000000000000 -9.8013765304884227E-014 -4.0817861128689080E-013 -1.0324533279007332E-012 -2.0711193567688042E-012 -3.6405626033690480E-013 -2.6209416130405968E-013 -2.3673551955976509E-013 -2.3740312524962716E-013 -1.4234488710346750E-012 -1.4509473118661018E-011 -2.3968404449659108E-012 -1.0287849195419953E-012 -4.5864945968005623E-013 -8.7890911707160104E-015 -4.2659806717846077E-014 -4.1756321310777058E-014 -4.0220692421135937E-014 -1.3451420263145457E-013 -3.0000055059266572E-013 -1.9191988990041727E-013 -1.7040181084133550E-014 -6.3353635697744400E-015 -4.1581902659834436E-016 -3.0150800557263397E-017 -2.1140404894559353E-018 -1.3207868296327852E-019 -1.9370730337060598E-019 -7.4754814450147021E-020 -4.6970747174663798E-021 -1.9126806928154054E-022 -3.4985176380119531E-023 -1.6744150964025935E-024 -1.7268226344822395E-025 -7.0195597929206797E-027 -4.3447839439029528E-028 -6.8643451151811298E-030 -7.9654884283347433E-031 -1.3097758150016814E-032 -5.1546556238383985E-034 -6.2718834206183521E-036 -5.2068049614229462E-037 -5.5637034090873920E-037 -8.7537125081611168E-038 -1.4970268187844388E-039 -7.1066636296705220E-040 -6.2930450442075586E-039 -6.1786462623629168E-038 -1.1944436317234292E-036 -6.2755035758343011E-035 -2.3866020461974976E-033 -2.8723629669265353E-033 -2.6752945237784374E-033 -2.6472153664589243E-033 -2.2622804766059453E-033 -1.9050968892677856E-033 -1.4967468261472745E-033 -1.0793752081898327E-033 -9.4102115545591866E-034 -9.3673571409477690E-034 -8.9047746392126991E-034 -8.1099428567308586E-034 -7.1127574568423450E-034
===============================================================================
GEOS-Chem ERROR [0009]: Error encountered in wet deposition!
--> LOCATION: -> at SAFETY (in module GeosCore/wetscav_mod.F90)
GEOS-Chem ERROR [0009]: Error encountered in "Safety"!
--> LOCATION: -> at Do_Complete_Reevap (in module GeosCore/wetscav_mod.F90)
GEOS-Chem ERROR [0009]:
--> LOCATION: -> at WetDep (in module GeosCore/wetscav_mod.F90)
GEOS-Chem ERROR [0009]: Error encountered in "Wetdep"!
--> LOCATION: -> at Do_WetDep (in module GeosCore/wetscav_mod.F90)
pe=00009 FAIL at line=01358 gchp_chunk_mod.F90 <Error calling DO_WETDEP>
pe=00009 FAIL at line=03680 Chem_GridCompMod.F90 <status=1>
pe=00009 FAIL at line=02734 Chem_GridCompMod.F90 <status=1>
pe=00009 FAIL at line=01844 MAPL_Generic.F90 <Error during the 'Run' stage of the gridded component 'GCHPchem'>
pe=00009 FAIL at line=00556 GCHP_GridCompMod.F90 <status=1>
pe=00009 FAIL at line=01844 MAPL_Generic.F90 <Error during the 'Run' stage of the gridded component 'GCHP'>
pe=00009 FAIL at line=01257 MAPL_CapGridComp.F90 <status=1>
pe=00009 FAIL at line=01181 MAPL_CapGridComp.F90 <status=1>
pe=00009 FAIL at line=00804 MAPL_CapGridComp.F90 <status=1>
pe=00009 FAIL at line=00934 MAPL_CapGridComp.F90 <status=1>
pe=00009 FAIL at line=00247 MAPL_Cap.F90 <status=1>
pe=00009 FAIL at line=00211 MAPL_Cap.F90 <status=1>
pe=00009 FAIL at line=00154 MAPL_Cap.F90 <status=1>
pe=00009 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00009 FAIL at line=00031 GCHPctm.F90 <status=1>
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 9 in communicator MPI_COMM_WORLD
with errorcode 0.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
This is an issue with the time step used for the simulation, as suggested in the previous comments.
Earlier, I had been using a 3600/1800 time step. Changing it to 1800/900 solved the issue.
What institution are you from?
Wolfson Atmospheric Chemistry Laboratories
Description of the problem
For a while now I've not been able to run GCHP on my cluster. With MERRA 2, I get the following error when running in release mode:
which looks very similar to this open GC Classic error (and some closed errors).
I know whereabouts in the code it's dying, but it's not helping me figure out what's causing the problem. I've attached four archive runs, two with a debug build and two with a default build (MERRA2 and GEOS-FP). Both archives have all DEBUG level logging turned on and have all the individual PET log files in the Logs directory alongside the GCHP log file.
The debug build with MERRA 2 run fails differently:
which looks relevant to this open HEMCO issue.
With GEOS-FP, the release build of the model dies as follows in the first timestep:
And the same GEOS-FP run, with a debug build:
I have tried:
I'd appreciate another set of eyes on the issues. I guess the error is propagating along from something that I'm missing (hopefully nothing too obvious...), but I'm not sure where that might be!
GEOS-Chem version
Versions greater than 13.3.x (log files generated with v14.0.0-rc.2 c85903e0)
Description of code modifications
None.
Log files
I've attached four archived runs - a debug and release build run for both MERRA2 and GEOS-FP. I'm much more invested in MERRA2 than GEOS-FP, but I've posted both just cause.
MERRA2_DEBUG_FAILURE.tar.gz MERRA2_NON_DEBUG_FAILURE.tar.gz GEOSFP_DEBUG_FAILURE.tar.gz GEOSFP_NON_DEBUG_FAILURE.tar.gz
Software versions