geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

Stretched Grid Runs Failing with "Error calling DO_WETDEP" #402

Closed anasali1999 closed 1 day ago

anasali1999 commented 1 month ago

Name and Institution (Required)

Name: S. M. Anas Ali (preferred name Anas)
Institution: University of Toronto, Department of Physics

Description of your issue or question

I am working with GCHP 14.1.1 to perform stretched-grid runs. I am regridding a C48 cubed-sphere restart file to stretched-grid restart files with different stretch factors. I was able to regrid the file with a stretch factor of 2.0 and run the model successfully, but all runs with other stretch factors have failed within 6-11 minutes.

The failed runs all have "Error calling DO_WETDEP" at the end of their slurm outputs. I have no idea why the run with a stretch factor of 2.0 worked but the others did not, as I used the same regridding commands and the same original restart file. Inspecting the files with ncdump -h filename shows the correct target lat, target lon, and stretch factor for all the regridded restart files. I even used the original restart file to perform a regular (unstretched) run, with no errors.

Here are the relevant files: do_wetdep_error.zip

Here is an example of my regridding commands:

source activate my_regridding_environment

gridspec-create gcs 48

gridspec-create sgcs 48 -s 4.0 -t 53.2 -126.3

ESMF_RegridWeightGen                              \
  --source      c48_gridspec.nc                     \
  --destination c48_s4d00_tc1qntuz0kwy1_gridspec.nc \
  --method      conserve                            \
  --weight      c48_to_c48_stretched_weights.nc

conda deactivate

source activate my_gcpy_environment

python -m gcpy.regrid_restart_file       \
   --stretched-grid                        \
   --stretch-factor 4.0                    \
   --target-latitude 53.2                  \
   --target-longitude -126.3              \
   GEOSChem.Restart.20180801_0000z.c48.nc4 \
   c48_to_c48_stretched_weights.nc         \
   GEOSChem.Restart.20180801_0000z.c48.nc4

Additional note: when re-compiling (with "make -j" in my build directory), I saw the following output with MECH as fullchem and USE_REAL8 as the only ON setting:

GEOS-Chem 14.1.1 (science codebase)
Current status: 14.1.1
=============================
-- Settings:
  * MECH:     fullchem  carbon  custom
  * OMP:      ON  OFF
  * USE_REAL8:    ON  OFF
  * APM:      ON  OFF
  * RRTMG:    ON  OFF
  * GTMM:     ON  OFF
  * LUO_WETDEP:   ON  OFF

Below are some of the relevant portions of the slurm output, log files, and setCommonRunSettings. Full files are attached as well. Some typical slurm output for a failed run:

+ mpirun -np 120 ./gchp
pe=00107 FAIL at line=01368    gchp_chunk_mod.F90                       <Error calling DO_WETDEP>
pe=00107 FAIL at line=02809    Chem_GridCompMod.F90                     <status=1>
pe=00107 FAIL at line=01916    Chem_GridCompMod.F90                     <status=1>
pe=00107 FAIL at line=01807    MAPL_Generic.F90                         <status=1>
pe=00107 FAIL at line=00556    GCHP_GridCompMod.F90                     <status=1>
pe=00107 FAIL at line=01807    MAPL_Generic.F90                         <status=1>
pe=00107 FAIL at line=01308    MAPL_CapGridComp.F90                     <status=1>
pe=00107 FAIL at line=01260    MAPL_CapGridComp.F90                     <status=1>
pe=00107 FAIL at line=00837    MAPL_CapGridComp.F90                     <status=1>
pe=00107 FAIL at line=00977    MAPL_CapGridComp.F90                     <status=1>
pe=00107 FAIL at line=00301    MAPL_Cap.F90                             <status=1>
pe=00107 FAIL at line=00258    MAPL_Cap.F90                             <status=1>
pe=00107 FAIL at line=00192    MAPL_Cap.F90                             <status=1>
pe=00107 FAIL at line=00169    MAPL_Cap.F90                             <status=1>
pe=00107 FAIL at line=00031    GCHPctm.F90                              <status=1>
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 107 in communicator MPI_COMM_WORLD
with errorcode 113546496.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

real    8m54.366s
user    346m45.490s
sys     5m29.440s
++ sed 's/ /_/g' cap_restart
+ new_start_str=20180801_000000
+ [[ 20180801_000000 = \2\0\1\8\0\8\0\1\_\0\0\0\0\0\0 ]]
+ echo 'ERROR: GCHP failed to run to completion. Check the log file for more information.'
ERROR: GCHP failed to run to completion. Check the log file for more information.
+ exit 1
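
The `pe=... FAIL at line=...` lines above read as a call chain, with the first line per rank being the innermost failure. A minimal Python sketch for pulling that first entry out of a long slurm output follows. The function name `innermost_failures` and the regular expression are illustrative assumptions inferred only from the line format shown above, not from MAPL documentation.

```python
import re

# Pattern inferred from the log lines quoted above (an assumption, not a MAPL spec).
FAIL_RE = re.compile(r"pe=(\d+) FAIL at line=(\d+)\s+(\S+)\s+<(.*)>")

def innermost_failures(text):
    """Return {rank: (source_file, line_number, message)} for the first FAIL line per rank."""
    first = {}
    for line in text.splitlines():
        match = FAIL_RE.search(line)
        if match is None:
            continue
        rank = int(match.group(1))
        # setdefault keeps only the first entry seen for each rank (the innermost frame).
        first.setdefault(rank, (match.group(3), int(match.group(2)), match.group(4)))
    return first

sample = """\
pe=00107 FAIL at line=01368    gchp_chunk_mod.F90                       <Error calling DO_WETDEP>
pe=00107 FAIL at line=02809    Chem_GridCompMod.F90                     <status=1>
"""
print(innermost_failures(sample))
```

For the excerpt above this points at gchp_chunk_mod.F90 on rank 107, which matches the rank number in the GEOS-Chem error messages in the log file.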

Some gchp.log.new output:

===============================================================================
WETDEP: ERROR at    4  12   1 for species    2 in area WASHOUT: at surface
 LS          :  T
 PDOWN       :    0.0000000000000000
 QQ          :    0.0000000000000000
 ALPHA       :    0.0000000000000000
 ALPHA2      :    0.0000000000000000
 RAINFRAC    :    0.0000000000000000
 WASHFRAC    :    0.0000000000000000
 MASS_WASH   :    0.0000000000000000
 MASS_NOWASH :    0.0000000000000000
 WETLOSS     :                        NaN
 GAINED      :    0.0000000000000000
 LOST        :    0.0000000000000000
 DSpc(NW,:)  :                        NaN   5.7426397845368100E-013   6.3316101535418800E-013   7.0609370046402670E-013   7.8934588994995541E-013   8.8164266247687730E$
 Spc(I,J,:N) :                        NaN   8.2030026393427947E-008   8.3693190198667430E-008   8.6298355806063208E-008   8.7578435072802450E-008   8.4627636921406938E$
===============================================================================

GEOS-Chem ERROR [0107]: Error encountered in wet deposition!
 --> LOCATION:  -> at SAFETY (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0107]: Error encountered in "Safety"!
 --> LOCATION:  -> at Do_Washout_at_Sfc (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0107]:
 --> LOCATION:  -> at WetDep (in module GeosCore/wetscav_mod.F90)

GEOS-Chem ERROR [0107]: Error encountered in "Wetdep"!
 --> LOCATION:  -> at Do_WetDep (in module GeosCore/wetscav_mod.F90)
     - DO_LINEAR_CHEM: Linearized chemistry at 2018/08/01 10:10
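
The WETDEP error block above can be summarized mechanically when many ranks or runs need to be compared. The following is a rough Python sketch, assuming the three integers after "ERROR at" are grid indices (presumably I, J, L) and that scalar fields appear one per line as `NAME : value`. The function name `parse_wetdep_error` and the returned keys are hypothetical; array rows such as `DSpc` and `Spc` are deliberately skipped.

```python
import math
import re

# Header format inferred from the log excerpt above (an assumption).
HEADER_RE = re.compile(
    r"WETDEP: ERROR at\s+(\d+)\s+(\d+)\s+(\d+)\s+for species\s+(\d+)\s+in area\s+(.*)"
)

def parse_wetdep_error(block):
    """Extract indices, species number, area label, and the scalar fields whose value is NaN."""
    info = {"nan_fields": []}
    for line in block.splitlines():
        header = HEADER_RE.search(line)
        if header:
            i, j, lev, species = (int(g) for g in header.groups()[:4])
            info.update(i=i, j=j, lev=lev, species=species, area=header.group(5).strip())
            continue
        name, sep, value = line.partition(":")
        tokens = value.split()
        if sep and len(tokens) == 1:
            try:
                number = float(tokens[0])
            except ValueError:
                continue  # non-numeric value such as the logical "T"
            if math.isnan(number):
                info["nan_fields"].append(name.strip())
    return info

sample = """\
WETDEP: ERROR at    4  12   1 for species    2 in area WASHOUT: at surface
 LS          :  T
 PDOWN       :    0.0000000000000000
 WETLOSS     :                        NaN
"""
print(parse_wetdep_error(sample))
```

On the excerpt above, the only scalar field reported as NaN is WETLOSS, for species 2 in the "WASHOUT: at surface" area.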
     - LINOZ_CHEM3: Doing LINOZ

Relevant sections of my setCommonRunSettings.sh

#------------------------------------------------
#   COMPUTE RESOURCES
#------------------------------------------------
# Total cores must be divisible by 6
TOTAL_CORES=120
NUM_NODES=3
NUM_CORES_PER_NODE=40

#------------------------------------------------
#   GRID RESOLUTION
#------------------------------------------------
# Integer representing number of grid cells per cubed-sphere face side
CS_RES=48

#------------------------------------------------
#   STRETCHED GRID
#------------------------------------------------
# Turn stretched grid ON/OFF. Follow these rules if ON:
#    (1) Minimum STRETCH_FACTOR value is 1.0001
#    (2) TARGET_LAT and TARGET_LON are floats containing decimal
#    (3) TARGET_LON in range [0,360)
STRETCH_GRID=ON
STRETCH_FACTOR=4.0
TARGET_LAT=53.2
TARGET_LON=-126.3

#------------------------------------------------
#    SIMULATION DURATION
#------------------------------------------------
# Format is "YYYYMMDD HHmmSS". Example: "00000100 000000" for 1 month
Run_Duration="00000100 000000"

#------------------------------------------------------------
#    GEOS-CHEM COMPONENTS
#------------------------------------------------------------
# Sets values in geoschem_config.yml
Do_Chemistry=true
Do_Advection=true
Do_Cloud_Conv=true
Do_PBL_Mixing=true
Do_Non_Local_Mixing=true
Do_DryDep=true
Do_WetDep=true

#---------------------------------------------------------------------
#    DIAGNOSTICS
#---------------------------------------------------------------------
# Auto-update settings in HISTORY.rc for specific collections (enable with ON)
AutoUpdate_Diagnostics=ON

# Instructions to auto-update diagnostics
#   1. Set AutoUpdate_Diagnostics=ON:
#   2. Set Diag_Monthly to compute monthly time-averaged values (0=OFF, 1=ON)
#   3. If Diag_Monthly=OFF:
#        3a. Set Diag_Frequency for diagnostic frequency, format "HHmmSS"
#        3b. Set Diag_Duration for file write frequency, format "HHmmSS"
#        *Note that number of hours may exceed 2 digits, e.g. 744 for 744 hrs
#   4. Edit Diag_Collections list to specify which collections to update
#
Diag_Monthly="1"
Diag_Frequency="240000"
Diag_Duration="240000"
Diag_Collections=(SpeciesConc    \
                  AerosolMass    \
                  Aerosols   \
                  Budget         \
                  Carbon         \
                  CloudConvFlux  \
                  ConcAfterChem  \
                  DryDep         \
                  DefaultCollection \
                  Emissions  \
                  JValues        \
                  KppDiags   \
                  KppARDiags     \
                  LevelEdgeDiags \
                  Metrics        \
                  ProdLoss   \
                  RadioNuclide   \
                  RRTMG          \
                  StateChm   \
                  StateMet   \
                  StratBM        \
                  Transport  \
                  WetLossConv    \
                  WetLossLS  \
)
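
The comments in the excerpt above state a few rules: total cores divisible by 6, a minimum STRETCH_FACTOR of 1.0001, and TARGET_LON in the range [0,360). As a small sketch, those rules can be checked in Python before submitting a run. The function `check_run_settings` is hypothetical and only encodes the rules quoted in the comments; rule (2) about decimal points concerns how the value is written in the shell script and is not checked here. The thread does not establish whether any of these rules was related to the failure.

```python
def check_run_settings(total_cores, stretch_factor, target_lon):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if total_cores % 6 != 0:
        problems.append("TOTAL_CORES must be divisible by 6")
    if stretch_factor < 1.0001:
        problems.append("STRETCH_FACTOR is below the minimum of 1.0001")
    if not 0.0 <= target_lon < 360.0:
        # Python's % operator wraps negative longitudes into [0, 360).
        wrapped = target_lon % 360.0
        problems.append(
            f"TARGET_LON {target_lon} is outside [0,360); equivalent value is {wrapped:.1f}"
        )
    return problems

# Values taken from the settings shown above.
print(check_run_settings(120, 4.0, -126.3))
```

With the values shown above, the only rule flagged is the longitude range: -126.3 corresponds to 233.7 in the [0,360) convention.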

While I wait for support, I will try recompiling and running the model in a fresh run directory. Thank you everyone!

yantosca commented 1 month ago

Thanks for writing @anasali1999. We were recently informed about a bug in GCPy cubed-sphere regridding. We will bring this into the development branch shortly but in the meantime you may want to apply the fix in https://github.com/geoschem/gcpy/pull/311 and then create a new stretched-grid restart file.

lizziel commented 1 month ago

Hi @anasali1999, thanks for reaching out. Have you validated that your run using a stretch factor of 2.0 compares well with a uniform global run? If yes, I think we can rule out problems with generation of the restart file. I wonder if there is an instability running at higher grid resolution at the particular region of interest. Here are a few things to try:

  1. Visually inspect your restart file to make sure nothing looks wrong
  2. Turn off wet deposition in setCommonRunSettings.sh and rerun to see where it crashes
  3. Reduce the timesteps in setCommonRunSettings.sh. These are automatically changed if the stretched grid high resolution area is above c180. You are approaching that with c172. You can change the threshold in the config to halve both chemistry and dynamic timestep.

anasali1999 commented 1 month ago

Hello everyone, thank you for the suggestions. Sorry I seem to have closed this issue earlier by accident.

@lizziel I reduced all the time steps in setCommonRunSettings.sh by half, but this led to the same DO_WETDEP error.

I then returned the time steps to normal and turned off wet deposition; this time the run failed with "Error calling DO_CONVECTION". Here are all the files for the turned-off wet dep run: wet dep off files.zip

I have yet to inspect the restart files, but I will report back once I compare the working unstretched, working 2.0 factor stretched, and the failing stretched files.
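
One way to compare the restart files beyond visual inspection is to count non-finite values per variable. The following Python sketch is only an illustration: `count_nonfinite` and `iter_float_variables` are hypothetical helper names, the file reader assumes the third-party netCDF4 and numpy packages are installed, and masked (fill-value) entries are treated as NaN, so they are counted too.

```python
import math

def count_nonfinite(named_values):
    """Map each variable name to its number of NaN/Inf values; clean variables are omitted."""
    counts = {}
    for name, values in named_values:
        bad = sum(1 for v in values if not math.isfinite(v))
        if bad:
            counts[name] = bad
    return counts

def iter_float_variables(path):
    """Yield (name, flat list of floats) for each floating-point variable in a netCDF file."""
    import netCDF4  # third-party; imported lazily so the helper above works without it
    import numpy as np

    with netCDF4.Dataset(path, "r") as ds:
        for name, var in ds.variables.items():
            if getattr(var.dtype, "kind", None) != "f":
                continue  # skip integer and string variables
            yield name, np.ma.filled(var[:], np.nan).ravel().tolist()

# Toy data standing in for restart-file variables (hypothetical names and values).
toy = {"SPC_A": [1.0e-9, float("nan"), 2.0e-9], "SPC_B": [3.0e-9, 4.0e-9]}
print(count_nonfinite(toy.items()))

# With a real file, the call would look like:
# count_nonfinite(iter_float_variables("GEOSChem.Restart.20180801_0000z.c48.nc4"))
```

Running the same count on the working and failing restart files would show whether NaN values are already present before the model starts.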

@yantosca This fix looks easy enough for me to try; I will report back once I have applied it.

anasali1999 commented 1 month ago

Hello

@yantosca I applied the changes to GCPy and generated a new restart file, but the run failed again with the same DO_WETDEP error.

lizziel commented 2 weeks ago

Hi @anasali1999, have you looked through the GEOS-Chem debugging guide? It has a specific section on the issue you are encountering in DO_WETDEP. See if there are any suggestions on that page (found here) that you have not yet tried.

anasali1999 commented 1 week ago

Hello, I have seen that section of the guide, but I've been following some other leads in the meantime. I'll try the debugging strategy shown in the guide.

anasali1999 commented 1 day ago

Hello, I am closing this issue; my GCHP setup overall was incorrect.