geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

[BUG/ISSUE] Unable to run nested simulation for 13.4.0 #1488

Closed. ktravis213 closed this issue 1 year ago.

ktravis213 commented 2 years ago

What institution are you from?

NASA LaRC

Description of the problem

I am running 14.0.0 at 0.25x0.3125 over Asia, using boundary conditions at 2x2.5. I have not been able to get past LINOZ; see the end of my log file below. The run hangs there until the simulation runs out of time.

Description of troubleshooting performed

I have increased KMP_STACKSIZE from 500m to 800m and set -c to 24. I am not sure what else to do. My nested runs for 13.4.0 ran fine with the run script below.
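
In csh, the relevant settings look like this (values are the ones tried above; OMP_STACKSIZE is the portable OpenMP equivalent of Intel's KMP_STACKSIZE, added here only for illustration):

setenv OMP_NUM_THREADS 24      # should match #SBATCH -c
setenv KMP_STACKSIZE 800m      # Intel-specific per-thread stack size
setenv OMP_STACKSIZE 800m      # portable equivalent; Intel's runtime lets KMP_STACKSIZE take precedence
limit stacksize unlimited      # remove the csh shell stack limit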

GEOS-Chem version

14.0.0

Description of modifications

None

Log files

#!/bin/csh

#SBATCH -A s2199
#SBATCH -J  GC14_nest
#SBATCH -c 24
#SBATCH -N 1
#SBATCH -t 0-20:00
#SBATCH --mem=85000
#SBATCH --mail-type=END
#SBATCH --mail-user=katherine.travis@nasa.gov
#SBATCH -o out.%j #File to which standard log will be written
#SBATCH -e err.%j #File to which standard err will be written
#SBATCH --qos=long

module load cmake/3.21.0
module load comp/intel/19.1.3.304
module load mpi/impi/19.1.3.304
module load wrf-deps/1.0
module load nano/2.6.3

setenv FC ifort
setenv NETCDF_C_ROOT /usr/local/other/wrf-deps/intel-19.1.3.304
setenv NETCDF_Fortran_ROOT /usr/local/other/wrf-deps/intel-19.1.3.304

###############################################################################
### Sample GEOS-Chem run script for SLURM
### You can increase the number of cores with -c and memory with --mem,
### particularly if you are running at very fine resolution (e.g. nested-grid)
###############################################################################
setenv KMP_STACKSIZE 800m

# Set the proper # of threads for OpenMP
# SLURM_CPUS_PER_TASK ensures this matches the number you set with -c above
setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK

limit stacksize     unlimited
limit descriptors   unlimited
limit datasize      unlimited
limit memoryuse     unlimited
limit filesize      unlimited
limit coredumpsize  unlimited
# Run GEOS_Chem.  The "time" command will return CPU and wall times.
# Stdout and stderr will be directed to the "GC.log" log file
# (you can change the log file name below if you wish)
srun -c $OMP_NUM_THREADS time -p ./gcclassic >> GC.log

# Exit normally
exit 0
#EOC

Software versions

yantosca commented 2 years ago

Thanks for writing, @ktravis213. You might need to increase #SBATCH --mem=85000, as the job may be running out of memory; try 95000 or 100000.

Also, do you have the slurm*.out log? That might give you an idea of what caused the run failure.
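
As an illustration only (the exact value and the availability of SLURM accounting depend on the cluster), the memory bump and a quick check of the job's actual peak usage could look like:

#SBATCH --mem=100000                                          # up from 85000
sacct -j <jobid> --format=JobID,State,Elapsed,ReqMem,MaxRSS   # peak memory per step, if accounting is enabled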

ktravis213 commented 2 years ago

Thanks @yantosca. It seems like it has enough memory; here is my .out file. That's why I thought it might have something to do with the stack size.

Job Resource Usage Summary for 64787662

  Job Ended                       : Mon Nov  7 09:15:22 EST 2022

  Partition                       : compute
  Head Node                       : borgu208
  Charged to                      : s2199

  Estimated SBUs                  : 14.91

  Total Mem Requested / Allocated : 83.00G / 83.00G
  Max Real Mem Used (on any node) : 58.43G
  Max Pct Memory Used             : 70.39%

  Walltime Requested              : 20:00:00
  Walltime Used                   : 18:25:06
  Pct Walltime Used               : 92.09%

  CPUs Requested / Allocated      : 24 / 28
  Total CPU-Time Allocated        : 21-11:42:48 (Walltime Used * #CPUs)

  CPU Utilization of __last time cmd only__:
  User CPU                        : 00:00.012 of 18-10:02:00
  System CPU                      : 00:00.008 of 18-10:02:00
  Pct CPU Utilization             : 0%

msulprizio commented 2 years ago

Hi @ktravis213. Can you check your HEMCO_Config.rc.gmao_metfields to see if it's using the global 0.25x0.3125 files or the pre-cropped "AS" files? I discovered this weekend that when selecting one of the supported met fields, we no longer seem to be adding the nested-domain token to the met field path. I wonder if this is causing your issues. At the very least, it will drastically slow down your simulations if you are using the global high-resolution files.
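
For illustration (the directory names and filename tokens below follow the usual GEOS-FP layout and are not taken from this run), the difference to look for is whether the met-field paths point at the global files or the pre-cropped nested-domain files:

ExtData/GEOS_0.25x0.3125/GEOS_FP/2022/01/GEOSFP.20220101.A1.025x03125.nc        # global high-res (much slower to read)
ExtData/GEOS_0.25x0.3125_AS/GEOS_FP/2022/01/GEOSFP.20220101.A1.025x03125.AS.nc  # pre-cropped Asia ("AS") domain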

ktravis213 commented 2 years ago

Thanks @msulprizio. I am using the cropped "CH" files. Maybe I will have to ask my IT folks if there is no obvious solution.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

stale[bot] commented 1 year ago

Closing due to inactivity