Inconsistent NetCDF errors when submitting more than one run from the same GCHP build [BUG/ISSUE]

What institution are you from?

Portland State University's Center for Climate & Aerosol Research

Description of the problem

I am submitting multiple jobs, each for a different year, within two time periods: 2003 to 2005 and 2017 to 2019. My current GCHP build works without any issues for the years 2018 and 2019, but when I submit more than one job at a time I get a series of runtime errors involving variable attributes in my HEMCO data directories. Because the jobs run from the same build, I created an individual ExtData folder for each of the years listed above so that no overlap should occur. Attached below are the log files for the 2003 run I submitted a few days ago; this run seemingly failed the fastest, while the 2004 and 2005 runs had already been running for hours at that point.

Additionally, these runs are a prelude to another task: iterating a series of jobs over a target time period, where each job starts one month later than the previous one in the series (n+1) and all jobs share the same end time. I wrote a Python script that I hope will perform this task, but when I run it, only the series of directories is created and no jobs are submitted (see below):

import os

##-> Primary Workspace
#->  Create GCHP output directories for each run accordingly to their respective starting date
Root_Path = '/home/jaljbour/GCHP_NEW/ExtData/clean_run_C24_TT/OutputDir/'
Dir_List = ['Start_01_2017', 'Start_02_2017', 'Start_03_2017', 'Start_04_2017', 'Start_05_2017', 'Start_06_2017', 'Start_07_2017', 'Start_08_2017', 'Start_09_2017', 'Start_10_2017', 'Start_11_2017', 'Start_12_2017',
        'Start_01_2018', 'Start_02_2018', 'Start_03_2018', 'Start_04_2018', 'Start_05_2018', 'Start_06_2018', 'Start_07_2018', 'Start_08_2018', 'Start_09_2018', 'Start_10_2018', 'Start_11_2018', 'Start_12_2018',
        'Start_01_2019', 'Start_02_2019', 'Start_03_2019', 'Start_04_2019', 'Start_05_2019', 'Start_06_2019', 'Start_07_2019', 'Start_08_2019', 'Start_09_2019', 'Start_10_2019', 'Start_11_2019', 'Start_12_2019']
for items in Dir_List:
    path = os.path.join(Root_Path, items)
    os.mkdir(path)
break

#->  Set number of jobs to be submitted, point to respective directories, and loop 'sed' edits of the runConfig.sh for each respective run
Num_Jobs = 36

StrTi_List = ['20170101 000000', '20170201 000000', '20170301 000000', '20170401 000000', '20170501 000000', '20170601 000000', '20170701 000000', '20170801 000000', '20170901 000000', '20171001 000000', '20171101 000000', '20171201 000000',
           '20180101 000000', '20180201 000000', '20180301 000000', '20180401 000000', '20180501 000000', '20180601 000000', '20180701 000000', '20180801 000000', '20180901 000000', '20181001 000000', '20181101 000000', '20181201 000000',
           '20190101 000000', '20190201 000000', '20190301 000000', '20190401 000000', '20190501 000000', '20190601 000000', '20190701 000000', '20190801 000000', '20190901 000000', '20191001 000000', '20191101 000000', '20191201 000000',]

for j in range(0, Num_Jobs):
    sed_1 = "sed -i 's|Start_Time=.*|Start_Time=\"%s\"|' ./runConfig.sh" % StrTi_List[j]      # Edit start month in runConfig.sh
    os.system(sed_1)
    sed_2 = "sed -i 's|EXPID:.*|EXPID:\"%s\"|' ./HISTORY.rc" % Dir_List[j]                    # Edit output directory in HISTORY.rc
    os.system(sed_2)
    os.system('#SBATCH' %gchp.runscript)                                                      # Submit array of jobs 
break
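For reference, here is a minimal sketch of how the two lists and the submit step in the script above could be generated. It assumes the scheduler is Slurm, whose submit command is `sbatch` (`#SBATCH` is the prefix for directives inside a batch script, not a command). The run script name `gchp.run` and the helper names are placeholders, not taken from the thread. With `dry_run=True` the command is only returned, not executed.

```python
import subprocess


def monthly_starts(first_year, last_year):
    """Return (dir_name, start_time) pairs, one per month in the year range."""
    pairs = []
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            dir_name = f"Start_{month:02d}_{year}"
            start_time = f"{year}{month:02d}01 000000"
            pairs.append((dir_name, start_time))
    return pairs


def submit(run_dir, run_script="gchp.run", dry_run=True):
    """Build the sbatch command; run it from run_dir unless dry_run is set."""
    cmd = ["sbatch", run_script]  # run_script is a hypothetical file name
    if not dry_run:
        subprocess.run(cmd, cwd=run_dir, check=True)
    return cmd


for name, start in monthly_starts(2017, 2019)[:3]:
    print(name, start)
```

Generating the names and start times from the same loop keeps the two lists in step, which is harder to guarantee when both are typed out by hand.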

Description of troubleshooting performed

I tried to trace error 101 in the NetCDF4_FileFormatter.F90 file and error 33 in the NetCDF4_get_var.H file, but neither attempt got me far. In the run output log, you can see that some (not all) of the referenced files are housed in 2018 folders; could that be part of the problem? For now, I am making a list of all of the referenced files to look for any overlap or trend that could be causing these runtime errors.

GEOS-Chem version

GEOS-Chem GCHP v13.0.2

Log files

Build log (if applicable): HEMCO_Config.rc.txt ExtData.rc.txt HEMCO_Diagn.rc.txt
Run logs (if applicable): gchp_5658756.out.txt allPEs.log HEMCO.log
Run script (if applicable): runConfig.sh.txt

Software versions

module load gcc-9.2.0
module load openmpi-3.0.1/gcc-9.2.0
module load netcdf/netcdf-c-4.8.0/gcc-9.2.0
module load netcdf/netcdf-cxx/4.3.1/gcc-9.2.0
module load netcdf/netcdf-fortran/4.5.3/gcc-9.2.0
module load /scratch/cbuten/GCHP/esmf_8.1.1_gcc9.2.0
module load General/hdf5/1.12.1/gcc-9.2.0
module load cmake/3.21.0/gcc-9.2.0

Hi @JbourJabber, you need to have different run directories for different jobs, unless you run serially in each run directory. This is because when you submit a batch job you are submitting the run script, not snapshots of all of the config files at the time you submit. Since you don't know when the job will be picked up, all the config files need to stay static until then. There may be ways to hack this but we don't recommend it.

Please be aware that there are updates in 14.0 that make submitting consecutive jobs easier. That is not what you are doing here, but you may enjoy the run directory updates in that version.
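The advice above about one run directory per job could look roughly like the following sketch. It copies a template run directory and edits `Start_Time` only in the copy, so the files a queued job will read are never changed afterwards. The function name and the directory layout are assumptions for illustration. The `Start_Time="..."` line format is taken from the `sed` command in the original script.

```python
import shutil
from pathlib import Path


def clone_run_dir(template, dest_root, name, start_time):
    """Copy a template run directory and set Start_Time in the copy only."""
    dest = Path(dest_root) / name
    # symlinks=True keeps links (for example to shared data) as links
    shutil.copytree(template, dest, symlinks=True)

    config = dest / "runConfig.sh"
    new_lines = []
    for line in config.read_text().splitlines():
        if line.startswith("Start_Time="):
            line = f'Start_Time="{start_time}"'
        new_lines.append(line)
    config.write_text("\n".join(new_lines) + "\n")
    return dest
```

Each job would then be submitted from inside its own copy, and the template itself stays untouched.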

Thank you for that insight regarding the run script, that is very helpful.

Regarding part 1 of the problem description, the recurring errors take place even though different run directories are used. For example, see the image below (Separate Run Directories):

clean_run_C24_C = 2018 Run Dir
clean_run_C24_C_TT_II = 2019 Run Dir
clean_run_C24_C_TT = 2017 Run Dir
clean_run_C24_C_2004 = 2004 Run Dir
clean_run_C24_C_2005 = 2005 Run Dir
clean_run_C24_C_2003 = 2003 Run Dir

Each subdirectory is its own run directory for the respective year, and the issue arises when I submit more than one job at the same time. For example, the log files I initially attached were for the 2003 job, which randomly failed when I tried to run another job from the 2005 run directory.

Hi @JbourJabber, this may be an issue with your system. Do you have a system administrator you could talk to about submitting more than one job at once?

I am closing out this issue as it seems to be a system or local scripting problem.

Hi,

I'm working with @JbourJabber to diagnose why GCHP crashes when two jobs run simultaneously in two different run directories. The error occurs in ./src/MAPL/pfio/NetCDF4_FileFormatter.F90, in the subroutine open:

subroutine open(this, file, mode, unusable, comm, info, rc)
      class (NetCDF4_FileFormatter), intent(inout) :: this
      character(len=*), intent(in) :: file
      integer, intent(in) :: mode
      class (KeywordEnforcer), optional, intent(in) :: unusable
      integer, optional, intent(in) :: comm
      integer, optional, intent(in) :: info
      integer, optional, intent(out) :: rc

      integer :: omode
      integer :: status

      select case (mode)
      case (pFIO_READ)
         omode = NF90_NOWRITE
      case (pFIO_WRITE)
         omode = NF90_WRITE
      case default
         _ASSERT(.false.,"read or write mode")
      end select

      if (present(comm)) then
         this%comm = comm
         this%parallel=.true.
      end if

      if (present(info)) then
         this%info = info
      else
         this%info = MPI_INFO_NULL
      end if

      if (this%parallel) then
         !$omp critical
         status = nf90_open(file, IOR(omode, NF90_MPIIO), comm=this%comm, info=this%info, ncid=this%ncid)
         !$omp end critical
         _VERIFY(status)
      else
         !$omp critical
         status = nf90_open(file, IOR(omode, NF90_SHARE), this%ncid)
         !$omp end critical
         _VERIFY(status)
      end if

      _RETURN(_SUCCESS)
      _UNUSED_DUMMY(unusable)
   end subroutine open

I verified that this%parallel is F in this IF-block:

if (this%parallel) then
         !$omp critical
         status = nf90_open(file, IOR(omode, NF90_MPIIO), comm=this%comm, info=this%info, ncid=this%ncid)
         !$omp end critical
         _VERIFY(status)
      else
         !$omp critical
         status = nf90_open(file, IOR(omode, NF90_SHARE), this%ncid)
         !$omp end critical
         _VERIFY(status)
      end if

so the netCDF file is opened as: status = nf90_open(file, IOR(omode, NF90_SHARE), this%ncid)
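The mode argument in that call can be checked by hand. In the netCDF-C header the relevant flags are NC_NOWRITE = 0x0000, NC_WRITE = 0x0001, NC_SHARE = 0x0800 and NC_MPIIO = 0x2000, and to my knowledge the NF90_ constants mirror them. The sketch below redoes the bitwise OR in Python. The constants are copied in by hand, so treat them as an assumption to verify against the installed netcdf.h, and `open_mode` is a made-up helper that only mimics the two branches of the IF-block.

```python
# Mode flags as defined in netcdf.h (values assumed, copied by hand)
NF90_NOWRITE = 0x0000
NF90_WRITE = 0x0001
NF90_SHARE = 0x0800
NF90_MPIIO = 0x2000


def open_mode(write, parallel):
    """Mimic the mode passed to nf90_open in the two branches of open()."""
    omode = NF90_WRITE if write else NF90_NOWRITE
    extra = NF90_MPIIO if parallel else NF90_SHARE
    return omode | extra


print(open_mode(write=False, parallel=False))  # 2048
```

For a read-only, non-parallel open the result is simply NF90_SHARE, since OR-ing with NF90_NOWRITE (zero) changes nothing.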

When the error is triggered, nf90_open returns -101, which according to the netCDF header file comes from the HDF5 layer. We have confirmed the error is not in the netCDF file itself, both by inspecting the file with ncview and ncdump and because a single GCHP job ends successfully.

IOR(omode, NF90_SHARE) returns 2048, the value of NF90_SHARE, so nf90_open opens the netCDF file in NF90_SHARE mode, which according to the nf90_open docs is:

"The NF90_SHARE flag is appropriate when one process may be writing the dataset and one or more other processes reading the dataset concurrently (note that this is not the same as parallel I/O); it means that dataset accesses are not buffered and caching is limited. Since the buffering scheme is optimized for sequential access, programs that do not access data sequentially may see some performance improvement by setting the NF90_SHARE flag."

Through internet research I learned that some operating systems may lock a file when a process opens it, preventing other processes from opening it, in order to avoid simultaneous writes and data corruption. I wonder if that might be happening here, although according to the description of the NF90_SHARE flag above, this mode is not the same as parallel I/O. In addition, this branch of the if-clause is only entered if this%parallel is F. However, we are running the job in parallel on multiple cores using OpenMPI, so I was surprised at first that this%parallel is F; I now suspect this%parallel is set based on whether the netCDF libraries are configured for parallel I/O. Can you confirm this? In subroutine open above you can see that this%parallel is set based on the output of present(comm), but I was not able to determine where comm is defined.
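To make the locking idea concrete, here is a small, Unix-only sketch of advisory file locking using `flock` from Python's standard library. It only illustrates the general mechanism: one open handle holds an exclusive lock, and a second handle on the same file is refused a non-blocking lock. Whether HDF5 or netCDF takes such locks in this setup is not established anywhere in this thread, and the function name is made up for the example.

```python
import fcntl
import tempfile


def second_lock_is_refused(path):
    """Hold an exclusive lock on one handle, then try a second handle."""
    with open(path, "r+b") as first, open(path, "r+b") as second:
        fcntl.flock(first, fcntl.LOCK_EX)
        try:
            # LOCK_NB makes the call fail immediately instead of waiting
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        else:
            return False
        finally:
            fcntl.flock(first, fcntl.LOCK_UN)


with tempfile.NamedTemporaryFile() as tmp:
    print(second_lock_is_refused(tmp.name))
```

On a typical local Linux filesystem I would expect this to report that the second lock is refused. Network filesystems can behave differently, which may matter on a cluster.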

I next looked at the CMake configuration files to see whether CMake determined that our system's netCDF libraries support parallel I/O. In the file CMakeCache.txt in the build directory I find these lines:

//NETCDF library compiled with parallel IO support
NETCDF_IS_PARALLEL:BOOL=TRUE

so it appears that the check done by cmake does verify parallel netCDF libraries. But if so, why is the 'this%parallel' flag above false?

To independently verify that our system's netCDF libraries support parallel I/O, I ran the netcdf utility nc-config --has-parallel, which indicates in fact that they aren't configured for parallel I/O. So why is cmake indicating they are?
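The same check can be scripted, for example to record it alongside a build. This is a minimal sketch that shells out to `nc-config` with the two flags mentioned in this post (`--has-parallel` and `--has-pnetcdf`). The helper names are made up, and the `run` parameter exists only so the parsing can be exercised without netCDF installed.

```python
import subprocess


def parse_yes_no(text):
    """Interpret the yes/no answer printed by nc-config."""
    return text.strip().lower() == "yes"


def nc_config_has(feature, run=subprocess.run):
    """Ask nc-config whether a feature (e.g. 'parallel') is enabled."""
    result = run(
        ["nc-config", f"--has-{feature}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_yes_no(result.stdout)


# Example (requires nc-config on PATH):
#   nc_config_has("parallel"), nc_config_has("pnetcdf")
```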

I looked at the test that CMake runs to determine support for parallel I/O. It's in .Code.GCHP/ESMA_cmake/ecbuild/cmake/contrib/FindNetCDF4.cmake:

if(${output} STREQUAL yes)
  set(HAS_HDF5 TRUE)
  set(HDF5_FIND_QUIETLY ${NETCDF_FIND_QUIETLY})
  set(HDF5_FIND_REQUIRED ${NETCDF_FIND_REQUIRED})
  find_package(HDF5)
#        list( APPEND NETCDF_LIBRARIES_DEBUG
#            ${HDF5_LIBRARIES_DEBUG} )
#        list( APPEND NETCDF_LIBRARIES_RELEASE
#            ${HDF5_LIBRARIES_RELEASE} )
  set (NETCDF_IS_PARALLEL ${HDF5_IS_PARALLEL})
endif()
_NETCDF_CONFIG (--has-pnetcdf output return)
if(${output} STREQUAL yes)
  set (NETCDF_IS_PARALLEL TRUE)
else()
#   set(NETCDF_IS_PARALLEL FALSE)
endif()
set( NETCDF_IS_PARALLEL TRUE CACHE BOOL
    "NETCDF library compiled with parallel IO support" )

NETCDF_IS_PARALLEL is set to TRUE if nc-config --has-pnetcdf returns 'yes', but isn't set to anything if it returns 'no', since that line is commented out. Above that, however, NETCDF_IS_PARALLEL is set to ${HDF5_IS_PARALLEL}. Since nc-config will return 'no' in our case, NETCDF_IS_PARALLEL must be set to TRUE based on HDF5_IS_PARALLEL. I tried to find where this variable is defined but couldn't locate it. I don't know why HDF5_IS_PARALLEL would be set to TRUE, because I also verified that the HDF5 libraries on our system don't support parallel I/O either.
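Another way to read the CMake excerpt is to trace its assignments by hand. The toy Python model below mirrors only the lines shown. It is not a full account of CMake variable scoping, and it assumes the first check tests for HDF5 support, which the excerpt does not show. Under those assumptions it suggests that the CMakeCache.txt entry would read TRUE on a first configure whatever the probes returned, because the final `set( ... CACHE BOOL ...)` passes the literal TRUE rather than the detected value.

```python
def trace_find_netcdf4(has_hdf5, hdf5_is_parallel, has_pnetcdf):
    """Return (detected value, value written to the cache) for the excerpt."""
    detected = None  # the normal variable starts out unset
    if has_hdf5:
        detected = hdf5_is_parallel  # set (NETCDF_IS_PARALLEL ${HDF5_IS_PARALLEL})
    if has_pnetcdf:
        detected = True  # set (NETCDF_IS_PARALLEL TRUE)
    # The else() branch is commented out, so nothing resets it to FALSE.
    cached = True  # set( NETCDF_IS_PARALLEL TRUE CACHE BOOL "..." )
    return detected, cached


# Serial HDF5 and no PnetCDF, as described for this system:
print(trace_find_netcdf4(True, False, False))  # (False, True)
```

If this reading holds, the cache line by itself would not show that parallel I/O was detected.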

My best guess at this point is that GCHP requires netCDF with support for parallel I/O, although I couldn't find this in the GCHP docs. I can't completely reconcile this with the code in subroutine open above, which seems to check for parallel I/O and opens netCDF files in NF90_SHARE mode if it is not available.

Apologies for the message length but I wanted to fully document what I found. Many thanks for any insights you can provide.
