Closed JbourJabber closed 1 year ago
Hi @JbourJabber, you need to have different run directories for different jobs, unless you run serially in each run directory. This is because when you submit a batch job you are submitting the run script, not snapshots of all of the config files at the time you submit. Since you don't know when the job will be picked up, all the config files need to stay unchanged until it runs. There may be ways to hack this but we don't recommend it.
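The rule above (one run directory per concurrently running job) can be checked before anything is submitted. The following is only a minimal sketch and not part of GCHP: the function name find_shared_run_dirs and the job labels are hypothetical, and the check does nothing more than compare resolved paths.

```python
from collections import defaultdict
from pathlib import Path


def find_shared_run_dirs(jobs):
    """Return {run_dir: [job labels]} for run directories used by more than one job.

    `jobs` maps a job label to the run directory it will be submitted from.
    Paths are resolved first so that different spellings of the same directory
    (relative paths, "..") are treated as one.
    """
    by_dir = defaultdict(list)
    for label, run_dir in jobs.items():
        by_dir[Path(run_dir).resolve()].append(label)
    return {d: labels for d, labels in by_dir.items() if len(labels) > 1}


if __name__ == "__main__":
    # Hypothetical layout: one sub-directory per simulated year.
    planned = {
        "run_2003": "runs/2003",
        "run_2005": "runs/2005",
        "rerun_2003": "runs/2004/../2003",  # same directory as run_2003
    }
    for run_dir, labels in find_shared_run_dirs(planned).items():
        print(f"{run_dir} is shared by: {', '.join(labels)}")
```

An empty result means every planned job has its own run directory; it says nothing about shared input data outside the run directories.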
Please be aware that there are updates in 14.0 that make submitting consecutive jobs easier. This does not apply to your case, but you may enjoy the run directory updates in that version.
Tagging @Jourdan-He
Thank you for that insight regarding the run script, that is very helpful.
In part 1 of the problem description, the reoccurring errors described take place despite different run directories being utilized. For example, see the image below:
Each sub-directory is its own run directory for the respective year, and where this issue arises is when I submit more than one job all together. For example, the log files I initially attached were for the 2003 job that randomly failed when I tried to run another job from the 2005 run directory.
Hi @JbourJabber, this may be an issue with your system. Do you have a system administrator you could talk to about submitting more than one job at once?
I am closing out this issue as it seems to be a system or local scripting problem.
Hi,
I'm working with @JbourJabber to diagnose why GCHP crashes when two jobs run simultaneously in two different run directories. The error occurs in ./src/MAPL/pfio/NetCDF4_FileFormatter.F90, in the subroutine open:
subroutine open(this, file, mode, unusable, comm, info, rc)
   class (NetCDF4_FileFormatter), intent(inout) :: this
   character(len=*), intent(in) :: file
   integer, intent(in) :: mode
   class (KeywordEnforcer), optional, intent(in) :: unusable
   integer, optional, intent(in) :: comm
   integer, optional, intent(in) :: info
   integer, optional, intent(out) :: rc
   integer :: omode
   integer :: status
   select case (mode)
   case (pFIO_READ)
      omode = NF90_NOWRITE
   case (pFIO_WRITE)
      omode = NF90_WRITE
   case default
      _ASSERT(.false.,"read or write mode")
   end select
   if (present(comm)) then
      this%comm = comm
      this%parallel=.true.
   end if
   if (present(info)) then
      this%info = info
   else
      this%info = MPI_INFO_NULL
   end if
   if (this%parallel) then
      !$omp critical
      status = nf90_open(file, IOR(omode, NF90_MPIIO), comm=this%comm, info=this%info, ncid=this%ncid)
      !$omp end critical
      _VERIFY(status)
   else
      !$omp critical
      status = nf90_open(file, IOR(omode, NF90_SHARE), this%ncid)
      !$omp end critical
      _VERIFY(status)
   end if
   _RETURN(_SUCCESS)
   _UNUSED_DUMMY(unusable)
end subroutine open
I verified that this%parallel is F in this IF-block:
if (this%parallel) then
   !$omp critical
   status = nf90_open(file, IOR(omode, NF90_MPIIO), comm=this%comm, info=this%info, ncid=this%ncid)
   !$omp end critical
   _VERIFY(status)
else
   !$omp critical
   status = nf90_open(file, IOR(omode, NF90_SHARE), this%ncid)
   !$omp end critical
   _VERIFY(status)
end if
so the netCDF file is opened as:
status = nf90_open(file, IOR(omode, NF90_SHARE), this%ncid)
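The mode argument in this call can be worked out numerically. The sketch below assumes the flag values from the netCDF-C header (NC_NOWRITE = 0, NC_WRITE = 1, NC_SHARE = 0x0800), which the Fortran NF90_ parameters mirror; the Python names are stand-ins for illustration, not a netCDF API.

```python
# Flag values as defined in netcdf.h; the NF90_* Fortran parameters use the same numbers.
NF90_NOWRITE = 0x0000
NF90_WRITE = 0x0001
NF90_SHARE = 0x0800


def open_mode(omode):
    """Mimic IOR(omode, NF90_SHARE) from the non-parallel branch of open()."""
    return omode | NF90_SHARE


if __name__ == "__main__":
    print(open_mode(NF90_NOWRITE))  # read-only: only the share bit is set
    print(open_mode(NF90_WRITE))    # write: write bit plus share bit
```

For a file opened with pFIO_READ, omode is NF90_NOWRITE, so the combined mode is just the share bit.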
When the error is triggered, nf90_open returns -101, which according to the netCDF header file is an error from the HDF5 layer. We have confirmed the error is not in the netCDF file, both through inspection of the file with ncview and ncdump, and because a single GCHP job ends successfully.
IOR(omode, NF90_SHARE) returns 2048, the value of NF90_SHARE, so nf90_open opens the netCDF file in NF90_SHARE mode, which according to the nf90_open docs is:
"The NF90_SHARE flag is appropriate when one process may be writing the dataset and one or more other processes reading the dataset concurrently (note that this is not the same as parallel I/O); it means that dataset accesses are not buffered and caching is limited. Since the buffering scheme is optimized for sequential access, programs that do not access data sequentially may see some performance improvement by setting the NF90_SHARE flag."
Through internet research I learned that some operating systems may lock a file when a process opens it, to prevent simultaneous writes and data corruption by another process. I wonder if that might be happening here, although according to the description of the NF90_SHARE flag above, this mode is not the same as parallel I/O. In addition, this branch of the if-block is only entered if this%parallel is F. However, we are running the job in parallel on multiple cores using openmpi, so I was surprised at first that this%parallel is F, but now I suspect this%parallel is set based on whether the netCDF libraries are configured for parallel I/O. Can you confirm this? In subroutine open above you can see that this%parallel is set based on the output of present(comm), but I was not able to determine where comm is defined.
I next looked at the cmake configuration files to see whether cmake determined that our system's netCDF libraries support parallel I/O. In the file CMakeCache.txt in the build directory I find these lines:
//NETCDF library compiled with parallel IO support
NETCDF_IS_PARALLEL:BOOL=TRUE
so it appears that the check done by cmake does verify parallel netCDF libraries. But if so, why is the this%parallel flag above false?
To independently verify whether our system's netCDF libraries support parallel I/O, I ran the netCDF utility nc-config --has-parallel, which indicates that they are in fact not configured for parallel I/O. So why is cmake indicating they are?
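A small sketch of the same check, for repeating it across module environments. It assumes nc-config is on the PATH and prints yes or no for the --has-parallel and --has-pnetcdf options mentioned in this thread; the helper name and its injectable runner argument are hypothetical.

```python
import subprocess


def nc_config_says_yes(flag, runner=subprocess.run):
    """Return True if `nc-config <flag>` prints "yes", False otherwise."""
    result = runner(
        ["nc-config", flag],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().lower() == "yes"


if __name__ == "__main__":
    for flag in ("--has-parallel", "--has-pnetcdf"):
        try:
            print(flag, nc_config_says_yes(flag))
        except (OSError, subprocess.CalledProcessError) as err:
            # nc-config is missing or rejected the option
            print(flag, "could not be queried:", err)
```

Running it once with the modules used at build time and once with those loaded in the run script would show whether both environments see the same netCDF installation.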
I looked at the test that cmake runs to determine support for parallel I/O. It's in .Code.GCHP/ESMA_cmake/ecbuild/cmake/contrib/FindNetCDF4.cmake:
if(${output} STREQUAL yes)
  set(HAS_HDF5 TRUE)
  set(HDF5_FIND_QUIETLY ${NETCDF_FIND_QUIETLY})
  set(HDF5_FIND_REQUIRED ${NETCDF_FIND_REQUIRED})
  find_package(HDF5)
  # list( APPEND NETCDF_LIBRARIES_DEBUG
  # ${HDF5_LIBRARIES_DEBUG} )
  # list( APPEND NETCDF_LIBRARIES_RELEASE
  # ${HDF5_LIBRARIES_RELEASE} )
  set (NETCDF_IS_PARALLEL ${HDF5_IS_PARALLEL})
endif()
_NETCDF_CONFIG (--has-pnetcdf output return)
if(${output} STREQUAL yes)
  set (NETCDF_IS_PARALLEL TRUE)
else()
  # set(NETCDF_IS_PARALLEL FALSE)
endif()
set( NETCDF_IS_PARALLEL TRUE CACHE BOOL
  "NETCDF library compiled with parallel IO support" )
NETCDF_IS_PARALLEL is set to TRUE if nc-config --has-pnetcdf returns 'yes', but isn't set to anything if it returns 'no', since that line is commented out. However, above that, NETCDF_IS_PARALLEL is set to ${HDF5_IS_PARALLEL}. Since nc-config will return 'no' in our case, NETCDF_IS_PARALLEL must be set to TRUE based on HDF5_IS_PARALLEL. I tried to find where this variable is defined, but couldn't locate it. I don't know why HDF5_IS_PARALLEL would be set to TRUE, because I also verified that the HDF5 libraries on our system don't support parallel I/O either.
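One way to read the snippet is to trace the ordinary variable and the cache entry separately. The Python model below is a simplification under stated assumptions: that the first if() tests the result of an HDF5 query not shown in the excerpt, and that the final set(... CACHE BOOL ...), which has no FORCE, fills the cache entry only when none exists yet. The function and argument names are hypothetical.

```python
def trace_netcdf_is_parallel(has_hdf5, hdf5_is_parallel, has_pnetcdf, cached=None):
    """Trace NETCDF_IS_PARALLEL through the quoted FindNetCDF4.cmake excerpt.

    Returns (value_before_cache_call, cache_entry). `cached` is the value
    already stored in CMakeCache.txt, or None for a fresh build directory.
    """
    value = None  # variable not set yet
    if has_hdf5:
        value = hdf5_is_parallel  # set (NETCDF_IS_PARALLEL ${HDF5_IS_PARALLEL})
    if has_pnetcdf:
        value = True  # set (NETCDF_IS_PARALLEL TRUE)
    # The else() branch is commented out, so a 'no' answer changes nothing.

    # set( NETCDF_IS_PARALLEL TRUE CACHE BOOL "..." ) without FORCE:
    # assumed to create the cache entry only if it does not exist.
    cache_entry = True if cached is None else cached
    return value, cache_entry


if __name__ == "__main__":
    # Serial HDF5, no PnetCDF, fresh build directory.
    print(trace_netcdf_is_parallel(True, False, False))
```

Under these assumptions a fresh build directory records NETCDF_IS_PARALLEL:BOOL=TRUE in the cache whatever nc-config and HDF5 report, which would be consistent with the CMakeCache.txt line quoted earlier. This has not been checked against the actual build.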
My best guess at this point is that GCHP requires netCDF with support for parallel I/O, although I couldn't find this in the GCHP docs. I can't completely reconcile this with the code in subroutine open above, which seems to check for parallel I/O and opens netCDF files with mode NF90_SHARE if it is not in use.
Apologies for the message length but I wanted to fully document what I found. Many thanks for any insights you can provide.
What institution are you from?
Portland State University's Center for Climate & Aerosol Research
Description of the problem
I am submitting multiple jobs, each for a different year, within two time periods: 2003 to 2005 and 2017 to 2019. My current build of GC GCHP works without any issues for the years 2018 and 2019, but when I try to submit more than one job at a time, I experience a spell of runtime errors involving variable attributes in my HEMCO data directories. Since the jobs run on the same build, I created individual ExtData folders for each of the years listed above, so that no overlap should occur. Attached below are the log files for the 2003 run I submitted a few days ago; this run failed the fastest, while the 2004 and 2005 runs had already been running for hours at that point.
Additionally: these runs are preludes to another task, which involves iterating a series of jobs over a target time period, where each job starts one month ahead of the previous run in the series (n+1) and all jobs share the same end time. I wrote a Python script that I am hoping will perform this task, but when I run the script, only the series of directories is created and no jobs are submitted. (See below):
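As an illustration of the series described above (this is not the script from the report), the following sketch only generates the start and end dates. It assumes "one month ahead" means one calendar month later and that every run starts on the first day of a month; the function name is hypothetical. Job submission is left out because the scheduler command is site-specific.

```python
from datetime import date


def staggered_starts(first_start, end, n_runs):
    """Return (start, end) pairs; each start is one month after the previous one.

    Only the year and month of `first_start` are used; every run starts on day 1.
    """
    runs = []
    year, month = first_start.year, first_start.month
    for _ in range(n_runs):
        start = date(year, month, 1)
        if start >= end:
            break  # a run must start before the shared end date
        runs.append((start, end))
        # Advance one calendar month, rolling over into the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return runs


if __name__ == "__main__":
    for start, end in staggered_starts(date(2003, 11, 1), date(2004, 3, 1), 6):
        print(start.isoformat(), "->", end.isoformat())
```

Each pair could then be written into its own run directory before submission, in line with the one-directory-per-job advice elsewhere in this thread.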
Description of troubleshooting performed
I have tried to back-trace error 101 in the NetCDF4_FileFormatter.F90 file and error 33 in the NetCDF4_get_var.H file; both attempts were relatively fruitless. In the run output log, you can see that some (but not all) of the files referenced are housed in 2018 folders; could that be part of the problem? For now, I am making a list of all of the referenced files to look for any overlap or trend in what could be causing this spell of runtime errors.
GEOS-Chem version
GEOS-Chem GCHP v13.0.2
Log files
Build log (if applicable): HEMCO_Config.rc.txt ExtData.rc.txt HEMCO_Diagn.rc.txt
Run logs (if applicable): gchp_5658756.out.txt allPEs.log HEMCO.log
Run script (if applicable): runConfig.sh.txt
Software versions
module load gcc-9.2.0
module load openmpi-3.0.1/gcc-9.2.0
module load netcdf/netcdf-c-4.8.0/gcc-9.2.0
module load netcdf/netcdf-cxx/4.3.1/gcc-9.2.0
module load netcdf/netcdf-fortran/4.5.3/gcc-9.2.0
module load /scratch/cbuten/GCHP/esmf_8.1.1_gcc9.2.0
module load General/hdf5/1.12.1/gcc-9.2.0
module load cmake/3.21.0/gcc-9.2.0