NGEET / fates

repository for the Functionally Assembled Terrestrial Ecosystem Simulator (FATES)

FATES landuse default `fluh_timeseries` crashes `FatesColdLUH` test on izumi #1224

Closed glemieux closed 4 months ago

glemieux commented 4 months ago

This was discovered while testing https://github.com/NGEET/fates/pull/1223. The FatesColdLUH2 test in the fates suite fails in the RUN phase, very early in the run. From the lnd.log file, it looks like the finundated upper-bound read step ("reading file ub") isn't reporting the correct file that it's reading from, but I think that might be a red herring. Note that this doesn't appear to be an issue on derecho or perlmutter.

I can confirm that switching fluh_timeseries to an older file with a shorter time length avoids the issue. That said, the size of the file does not appear to be the cause, based on running the test case with a copy of the same file truncated to a shorter time. I will also note that the older file uses the classic NetCDF format, whereas the newer file is CDF5. I'm not sure how relevant that is, though, since the flandusepftdat file used in this test does not cause a problem when used in conjunction with the older fluh_timeseries file.
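To compare the on-disk formats of the two fluh_timeseries files without opening them through the model, the leading magic bytes can be inspected directly. The following is a minimal sketch, not part of the test suite; the function names are hypothetical, and it assumes the HDF5 signature (if any) sits at offset 0, which is the common case.

```python
# Magic numbers at the start of netCDF files, per the netCDF file format spec.
_CDF_MAGIC = {
    b"CDF\x01": "classic (CDF-1)",
    b"CDF\x02": "64-bit offset (CDF-2)",
    b"CDF\x05": "64-bit data (CDF-5)",
}
# netCDF-4 files are HDF5 files; this assumes the signature is at offset 0.
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_format_from_bytes(header: bytes) -> str:
    """Classify a file from its first 8 bytes (hypothetical helper)."""
    if header.startswith(_HDF5_SIGNATURE):
        return "netCDF-4 / HDF5"
    return _CDF_MAGIC.get(header[:4], "unknown")


def netcdf_format(path: str) -> str:
    """Read the first 8 bytes of `path` and classify the file format."""
    with open(path, "rb") as f:
        return netcdf_format_from_bytes(f.read(8))
```

Calling `netcdf_format()` on the older and newer files should report the classic and CDF-5 variants respectively if the description above is accurate. Knowing the format does not by itself explain the crash.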

It is possible that the newer file, which was generated with the fates land use tool, is affected by a change made to the tool since the original default was created (the original file was created when the tool was still part of the fates repository). Issue https://github.com/NGEET/tools-fates-landusedata/issues/5 covers investigating potential causes on that side.

The log file results are below:

lnd.log

successfully initialized sdat
(shr_strdata_readstrm) opening   : /fs/cgd/csm/inputdata/lnd/clm2/paramdata/finundated_inversiondata_0.9x1.25_c170706.nc
(shr_strdata_readstrm) setting pio descriptor : /fs/cgd/csm/inputdata/lnd/clm2/paramdata/finundated_inversiondata_0.9x1.25_c170706.nc
(shr_strdata_set_stream_iodesc) setting iodesc for : FWS_TWS_A with dimlens(1), dimlens(2) =      288       192   variable as time dimension time
(shr_strdata_readstrm) reading file lb: /fs/cgd/csm/inputdata/lnd/clm2/paramdata/finundated_inversiondata_0.9x1.25_c170706.nc       1
(shr_strdata_readstrm) reading file ub: /fs/cgd/csm/inputdata/lnd/clm2/

cesm.log

Obtained 10 stack frames.
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x336f214]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x336f748]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x336fcc8]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x33727f9]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe(PIOc_openfile+0x11) [0x336e611]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x33230e9]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xa2a117]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xaa737f]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xb3ef07]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xa8a293]
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[mpiexec@i032.cgd.ucar.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec@i032.cgd.ucar.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
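
Most frames in the cesm.log backtrace above carry only an address, so the one frame with a symbol name is easy to miss. The sketch below pulls out the named frames. It assumes the glibc-style `binary(symbol+offset) [address]` layout seen in the log, `named_frames` is a hypothetical helper, and the sample paths are shortened for illustration.

```python
import re

# Matches frames such as "cesm.exe(PIOc_openfile+0x11) [0x336e611]".
# Frames written as "cesm.exe() [0x...]" match with an empty symbol.
_FRAME = re.compile(
    r"\((?P<symbol>[^()+]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def named_frames(log_text):
    """Return (symbol, address) pairs for frames that include a symbol name."""
    frames = []
    for line in log_text.splitlines():
        match = _FRAME.search(line)
        if match and match.group("symbol"):
            frames.append((match.group("symbol"), match.group("addr")))
    return frames


sample = """\
bld/cesm.exe() [0x33727f9]
bld/cesm.exe(PIOc_openfile+0x11) [0x336e611]
bld/cesm.exe() [0x33230e9]
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
"""
print(named_frames(sample))  # [('PIOc_openfile', '0x336e611')]
```

Applied to the log above, the only named frame is `PIOc_openfile`, which places the abort beneath a PIO file open call.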

ekluzek commented 4 months ago

@glemieux the last bit of the error, about the launcher, is usually my cue to resubmit, and it typically resolves itself. I think I've only ever had to resubmit one more time for it to clear.

But if you are getting this consistently on every submission (try a good four times or so), it must mean something real. The first thing that springs to mind is to try the intel and gnu compilers. You might also try with fewer processors.
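A small sketch of how the test name could be rewritten for the other compilers. It takes the name from the run directory in the cesm.log above (without the trailing test id) and assumes the usual dot-separated layout of test type, grid, compset, machine_compiler, testmods, with no dots inside any field. `with_compiler` is a hypothetical helper, and whether each compiler is actually available on izumi would need to be checked.

```python
def with_compiler(test_name: str, compiler: str) -> str:
    """Swap the compiler in the machine_compiler field of a test name.

    Assumes the fourth dot-separated field is "<machine>_<compiler>".
    """
    parts = test_name.split(".")
    if len(parts) < 4 or "_" not in parts[3]:
        raise ValueError(f"no machine_compiler field found in {test_name!r}")
    machine, _, _old_compiler = parts[3].rpartition("_")
    parts[3] = f"{machine}_{compiler}"
    return ".".join(parts)


failing = (
    "ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs."
    "izumi_nag.clm-FatesColdLUH2"
)
for compiler in ("intel", "gnu"):
    print(with_compiler(failing, compiler))
```

If the failure reproduces with every compiler, that would point away from a nag-specific problem.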

Hmmm....

glemieux commented 4 months ago

Closing to move this to CTSM as it is either an issue there or an issue with the fates land use data tool. See https://github.com/ESCOMP/CTSM/issues/2653 and https://github.com/NGEET/tools-fates-landusedata/issues/5, respectively.