E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

PIO: FATAL ERROR: Aborting. Reading variable (ice_cov, varid=5) Unknown error occurs in reading file for ne256 case #2318

Closed ndkeen closed 1 year ago

ndkeen commented 1 year ago

I've been trying to reach 13 months with an ne256 case (--compset F2010-SCREAMv1 --res ne256pg2_ne256pg2), using a repo from March 27th (the same repo used for the ne120 cases).

At model date = 00011207, we hit the following error:

  4: PIO: FATAL ERROR: Aborting... An error occured, Reading variable (ice_cov, varid=5) from file (/global/cfs/cdirs/e3sm/inputdata/ocn/docn7/SSTDATA/sst_ice_CMIP6_HighResMIP_E3SM_0.25x0.25_2010_clim_c20190125_intoisst.nc, ncid=66) failed with \
PIO_IOTYPE_PNETCDF iotype. The low level (PnetCDF) I/O library call failed to read the variable (Number of regions = 1, iodesc id = 565, Bytes to read on this process = 10080). Unknown error occurs in reading file (err=-205). Aborting since the \
error handler was set to PIO_INTERNAL_ERROR... (/global/cfs/cdirs/e3sm/ndk/repos/se62-mar27/externals/scorpio/src/clib/pio_darray_int.c: 1223)
  4: Obtained 10 stack frames.
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xf80d15]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xf80e5e]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xf8105f]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xfaac0a]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xfa7ec8]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xf7a43d]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xe5d1e8]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xe64402]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xec45b8]
  4: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200/build/e3sm.exe() [0xcbf5f0]
  4: MPICH ERROR [Rank 4] [job id 8409658.0] [Wed May  3 20:16:48 2023] [nid001036] - Abort(-1) (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 4
  4:
  4: MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
  4: ADIOI_GEN_READCONTIG(90): Other I/O error Stale file handle
  4: aborting job:
  4: application called MPI_Abort(MPI_COMM_WORLD, -1) - process 4
  4: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
  4:
  4: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200

jayeshkrishna commented 1 year ago

Is this a reproducible error (looks like a filesystem issue)?

dqwu commented 1 year ago

"Other I/O error Stale file handle" looks like a filesystem issue to me.
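
For anyone triaging similar failures, here is a minimal sketch of a log scan that flags the lines pointing at the filesystem rather than at the model. The marker strings are copied from the log above; treating them as signs of a transient filesystem problem is an assumption, and `find_fs_errors` is a hypothetical helper, not part of E3SM, CIME, or SCORPIO.

```python
import re

# Substrings copied from the log in this issue. Treating them as markers of a
# transient filesystem problem is an assumption, not an official list.
TRANSIENT_FS_MARKERS = (
    "Stale file handle",
    "Other I/O error",
)

# Log lines in this issue carry an MPI rank prefix such as "  4: ".
RANK_PREFIX = re.compile(r"^\s*(\d+):\s?(.*)$")


def find_fs_errors(log_text):
    """Return (rank, message) pairs for log lines that contain a marker.

    rank is None when a line has no "<rank>:" prefix.
    """
    hits = []
    for line in log_text.splitlines():
        match = RANK_PREFIX.match(line)
        if match:
            rank, msg = int(match.group(1)), match.group(2)
        else:
            rank, msg = None, line.strip()
        if any(marker in msg for marker in TRANSIENT_FS_MARKERS):
            hits.append((rank, msg))
    return hits
```

If the only hits are of this kind and the failing file sits on a shared filesystem, resubmitting is a reasonable first step before digging into the model itself.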

ndkeen commented 1 year ago

Yes, a filesystem issue was my first thought, so I resubmitted. But I wanted to save the info somewhere, so I opened an issue. Specifically, it looks like a problem reading from CFS, which was acting up yesterday on the login node.

ndkeen commented 1 year ago

The next job in this case finally ran and seems OK. I will assume this was a temporary issue with the CFS filesystem.
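
As a rough illustration of checking that an input file is readable before resubmitting, the sketch below reads the first few bytes and retries only on `ESTALE` (the errno behind "Stale file handle"). The helper name `read_with_retry`, the retry count, and the delay are all assumptions chosen for illustration. This is a pre-flight check run outside the model; it does not change how PIO or PnetCDF read the file during the run.

```python
import errno
import time


def read_with_retry(path, nbytes=4096, attempts=3, delay_s=5.0,
                    opener=open, sleep=time.sleep):
    """Read the first nbytes of path, retrying on ESTALE (stale file handle).

    opener and sleep are injectable so the retry logic can be exercised
    without a real flaky filesystem. All defaults are illustrative.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as f:
                return f.read(nbytes)
        except OSError as exc:
            # Retry only on stale-handle errors. Anything else is re-raised
            # immediately, and ESTALE is re-raised once attempts run out.
            if exc.errno != errno.ESTALE or attempt == attempts:
                raise
            sleep(delay_s)
```

A call such as `read_with_retry(sst_file)` from a login node, where `sst_file` is the SST data path from the log, would show whether the file is currently readable there. A successful read on one node does not guarantee that compute nodes see a healthy mount.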