NOAA-EMC / CICE

Development repository for the CICE sea-ice model

Coupled weather model forecasts fail after large # of file writes when CICE is compiled using PIO #94

Open LarissaReames-NOAA opened 1 month ago

LarissaReames-NOAA commented 1 month ago

Description

Using CICE in an S2S configuration of ufs-weather-model causes failures after a large number of CICE file (restart and/or history) writes, roughly 500-700, when CICE is compiled with PIO, but not when it is compiled with NetCDF. The failure always occurs on a CICE process. The current workaround for the weather model regression tests has been to set export I_MPI_SHM_HEAP_VSIZE=16384 in the job submission script, but this is not a long-term solution.

To Reproduce:

  1. Compile the weather model with ATM+ICE+OCN on Hera, Gaea, or WCOSS2. Multiple weather model regression test configurations and resolutions (cpld_control_c48, cpld_control_nowave_noaero_p8) and spack-stack versions/Intel compilers (2021 vs 2023) have been used, with similar results.
  2. Either run very long simulations with infrequent output or shorter simulations with high-frequency output (an illustrative ice_in sketch follows this list).
  3. The failure occurs after roughly 500-700 files have been written.
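
For context, the following is a minimal ice_in sketch of the kind of high-frequency output that reaches the failure point quickly. The frequencies and format value are illustrative assumptions, not the exact settings from the regression tests; as noted below, the failure is the same for every restart/history format option.

&setup_nml
  histfreq       = 'h','x','x','x','x'  ! hourly output on the first history stream
  histfreq_n     = 1, 1, 1, 1, 1
  dumpfreq       = 'h'                  ! hourly restart writes
  dumpfreq_n     = 1
  history_format = 'default'            ! failure occurs for every available format option
  restart_format = 'default'
/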

Additional context

The cause of the issue was first reported in weather model issue #2320.

I've also tried all of the available restart_format/history_format options in ice_in, and the failure is always the same.

Output

On Hera the failure looks like:

73: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x150a7a430bcc]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x150a79e0adf1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x150a79ad9eb9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x176584) [0x150a7999e584]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x17a9f9) [0x150a799a29f9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x199a60) [0x150a799c1a60]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x150a799997ec]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b4387) [0x150a79adc387]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x150a799376e1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_File_open+0x17d) [0x150a7a4492bd]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpio_create+0x199) [0x150a73e592c9]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpi_create+0x4e7) [0x150a73daf4a7]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile_int+0x2e6) [0x150a7c436696]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile+0x41) [0x150a7c432451]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpiof.so(piolib_mod_mp_createfile_+0x25e) [0x150a7c1caabe]

On WCOSS2 and Gaea the error looks like:

17: MPICH ERROR [Rank 17] [job id 135188771.0] [Mon Oct  7 17:37:30 2024] [c5n1294] - Abort(806965007) (rank 17 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
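
The "Too many communicators (0/2048 free on this process)" message points to exhaustion of MPICH's per-process pool of communicator context IDs, i.e. a communicator (or the internal duplicate that MPI_File_open makes) is apparently created for each file write and never freed. As a hedged illustration only, and not CICE or PIO code, the standalone Fortran sketch below leaks communicators in a loop and triggers the same class of abort on MPICH-based MPI libraries once the roughly 2048 available context IDs are used up.

program comm_leak_demo
  ! Minimal sketch (not CICE/PIO code): leak communicators until the
  ! MPICH context-id pool (2048 per process) is exhausted, producing a
  ! "Too many communicators" abort like the one shown above.
  use mpi
  implicit none
  integer :: ierr, rank, newcomm, i

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  do i = 1, 3000
     ! Each split consumes a context id; a matching
     ! call MPI_Comm_free(newcomm, ierr) here would release it.
     call MPI_Comm_split(MPI_COMM_WORLD, mod(rank, 2), rank, newcomm, ierr)
     if (rank == 0 .and. mod(i, 500) == 0) write(*,*) 'communicators created:', i
  end do

  call MPI_Finalize(ierr)
end program comm_leak_demo

If that interpretation is correct, it would be consistent with the failure appearing only after several hundred CICE file writes rather than immediately.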
DeniseWorthen commented 1 week ago

@LarissaReames-NOAA @junwang-noaa We now have a proposed fix for this issue. I reached out to Tony Craig, and he was able to reproduce the issue in standalone CICE and quickly zeroed in on the problem and solution. He was able to generate 8700 files in standalone testing. I'll make a test branch, and hopefully one of us can try it out and ensure it works.

DeniseWorthen commented 1 week ago

I've tested Tony's fix (https://github.com/DeniseWorthen/CICE/tree/bugfix/manyfiles) using the C48-5deg case on Gaea. I was able to create 1906 hourly history files before hitting the wall-clock limit (8 hours). So I think I have a fix, although the exact implementation may change a bit.