Open LarissaReames-NOAA opened 1 month ago
@LarissaReames-NOAA @junwang-noaa We have a proposed fix for this issue now. I reached out to Tony Craig, and he was able to reproduce the issue in standalone CICE and quickly zeroed in on the problem and solution. He was able to generate 8700 files in standalone testing. I'll make a test branch, and hopefully one of us can try it out and confirm that it works.
I've tested Tony's fix (https://github.com/DeniseWorthen/CICE/tree/bugfix/manyfiles) using the C48-5deg case on Gaea. I was able to create 1906 hourly history files before hitting the wall clock time (8 hours). So I think I have a fix, although the exact implementation may change a bit.
Description
Using CICE in an S2S configuration in ufs-weather-model causes failures after a large number of CICE file writes (restart and/or history; roughly 500-700) when CICE is compiled with PIO but not with NetCDF. The failure always happens on a CICE process. The current workaround for weather model regression tests has been to set
export I_MPI_SHM_HEAP_VSIZE=16384
in the job submission script, but this is not a long-term solution.
To Reproduce:
Additional context
Cause of issue first reported in weather model issue 2320
I've also tried all possible options of restart/history_format in ice_in and the failure is always the same.
Output
On Hera the failure looks like:
On WCOSS2 and Gaea the error looks like: