E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Problem restarting after writing "new" yaml outputs that use horiz remapping file with ne1024 on frontier #2411

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

For our Cess runs at ne1024 on frontier, we are trying to use some new yaml outputs. Two of them use a horiz remap file, which may be the issue here. The repo I'm using should be the machines/frontier branch with Luca's branch bartgol/fix-coarsening-remapper-mask-handling merged in to fix an issue regarding remapping. The case will actually run 1 completed day (even 2 days) and write restarts, but each time I've tried to restart from those, it hangs.

The new yaml outputs:

    ./atmchange output_yaml_files="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.23hourly_QcQiNcNi.yaml"
    ./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.23hourly_QrNrQmBm.yaml"
    ./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.3hourlyAVG_ne120.yaml"
    ./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.3hourlyINST_ne120.yaml"
    ./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.hourly_2Dvars.yaml"
    ./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.monthly_ne1024.yaml"
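For context, the two *_ne120 streams above are the ones that apply a horizontal remap. A rough, illustrative sketch of what such an output spec contains is below; the field names, map path, and frequencies are placeholders (the real specs are the yaml files at the paths above), and the key names follow the EAMxx output-spec layout of roughly this era:

    # Illustrative sketch only -- not the contents of the actual Cess yaml files.
    filename_prefix: scream_output.Cess.3hourlyAVG_ne120
    Averaging Type: Average
    Max Snapshots Per File: 8
    # Coarsening map from the native ne1024pg2 grid to ne120pg2 (placeholder path)
    horiz_remap_file: /path/to/map_ne1024pg2_to_ne120pg2.nc
    Fields:
      Physics PG2:
        Field Names:
          - T_mid
          - qv
    output_control:
      Frequency: 3
      frequency_units: nhours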

Last files written to:

-rw-rw-r-- 1 noel cli115         16411 Jun 29 11:23 homme_atm.log.1365799.230629-112302
-rw-r--r-- 1 noel cli115            47 Jun 29 11:25 mass.out
-rwxr-xr-t 1 noel cli115  105545467928 Jun 29 11:49 output.scream.23hourly_QcQiNcNi.INSTANT.nhours_x23.2019-08-01-00000.nc*
-rw-rw-r-- 1 noel cli115        218903 Jun 29 11:49 e3sm.log.1365799.230629-112302

Last lines in e3sm log:
    0: Note: nsplit=-1, while nsplit must be >=1. We know SCREAM does not know nsplit until runtime, so this is fine.
    0:       Make sure nsplit is set to a valid value before calling prim_advance_subcycle!
    0: gfr> nelemd 384 qsize 10
    0: compose> nelemd 384 qsize 10 hv_q 1 hv_subcycle_q 6 lim 9 independent_time_steps 1
    0:     P3_INIT (reading/creating look-up tables) ...
    0:

If I log in to a compute node while the job is "hung", this is where I see it:

#0  0x00007fc8c9c400ef in pwrite64 () from /lib64/libpthread.so.0
#1  0x00007fc8cd817fc3 in ADIOI_CRAY_WriteContig () from /opt/cray/pe/lib64/libmpi_cray.so.12
#2  0x00007fc8cd81d4bc in ADIOI_CRAY_WriteStridedColl () from /opt/cray/pe/lib64/libmpi_cray.so.12
#3  0x00007fc8cd7ede59 in MPIOI_File_write_all () from /opt/cray/pe/lib64/libmpi_cray.so.12
#4  0x00007fc8cd7ef791 in PMPI_File_write_at_all () from /opt/cray/pe/lib64/libmpi_cray.so.12
#5  0x00007fc8cf7918a7 in move_file_block () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#6  0x00007fc8cf791403 in move_record_vars () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#7  0x00007fc8cf790d6f in ncmpio_enddef () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#8  0x00007fc8cf6d4f43 in ncmpi_enddef () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#9  0x0000000001c4196d in pioc_change_def ()
#10 0x0000000001e6eb72 in eam_pio_enddef$scream_scorpio_interface_ ()
#11 0x0000000001e8cce6 in eam_pio_enddef_c2f ()
#12 0x0000000001e8a380 in scream::scorpio::eam_pio_enddef(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#13 0x0000000001e94e90 in scream::OutputManager::setup_file(scream::IOFileSpecs&, scream::IOControl const&) ()
#14 0x0000000001e90adf in scream::OutputManager::setup(ekat::Comm const&, ekat::ParameterList const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<scream::FieldManager>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<scream::FieldManager> > > > const&, std::shared_ptr<scream::GridsManager const> const&, scream::util::TimeStamp const&, scream::util::TimeStamp const&, bool) ()
#15 0x0000000001ceaa02 in scream::control::AtmosphereDriver::initialize_output_managers() ()
#16 0x00000000006199eb in scream_init_atm ()
#17 0x0000000000614a4a in atm_init_mct$atm_comp_mct_ ()
#18 0x000000000046ade0 in component_init_cc$component_mod_ ()
#19 0x0000000000437cde in cime_init$cime_comp_mod_ ()
#20 0x0000000000468963 in main ()
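
(For reference: a backtrace like the one above is typically collected by ssh-ing to one of the job's compute nodes and attaching gdb to the hung process. A placeholder sketch of that procedure; the node name, process match, and executable name are assumptions, not details from this run:)

    ssh <compute-node>
    # assumes the atmosphere executable is named e3sm.exe; adjust to the actual binary
    pid=$(pgrep -f e3sm.exe | head -n1)
    # attach, dump every thread's stack, and detach
    gdb -p "$pid" -batch -ex "thread apply all bt"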

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun26/t.maf-jun26.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.era2019.SST.newo.cice0
ndkeen commented 1 year ago

Luca B, Chris T, and I have been trying to debug this. We have been unable to find a reproducer at lower resolution than ne1024. We have tried a few other things without success and have 2 experiments in the queue.

Last night, one of those experiments worked out; it was based on a suggestion from Luca:

Restart:
  force_new_file: true

I understand this will write more files. It must be writing the data to a new output file instead of trying to save it in a restart for the next job.

ndkeen commented 1 year ago

Note that for a recent Cess run (using the cess branch) on frontier, we forgot to include the restart force hack in some yaml files; the error seen in e3sm.log is below, in case someone else hits this and it serves as a clue. Adding the restart force hack allowed it to run.

2148: terminate called after throwing an instance of 'std::logic_error'
2148:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se70-jul19/components/eamxx/src/share/io/scream_io_utils.cpp:66: FAIL:
2148: found
2148: Error! Restart requested, but no restart file found in 'rpointer.atm'.
2148:    restart case name: output.scream.timestepINST
2148:    restart file type: history restart
2148:    rpointer content:
2148: ./t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om.scream.r.INSTANT.nyears_x1.0006-01-01-00000.nc
2148: t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om.scream.monthly.rhist.AVERAGE.nyears_x1.0006-01-01-00000.nc
bartgol commented 1 year ago

Last night, one of those experiments worked out; it was based on a suggestion from Luca:

Restart:
  force_new_file: true

I understand this will write more files. It must be writing the data to a new output file instead of trying to save it in a restart for the next job.

For clarity: the default upon restart is to resume the last nc file (assuming we did not already reach the max snap per file). All that force_new_file does is to start a new nc file, regardless of how much data was written in the last output file.
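
In yaml terms, the knob sits under the output spec's Restart block next to the usual snapshot settings; a rough placement sketch with placeholder values (only force_new_file is taken from this thread, the rest is illustrative):

    Max Snapshots Per File: 8   # illustrative; by default a restart resumes the last file until it is full
    Restart:
      force_new_file: true      # always start a fresh nc file on restart instead of resuming the last one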

bartgol commented 11 months ago

@ndkeen I forgot whether we fixed this or not. Are we still using force_new_file: true in our yaml files?

ndkeen commented 11 months ago

We used it for the Cess runs on frontier. I am not sure whether there has been a change to master that might have addressed it.

It was used in these files:

frontier% grep force *yaml
scream_output.Cess.3hourlyAVG_ne120.yaml:  force_new_file: true
scream_output.Cess.3hourlyINST_ne120.yaml:  force_new_file: true
scream_output.Cess.3hourly_ne1024.yaml:  force_new_file: true
scream_output.Cess.6hourlyAVG_ne30.yaml:  force_new_file: true
scream_output.Cess.6hourlyINST_ne30.yaml:  force_new_file: true
scream_output.Cess.ACI_regions_2D.yaml:  force_new_file: true
scream_output.Cess.ARM_sites_2D.yaml:  force_new_file: true
scream_output.Cess.ARM_sites_3D.yaml:  force_new_file: true
scream_output.Cess.hourly_2Dvars.yaml:  force_new_file: true
bartgol commented 11 months ago

Ok, thanks. I remember we found some issue with remapping, but I don't recall whether it was fixed. I hope to find the time to get to this at some point...

mahf708 commented 2 days ago

Revisiting this issue ...

@ndkeen would it be possible to submit one of your Cess-v2-like runs like the ones above, but without the force_new_file: true option? If not, I can try to diagnose it in other setups (e.g., decadal/aerosol), or we might as well run both setups for more info?
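
(If it helps: the grep output earlier in the thread suggests the option sits on its own line in each spec, so an untested one-liner along these lines could strip it from copies of the yaml files before resubmitting:)

    # untested sketch; run on copies of the Cess output specs
    sed -i '/force_new_file: true/d' scream_output.Cess.*.yaml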