ndkeen opened this issue 1 year ago
Luca B, Chris T, and I have been trying to debug this. We have been unable to find a reproducer at lower resolution than ne1024. We have tried a few other things without success and have 2 experiments in the queue.
Last night, one of those experiments worked out, which was a suggestion from Luca:
Restart:
  force_new_file: true
I understand this will write more files. It must be writing data to an output file instead of trying to save it in a restart for the next job.
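For anyone hitting this later, here is a minimal sketch of where that setting lives in one of the output yaml files. Only the Restart / force_new_file entry is taken from this issue; every other key and value below is an illustrative placeholder and may not match the actual Cess yamls:

  # Sketch of an eamxx output yaml with the workaround applied.
  # Only Restart/force_new_file comes from this issue; the rest
  # (prefix, fields, frequencies) are placeholder assumptions.
  filename_prefix: output.scream.3hourlyINST
  Averaging Type: Instant
  Fields:
    Physics PG2:
      Field Names:
        - T_mid
        - qv
  output_control:
    Frequency: 3
    frequency_units: nhours
  Restart:
    force_new_file: true   # workaround: always start a new nc file on restart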
Note that for a recent Cess run (using the cess branch) on frontier, we forgot to include the restart force hack in some yaml files; the error seen in e3sm.log is below, in case someone else hits this and it is a clue. Adding the restart force hack allowed the run to proceed.
2148: terminate called after throwing an instance of 'std::logic_error'
2148: what(): /global/cfs/cdirs/e3sm/ndk/repos/se70-jul19/components/eamxx/src/share/io/scream_io_utils.cpp:66: FAIL:
2148: found
2148: Error! Restart requested, but no restart file found in 'rpointer.atm'.
2148: restart case name: output.scream.timestepINST
2148: restart file type: history restart
2148: rpointer content:
2148: ./t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om.scream.r.INSTANT.nyears_x1.0006-01-01-00000.nc
2148: t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om.scream.monthly.rhist.AVERAGE.nyears_x1.0006-01-01-00000.nc
For clarity: the default upon restart is to resume the last nc file (assuming we did not already reach the max snapshots per file). All that force_new_file does is start a new nc file, regardless of how much data was written in the last output file.
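As a concrete illustration of the difference, a hedged sketch (the comments and the explicit false default are my reading of the above; only the force_new_file key itself appears in this thread):

  # default (assumed): resume the last nc file until max snapshots per file
  Restart:
    force_new_file: false

  # workaround from this issue: always start a new nc file on restart
  Restart:
    force_new_file: true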
@ndkeen I forgot whether we fixed this or not. Are we still using force_new_file: true in our yaml files?
We used it for the Cess runs on frontier. I am not sure whether there have been changes to master that might have addressed it. It was used in these files:
frontier% grep force *yaml
scream_output.Cess.3hourlyAVG_ne120.yaml: force_new_file: true
scream_output.Cess.3hourlyINST_ne120.yaml: force_new_file: true
scream_output.Cess.3hourly_ne1024.yaml: force_new_file: true
scream_output.Cess.6hourlyAVG_ne30.yaml: force_new_file: true
scream_output.Cess.6hourlyINST_ne30.yaml: force_new_file: true
scream_output.Cess.ACI_regions_2D.yaml: force_new_file: true
scream_output.Cess.ARM_sites_2D.yaml: force_new_file: true
scream_output.Cess.ARM_sites_3D.yaml: force_new_file: true
scream_output.Cess.hourly_2Dvars.yaml: force_new_file: true
Ok, thanks. I remember we found some issue with remapping, but I don't recall whether it was fixed. I hope to find the time to get to this at some point...
Revisiting this issue ...
@ndkeen would it be possible to submit one of your cess-v2-like runs like the above, but without the force_new_file: true option? If not, I can try to diagnose it in other setups (e.g., decadal/aerosol), or we might as well run both setups for more info...?
For our Cess runs at ne1024 on frontier, we are trying to use some new yaml outputs. Two of them contain a horiz remap file, which may be the issue here. The repo I'm using should be the machines/frontier branch with Luca's branch that fixes an issue regarding remapping, bartgol/fix-coarsening-remapper-mask-handling, merged in. The case will actually run 1 completed day (even 2 days) and write restarts, but each time I've tried to restart from those, it hangs. The new yaml outputs:
Last files written to:
If I log in to a compute node while the job is "hung", this is where I see it: