E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Bug: short term archiving moves MPAS hybrid restart files #4424

Closed (golaz closed 3 years ago)

golaz commented 3 years ago

There is a bug in short term archiving when it archives MPAS restart files. Instead of only moving restart files from the current simulation, it also moves restart files in the run sub-directory that are needed for a hybrid restart. This then causes the model to fail upon restart.

Specific example:

Case name of the simulation: v2.LR.abrupt-4xCO2_0101. This simulation is configured as a hybrid restart at year 0101 from v2.LR.piControl.

When short term archiving year 0101, the following two files

v2.LR.piControl.mpaso.rst.0101-01-01_00000.nc
v2.LR.piControl.mpassi.rst.0101-01-01_00000.nc

are moved out of the run/ sub-directory and into:

v2.LR.abrupt-4xCO2_0101/archive/rest/0101-01-01-00000/

The next time the model tries to restart, it fails:

ERROR:  ERROR mpassi buildnml: missing specified restart file for branch or hybrid run: /lcrc/group/e3sm/ac.golaz/E3SMv2/v2.LR.abrupt-4xCO2_0101/run/v2.LR.piControl.mpassi.rst.0101-01-01_00000.nc

We should fix this for v2.0. (I recall running into the same problem during the v1 Water Cycle simulation campaign.)

rljacob commented 3 years ago

I thought that if you start a run as a hybrid, the next submission after the first one should just be a regular restart. So it should only need the restarts written at the end of the first run.

golaz commented 3 years ago

Don't forget that MPAS is "special": it will read those files no matter what, I think to get information about the mesh or something else. Anyway, short term archiving should not be moving these files.

jonbob commented 3 years ago

@golaz - with help from @jgfouca, I have a fix for this issue but need to test it a bit more. It will require a CIME PR so it may take a little time to get merged into E3SM, but I'll push on it as much as possible.

rljacob commented 3 years ago

For the record, what is MPAS reading those files for?

jonbob commented 3 years ago

It's a quirk of hybrid cases: those files are not read in as "restart" files but as "input" files. That's where the components get grid information, etc., though they could obviously also read that information from the actual restart files once the initial run has completed. I think the early design was that restart files wouldn't have to carry all that extra information. The fix is easy enough, and I'm almost done testing. I think the short-term archiver didn't get updated when we prepended the casename onto all the output files, so the fix is to have the archiver only move files with the casename, as it does with other components.
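To illustrate the idea of moving only files that carry the current casename, here is a minimal Python sketch. It is not the code from the CIME fix: the function names `select_restart_files` and `archive_restarts` are hypothetical, and the assumed naming pattern `<casename>.<component>.rst.<date>.nc` is inferred only from the file names quoted in this issue.

```python
import shutil
from pathlib import Path


def select_restart_files(rundir, casename, component):
    """Return restart files in rundir that belong to casename.

    Assumes (hypothetically) that restart files are named
    <casename>.<component>.rst.<date>.nc, as in the examples in this issue.
    """
    prefix = f"{casename}.{component}.rst."
    return sorted(
        p for p in Path(rundir).iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.name.endswith(".nc")
    )


def archive_restarts(rundir, archive_dir, casename, components):
    """Move only the current case's restart files into archive_dir.

    Files with any other casename prefix (for example the reference case
    of a hybrid run) are left in rundir. Returns the moved file names.
    """
    dest = Path(archive_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for component in components:
        for path in select_restart_files(rundir, casename, component):
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved
```

Called with casename `v2.LR.abrupt-4xCO2_0101` and components `["mpaso", "mpassi"]`, this sketch would leave `v2.LR.piControl.mpaso.rst.0101-01-01_00000.nc` and `v2.LR.piControl.mpassi.rst.0101-01-01_00000.nc` in run/, because neither name starts with the current casename.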

jonbob commented 3 years ago

Created CIME PR #4057 to fix this issue. We can make an E3SM PR to bring in the new version of CIME with the fix when it's ready.