IO shifts time index around restarts

mahf708 commented 3 months ago

There’s a harmless (for now) break in the data indexing for both runs for instantaneous outputs. In the present-day run, this occurs around 2019-12-19 (a restart checkpoint). In the pre-industrial run, it occurs around 2019-11-15. In both instances, the following takes places:

Before the restart, the data files have signature ending with -00000.nc whereas after the restart -?????.nc where ????? is usually the time in second to the first timestep in the file. Put differently, before the restart, the time variable for the 3-hourly output stream goes from 00, 03, 06, 09, …, 21 and after the restart 03, 06, 09, 12, …, 24. In the overlapping periods (since these breaks were associated with model failures), the data points seem to be in good shape, and the right values have the right time variables associated with them.

This may or may not be problematic for analysis, depending on specific tools. It is noted here but it won’t be acted upon. It is left to the user to work around it, if desired/needed.

See internal outputs https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/4334223933/EAMxx+ERFaer+production from a recent run using commit https://github.com/E3SM-Project/scream/commit/29bdb81 on branch https://github.com/E3SM-Project/scream/tree/mahf708-ff-a73d48a

AaronDonahue commented 3 months ago

@mahf708 I did some digging into this and I think this behavior actually is expected given the logic of restarts and file writing. I think it is an result of using force_new_file=True. I think a simple example might illustrate why.

Consider a case where you write hourly instant output with a max number of snaps per file as 24. So basically you want each file to represent one day of data.

For this case your I/O will look like

File 1 records snaps 0, 1, 2, ...., 23.
File 1 records snaps 24, 25, ...., 47.
...and so on.

The name of the file is determined by the first snap recorded. So in the above example the filenames will all have a *-00000.nc in their names.

What is happening under the hood is that at the end of the 1st day of simulation the I/O has written 23 snaps to File 1 and 1 snap to File 2. Because I/O happens at the end of the snap and a full day of simulation is 24 hours.

If you ran just a 1 day simulation you would be left with 2 files. One w/ 24 snaps (0-23) and one with 1 snap (24). The default behavior is that at restart it will reopen File 2 and fill in snaps 25-47.

When we "force new file" we tell I/O to start a new output stream on its first snap which in this case is snap 25. So it does what it is told and makes sure the filename represents the first recorded timesnap. Hence the shift in the filename.

mahf708 commented 3 months ago

That makes sense and in line with Ben's experience. However, it doesn't explain everything. The above happened without force_new_file=True. The simulations recorded here were completely finished without using the force_new_file flag: https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/4334223933/EAMxx+ERFaer+production

bartgol commented 3 months ago

In the overlapping periods (since these breaks were associated with model failures), the data points seem to be in good shape, and the right values have the right time variables associated with them.

Are you saying that this weird behavior only occurs when the model crashed and does not appear when restarting from the end of a previous run? Can you elaborate more on how the run evolves? E.g., "write restart at t=T, run X more stesp (Y outputs written) and crash, then restart again from t=T and output files are suddenly odd"

mahf708 commented 3 months ago

In the overlapping periods (since these breaks were associated with model failures), the data points seem to be in good shape, and the right values have the right time variables associated with them.

Are you saying that this weird behavior only occurs when the model crashed and does not appear when restarting from the end of a previous run? Can you elaborate more on how the run evolves? E.g., "write restart at t=T, run X more stesp (Y outputs written) and crash, then restart again from t=T and output files are suddenly odd"

Not exactly. We actually ran into this several times, so it is not just one case and I'd argue it is a feature by now, not a bug 😉 I think it may best to point you to some log files, but here's a rough timeline:

Model runs for several months, fails and continues fine, etc. --- from Aug 1 to Dec 19, all good.
Model fails after Dec 19, and I restarted it at the last healthy checkpoint (Dec 19)
Model restarts fine from Dec 19, BFB, etc., but all INSTANT IO files are shifted by one snapshot; values/times themselves are fine across files; it is just they're shifted by one in terms of which files they are stored in (for the overlapping period, we have both sets of files)

bartgol commented 3 months ago

What is the "overlapping period"? What you mean with "both" sets of files?

So you are saying that sometimes it crashes and upon restart files look still ok, but other times it fails and upon restarts files have the shift? Does the shift ever happen upon restarting a successful run (meaning that there were no partial output files hanging around)?

mahf708 commented 3 months ago

What is the "overlapping period"? What you mean with "both" sets of files?

In my example above, recall that the restart checkpoint is Dec 19:

First run: Dec 19 to Dec 23 (system-side fail) ---> output files produced normally, everything is good, no time shifts: first set of files (signature -00000.nc)
Second run: Dec 19 to ... ---> output files produced normally, everything is good, but with time shifts: second set of files (signature -?????.nc where ????? is the number of seconds of the first entry, e.g., 00900 for 15-min output)

So we can compare the files produced on Dec 20, 21, 22 with the different signatures as they are not overwritten (all my IO is such that we have one day of data for each stream, so 4 snaps for 6-hourly output). These files have identical values and identical timestamps (in the ones I checked) but they are shifted in terms of which files they are saved in (by one).

So you are saying that sometimes it crashes and upon restart files look still ok, but other times it fails and upon restarts files have the shift?

Yes.

Does the shift ever happen upon restarting a successful run (meaning that there were no partial output files hanging around)?

Not that I have seen. However, it persists (so once it shifts, it remains shifted for the rest of the simulation, even if runs are successful).

bartgol commented 3 months ago

I found out why this happens, and I think it may require some thinking/work to fix (though I don't suspect it would be much).

The issue only arises when restarting after a fail. Consider this scenario:

100 timesteps/day (it's weird, but just for simplicity), so, time would be t=0.00,0.01,...0.99,1.00,...
we fit 100 snaps/file
write restarts every 5 days

Suppose we have a crash sometime during day 7. Since we have a restart file at t=5.00, we can restart from there. Before we restart, the output files created are

blah.2000-01-01-00000.nc: contains t=0.00,...,t=0.99
blah.2000-01-02-00000.nc: contains t=1.00,...,t=1.99
blah.2000-01-03-00000.nc: contains t=2.00,...,t=2.99
blah.2000-01-04-00000.nc: contains t=3.00,...,t=3.99
blah.2000-01-05-00000.nc: contains t=4.00,...,t=4.99
blah.2000-01-06-00000.nc: contains t=5.00,...,t=5.99

along with blah.2000-01-06-00000.rhist.nc which is used to restart history. In particular, rhist claims that the last output file is blah.2000-01-06-00000.nc, which, at the moment when rhist was written, only contained 1 snapshot (the t=5.00 snap).

Now, upon restart, looking at rhist, we see that the last output file was blah.2000-01-06-00000.nc, so we open it, in the hope of resuming to fill it. We find out that it's full, so IO will simply open a new file when the next output is requested. That happens to be the 1st timestep inside day 6 (with 100 steps/day, the fraction of day in seconds is 864s), so the new file is open with that timestamp: blah-2000-01-05-00864.nc. This explains what Naser/Ben have been seeing (with nsteps/day=96, we have the 1st timestamp of the day being 00900).

Now, let's talk solutions. The easiest one I can think of is the following. When we restart, if the last output file is full, we need to consider the case where we may have to "throw some of it away". Namely, we check the last output file: is it full?

no: all good, resume filling it
yes: look at the time var in the file: if there are N time snaps after t_restart, "decrease the file content by N". This entails modifying our (meaning on the scream side) data structures, so that we decrease the number of active records for a variable, as well as the current time dim length.

mahf708 commented 3 months ago

Now, upon restart, looking at rhist, we see that the last output file was blah.2000-01-06-00000.nc, so we open it, in the hope of resuming to fill it.

Well, as I said yesterday, this is a flaw. We shouldn't need to open the file to know if we need to keep filling it or not. We should know that ahead of time by information available in the corresponding rhist file.

bartgol commented 3 months ago

Even if we had more info added to rhist, we'd still need the same workaround. With the fix you want, we'd open blah.2000-01-06-00000.nc, and rhist would tell us it has only 1 snap (b/c that's how many snaps it had when we were writing rhist). We cannot append to the file (since it may have more than 1 snap, if the run crashed after rhist was written, like in this case), so we need to first "reset" the number of records in the file, so that pio will overwrite snaps, starting at position 2 (or 1 in C indexing).

E3SM-Project / scream

IO shifts time index around restarts #2889