MPAS-Dev / MPAS

Repository for private MPAS development prior to the MPAS v6.0 release.

streams output timing bug when dt has minutes and seconds #272

Closed mark-petersen closed 9 years ago

mark-petersen commented 9 years ago

In a test with

    config_dt = '00:33:20'

and monthly restart output

<immutable_stream name="restart"
                  type="input;output"
                  filename_template="restarts/restart.$Y-$M-$D_$h.$m.$s.nc"
                  filename_interval="output_interval"
                  reference_time="0000-01-01_00:00:00"
                  clobber_mode="truncate"
                  input_interval="initial_only"
                  output_interval="00-01-00_00:00:00"/>

The monthly output of restart files is highly variable and incorrect. Some files have a large number of entries, and the xtime variable does not correspond to the file name. Note that 33:20 is 2000 seconds, which divides evenly into a day, so output should be exactly at day boundaries. The output of a 240km global is:

-rw-rw-r-- 1 mpeterse mpeterse  15M Dec  3 08:27 restart.0000-02-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 1.6G Dec  3 08:32 restart.0000-03-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 08:36 restart.0000-04-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 528M Dec  3 08:40 restart.0000-05-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 08:44 restart.0000-06-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 528M Dec  3 08:48 restart.0000-07-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 08:52 restart.0000-08-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 08:56 restart.0000-09-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 09:00 restart.0000-10-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 09:03 restart.0000-11-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 09:07 restart.0000-12-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 09:11 restart.0001-01-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  3 09:15 restart.0001-02-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 1.3G Dec  3 09:20 restart.0001-03-01_00.00.00.nc

Note the variable file size. The largest files have 218 time slices, and the times look as follows:

mu-fe3.lanl.gov> ncdump -v xtime restart.0001-03-01_00.00.00.nc | tail
  "0001-03-26_10:06:40                                             ",
  "0001-03-26_10:40:00                                             ",
  "0001-03-26_10:40:00                                             ",
  "0001-03-26_11:13:20                                             ",
  "0001-03-26_11:13:20                                             ",
  "0001-03-26_11:46:40                                             ",
  "0001-03-26_11:46:40                                             ",
  "0001-03-26_12:20:00                                             ",
  "0001-03-26_12:20:00                                             " ;
}
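
The repeated timestamps in a dump like the one above can be spotted mechanically. The following is a minimal sketch in Python, not part of MPAS or the netCDF tools; the helper name `duplicated_times` is hypothetical, and it assumes the lines are the quoted xtime entries exactly as `ncdump -v xtime` prints them.

```python
from collections import Counter

def duplicated_times(ncdump_lines):
    """Return the xtime strings that occur more than once, in first-seen order.

    Assumes each relevant line holds one double-quoted, space-padded timestamp,
    as in the ncdump output above. Lines without a quoted string are skipped.
    """
    times = []
    for line in ncdump_lines:
        parts = line.split('"')
        if len(parts) >= 3:
            times.append(parts[1].strip())
    counts = Counter(times)
    return [t for t, n in counts.items() if n > 1]

# Sample taken from the tail of the dump shown above.
sample = [
    '  "0001-03-26_10:06:40                                             ",',
    '  "0001-03-26_10:40:00                                             ",',
    '  "0001-03-26_10:40:00                                             ",',
    '  "0001-03-26_11:13:20                                             ",',
    '  "0001-03-26_11:13:20                                             " ;',
    '}',
]
print(duplicated_times(sample))
```

A non-empty result would suggest that the same time level was written to the file more than once.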

The run can be found on the LANL turquoise at test merge branch

mark-petersen commented 9 years ago

The test is at:

/panfs/scratch3/vol16/mpeterse/runs/c36j

Follow the links there. You can also find the 240km grid at:

/turquoise/usr/projects/climate/mpeterse/grids_mpas/earth_init_core/QU.240km/grid.nc

douglasjacobsen commented 9 years ago

I've pushed a fix for this bug here: https://github.com/douglasjacobsen/MPAS/tree/framework/alarm_reset_bugfix

Can you test it with your previous test case and let me know if it fixes it?

mark-petersen commented 9 years ago

Yes, that solved the problem of multiple entries per restart file.

There is still a problem with xtime and the file name not matching. Restarting does not work:

wf-fe2.lanl.gov> cat Restart_timestamp
 0000-12-01_00:06:40
wf-fe2.lanl.gov> ncdump -v xtime restarts/restart.0000-12-01_00.00.00.nc | tail -n 2
  "0000-12-01_00:06:40                                             " ;
}
wf-fe2.lanl.gov> ls -lh restarts
total 114M
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 16:46 restart.0000-02-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 16:49 restart.0000-03-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 16:51 restart.0000-04-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 16:54 restart.0000-05-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 16:57 restart.0000-06-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 17:00 restart.0000-07-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 17:03 restart.0000-08-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 17:06 restart.0000-09-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 17:08 restart.0000-10-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 17:11 restart.0000-11-01_00.00.00.nc
-rw-rw-r-- 1 mpeterse mpeterse 8.9M Dec  4 17:14 restart.0000-12-01_00.00.00.nc

If I try to restart it I get:

cja047.localdomain> cat log.0000.err
Found grid stream with template restarts/restart.$Y-$M-$D_$h.$m.$s.nc

 *******************************************************************************
 *****
 Error: Could not open input file 'restarts/restart.0000-12-01_00.06.40.nc' to r
 ead mesh fields
 *******************************************************************************
 *****

mark-petersen commented 9 years ago

Thinking about this more, it may be that the restart interval is required to be evenly divisible by the time step. I can see the convenience of having the name not coincide with xtime for all other streams. Restart is just the exception.

@douglasjacobsen if you agree with the above condition, then I think this is fixed.
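
The divisibility condition can be checked with a little arithmetic. The sketch below is illustrative Python, not MPAS code. It assumes the run started at 0000-01-01_00:00:00, that the calendar has 365-day years (so 0000-12-01 is 334 days after the start), and that output lands on the first time step at or after the requested time. None of these assumptions is confirmed in this thread.

```python
SECONDS_PER_DAY = 86400
dt = 33 * 60 + 20  # config_dt = '00:33:20' expressed in seconds

def first_step_at_or_after(target_seconds, dt_seconds):
    """First multiple of dt_seconds that is >= target_seconds."""
    steps = -(-target_seconds // dt_seconds)  # ceiling division on integers
    return steps * dt_seconds

def overshoot(days, dt_seconds=dt):
    """Seconds by which the first step at or after `days` days exceeds it."""
    target = days * SECONDS_PER_DAY
    return first_step_at_or_after(target, dt_seconds) - target

print(dt)                    # length of one step in seconds
print(SECONDS_PER_DAY % dt)  # non-zero: one day is not a whole number of steps
print(overshoot(5))          # five days is a whole number of steps
print(overshoot(334))        # offset at 0000-12-01 under the assumptions above
```

Under these assumptions the offset at 0000-12-01 comes out to 400 seconds, that is 6 minutes 40 seconds, which is consistent with the 0000-12-01_00:06:40 value reported in Restart_timestamp.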

douglasjacobsen commented 9 years ago

Yeah, the xtime in the file and the time used to create the filename are not guaranteed to match (i.e. the time the filename is expanded with might not be a time found in the file).

Assuming everything is configured properly (reference_time is set up along with filename_interval), you should be able to restart to any time you have an xtime for. So, I'll try that and see if I run into any errors.

douglasjacobsen commented 9 years ago

@mark-petersen I fixed the other issue and pushed it here: https://github.com/douglasjacobsen/MPAS/tree/framework/alarm_reset_bugfix

Again, please test and let me know if it works for you.

mark-petersen commented 9 years ago

Yes, my test case now runs. Thanks, there do not appear to be any more problems for odd dt intervals.

mark-petersen commented 9 years ago

If I run writing restart files at some time interval, say daily:

                  output_interval="00-00-01_00:00:00"/>

then restart, but change my restart output interval, say

                  output_interval="00-00-05_00:00:00"/>

If I happen not to restart on a five-day boundary, the restart will fail. For example:

wf524.localdomain> cat Restart_timestamp
 0000-01-07_00:00:00
wf524.localdomain> ls -l restarts/
total 640
lrwxrwxrwx 1 mpeterse mpeterse 50 Dec  5 13:45 restart.0000-01-02_00.00.00.nc -> ../../c40u/restarts/restart.0000-01-02_00.00.00.nc
lrwxrwxrwx 1 mpeterse mpeterse 50 Dec  5 13:45 restart.0000-01-03_00.00.00.nc -> ../../c40u/restarts/restart.0000-01-03_00.00.00.nc
lrwxrwxrwx 1 mpeterse mpeterse 50 Dec  5 13:45 restart.0000-01-04_00.00.00.nc -> ../../c40u/restarts/restart.0000-01-04_00.00.00.nc
lrwxrwxrwx 1 mpeterse mpeterse 50 Dec  5 13:45 restart.0000-01-05_00.00.00.nc -> ../../c40u/restarts/restart.0000-01-05_00.00.00.nc
lrwxrwxrwx 1 mpeterse mpeterse 50 Dec  5 13:45 restart.0000-01-06_00.00.00.nc -> ../../c40u/restarts/restart.0000-01-06_00.00.00.nc
lrwxrwxrwx 1 mpeterse mpeterse 50 Dec  5 13:45 restart.0000-01-07_00.00.00.nc -> ../../c40u/restarts/restart.0000-01-07_00.00.00.nc

This fails for both Intel and PGI. For Intel, there is no error message. For PGI, we get:

 ERROR: File restarts/restart.0000-01-06_00.00.00.nc does not contain the time
 0000-01-07_00:00:00

It tries to open the day 6 file because of the five-day interval (starting at day 1), but xtime = day 7 is not in that file.
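
The file selection described here can be reproduced with a short calculation. This is a hedged sketch in Python, not the MPAS streams implementation: it assumes the file stamp is the reference time plus a whole number of filename intervals, rounded down to the restart time. The function names `file_start_day` and `restart_filename` are hypothetical, and times are reduced to day-of-month numbers in January of year 0000 to keep the example small.

```python
def file_start_day(restart_day, reference_day, interval_days):
    """Day stamp of the file expected to hold restart_day.

    Assumes files begin every interval_days, counted from reference_day.
    """
    elapsed = restart_day - reference_day
    return reference_day + (elapsed // interval_days) * interval_days

def restart_filename(day):
    """Expand restart.$Y-$M-$D_$h.$m.$s.nc for a day in January of year 0000."""
    return f"restart.0000-01-{day:02d}_00.00.00.nc"

# Restart_timestamp is day 7; reference_time is day 1.
print(restart_filename(file_start_day(7, 1, 5)))  # five-day interval
print(restart_filename(file_start_day(7, 1, 1)))  # daily interval
```

With the five-day interval the expected file is the day 6 one, which matches the file named in the PGI error, while the daily interval points at the day 7 file that actually holds the requested time.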

My first request is that we have better text messages about opening the restart file. It took me a long time to figure out what was going on here. It would be really helpful, just after

 ----- done parsing run-time I/O from streams.ocean_forward -----

to have a small text message like:

opening file restart.0000-01-06_00.00.00.nc
looking for time  0000-01-07_00:00:00

so that the user knows what went wrong. The intel compiler hides all this, otherwise.
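
As an illustration of the requested behaviour only, the sketch below builds those two messages and adds an error line when the time is missing. It is written in Python rather than in the model's own code, the function name `restart_lookup_report` is hypothetical, and the list of times in the file is supplied by the caller instead of being read from a netCDF file.

```python
def restart_lookup_report(filename, wanted_time, xtimes_in_file):
    """Return the log lines for one restart lookup.

    Always reports which file is opened and which time is sought, then
    appends an error line if that time is not among xtimes_in_file.
    """
    lines = [
        f"opening file {filename}",
        f"looking for time  {wanted_time}",
    ]
    if wanted_time not in xtimes_in_file:
        lines.append(
            f"ERROR: File {filename} does not contain the time {wanted_time}"
        )
    return lines

for line in restart_lookup_report(
    "restart.0000-01-06_00.00.00.nc",
    "0000-01-07_00:00:00",
    ["0000-01-06_00:00:00"],
):
    print(line)
```

Printing the file name and the target time before the lookup means the information is in the log even if a compiler's runtime suppresses the later error.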

Thanks! Mark

mark-petersen commented 9 years ago

@douglasjacobsen Which one am I supposed to test: remotes/dj/framework/alarm_reset_bugfix or remotes/dj/v3.1_fix/alarm_reset_fix?

douglasjacobsen commented 9 years ago

The framework/alarm_reset_bugfix. The other one doesn't include both fixes (since it's intended for a bugfix branch).

mark-petersen commented 9 years ago

With the current version at dj/framework/alarm_reset_bugfix

I'm still getting two files written at start-up. Is that expected?

wf-fe1.lanl.gov> ncdump -v xtime output/output.0000-01-01_00.00.00.nc | tail
  xtime =
   "0000-10-31_00:00:00                                             ",
   "0000-10-31_00:06:00                                             " ;
}

douglasjacobsen commented 9 years ago

Yes, that's not fixed in this branch, but I have a branch that fixes that.

douglasjacobsen commented 9 years ago

If you want to test all of the fixes together, try this branch: https://github.com/douglasjacobsen/MPAS/tree/hotfix_v3.2

mark-petersen commented 9 years ago

I just tested the hotfix_v3.2 and the output files look great. Thanks for working on this.

douglasjacobsen commented 9 years ago

This was fixed in: https://github.com/MPAS-Dev/MPAS/commit/23e4678657c450f386637d01825811ed2f3e11bb