FESOM / fesom2

Multi-resolution ocean general circulation model.
http://fesom.de/
GNU General Public License v3.0

Do not add timesteps to existing restart files #485

Status: Closed (JanStreffing closed 2 months ago)

JanStreffing commented 1 year ago

The issue: Currently, when FESOM2 finds preexisting restart files in a restart folder for the given year (e.g. fesom.1849.oce.restart/ssh.nc) and we ask it to write monthly restarts during year 1849 (e.g. during the first year of a spinup from PHC3), it keeps appending restart timesteps to the netCDF files therein:

cdo sinfo temp.nc_18490501-18490531
   File format : NetCDF4 classic
    -1 : Institut Source   T Steptype Levels Num    Points Num Dtype : Parameter ID
     1 : unknown  unknown  v instant       1   1         1   1  I32  : -1            
     2 : unknown  unknown  v instant       1   1 249666860   2  F64  : -2            
   Grid coordinates :
     1 : generic                  : points=1
     2 : generic                  : points=249666860 (3160340x79)
   Vertical coordinates :
     1 : surface                  : levels=1
   Time coordinate :
                             time : 5 steps
  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss
  0000-00-00 00:00:00  0000-00-00 00:00:00  0000-00-00 00:00:00  0000-00-00 00:00:00
  0000-00-00 00:00:00
cdo    sinfo: Processed 2 variables over 5 timesteps [0.16s 33MB]

Upon restarting from such a folder, FESOM2 uses the latest timestep found in the files and checks that it matches fesom.clock. If the timesteps don't match, the model exits.

I find this behavior unsafe. In particular, when one accidentally links a restart from a pool_dir to a work folder instead of copying it, FESOM will try to modify the original restart files in the pool_dir and start adding timesteps there. In the worst case, the user has write permissions on the pool_dir and the restart file is actually modified. This recently happened to @mzapponi.

Proposed solution: I think a better solution would be to put a more detailed timestamp in the restart folder name, e.g. YYYY-MM-DD-HH-MM-SS, or at least YYYY-MM-DD. Instead of checking whether the folder/file exists and, if it does, adding a timestep, we can check whether it exists and, if so, exit the model. This way we never accidentally modify an existing restart file.
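To make the proposed check concrete, here is a minimal sketch in Python (not FESOM2's Fortran code). The folder naming scheme and the function names `restart_dir_name` and `create_restart_dir` are hypothetical and only illustrate the "exit if it already exists" idea:

```python
from datetime import datetime
from pathlib import Path


def restart_dir_name(stamp: datetime, component: str = "oce") -> str:
    # Hypothetical naming scheme: fesom.YYYY-MM-DD-HH-MM-SS.<component>.restart
    return f"fesom.{stamp:%Y-%m-%d-%H-%M-%S}.{component}.restart"


def create_restart_dir(parent: Path, stamp: datetime, component: str = "oce") -> Path:
    target = parent / restart_dir_name(stamp, component)
    try:
        # exist_ok=False makes mkdir raise instead of reusing an existing folder
        target.mkdir(parents=True, exist_ok=False)
    except FileExistsError:
        # Exit rather than append to (and thereby modify) an existing restart
        raise SystemExit(f"restart folder {target} already exists, refusing to modify it")
    return target
```

With this scheme each restart gets its own folder, so a restart linked in from a pool_dir is never opened for writing; a second write for the same model time stops the run instead.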

Unless I hear a strong no to this suggestion, I will create a draft for such a change soon.

@patrickscholz @hegish @dsidoren

trackow commented 1 year ago

I would find this a very important and useful functionality, also for the IFS-FESOM workflow with RAPS.

JanStreffing commented 1 year ago

I discussed this with some colleagues over lunch: Not only does this behavior force us to copy the restarts instead of linking them, it also makes the restart files larger by a factor of timesteps_in_the_file. E.g. a DART restart folder with a single timestep is 31GB; one with 12 timesteps is thus 372GB, costing us extra space and time.

patrickscholz commented 1 year ago

I think we store the timestep in the netCDF restart file and check it against the model's timestep so that a restart can happen at an arbitrary point, not just at a full year, month or day. Especially for debugging it is pretty convenient to be able to restart from a specific model timestep, e.g. just before a blowup occurs! If you want to cover all these possibilities you will need a pretty long folder name. In that case you would need the full timestamp description YYYY-MM-DD-HH-MM-SS. Or you use the timestamp number (seconds within the year) directly in the folder name, something like YYYY-timestampnumber (e.g. YYYY-31535400).
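For illustration, the "seconds within the year" variant of the folder name could be computed as in the sketch below. This is Python rather than FESOM2 code, the function names are hypothetical, and it assumes the Gregorian calendar of Python's `datetime`, which may differ from the calendar the model is configured with:

```python
from datetime import datetime


def seconds_in_year(stamp: datetime) -> int:
    # Seconds elapsed since 00:00:00 on 1 January of the same year
    start = datetime(stamp.year, 1, 1)
    return int((stamp - start).total_seconds())


def folder_suffix(stamp: datetime) -> str:
    # Hypothetical YYYY-timestampnumber form of the folder name
    return f"{stamp.year:04d}-{seconds_in_year(stamp)}"


# 23:50:00 on 31 December of a non-leap year is 600 s before the year ends
print(folder_suffix(datetime(1849, 12, 31, 23, 50, 0)))  # prints 1849-31535400
```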

JanStreffing commented 1 year ago

IMO a longer folder name is an okay price for the ability to link it from pool_dir safely.

JanStreffing commented 2 months ago

This is the third issue for the same problem. See also: https://github.com/FESOM/fesom2/issues/279 and #617. Closing here; let's use the oldest issue.