GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

[FEATURE REQUEST] Automatic regridding of restart #639

Open sdeastham opened 3 years ago

sdeastham commented 3 years ago

The GEOS-Chem community would benefit greatly from being able to use a single restart (e.g. C24) for simulations even when at different resolutions (e.g. C48, C180). Although the overhead of regridding the restart is relatively minor, it introduces a failure mode for our community. We currently regrid a single, spun up, C48 restart to all of the "likely" simulation resolutions for each major update to the model, and these restarts are made available to the community. This means that:

We would therefore greatly appreciate it if the restart read-in procedure could include regridding of the data, perhaps through ExtData. In order to avoid disrupting standard operation of GEOS, this behavior could perhaps be permitted only if restart regridding is explicitly enabled in a relevant resource file.

Tagging @tclune and @lizziel following recent discussions.

tclune commented 3 years ago

As an intermediate step, are the error conditions for using an incorrect resolution in the restart at least reasonably informative? Making them so would at least improve the situation for those who do "not realize they are using restarts from the wrong version".

bena-nasa commented 3 years ago
sdeastham commented 3 years ago

Thanks for the swift response @tclune and @bena-nasa !

That having been said, I'll wait until @lizziel is back (I think a few months from now) and has had a chance to comment before diving further into this.

tclune commented 3 years ago

That first item should definitely be addressed asap. Hopefully just a missing _VERIFY() or _RETURN() somewhere.

mathomp4 commented 3 years ago

> That first item should definitely be addressed asap. Hopefully just a missing _VERIFY() or _RETURN() somewhere.

Interesting. It looks like we are "spared" this in GEOS because of the saltwater checker. If I comment that out, indeed, we can run a little while with C48 restarts until the model falls apart. I'll make an issue.

ETA: As seen in the issue linked below, you can indeed run GEOS for a day at C24 with C48 restarts with a few hacks to scripting. Amazing.

sdeastham commented 3 years ago

Thanks for the follow-up @mathomp4 ! That example you showed was.. wow. I wonder what happens if you run (say) C48 with a C24 restart? In any case, I'd be very much in favor of a runtime, in-code verification rather than a script-based guard if at all possible!
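A minimal sketch of what such a runtime check could look like, written in Python for illustration rather than as MAPL code. The function name `check_restart_resolution` is hypothetical, and the dimension names and layout (`lon` = N and `lat` = 6N for a CN cubed-sphere file) are assumptions, not something confirmed in this thread.

```python
def check_restart_resolution(restart_dims, run_n):
    """Raise ValueError if a cubed-sphere restart does not match the run grid.

    restart_dims: mapping of dimension name -> length, as read from the restart.
    run_n: cube face size N of the run grid (24 for C24, 48 for C48, ...).
    """
    # ASSUMPTION: six N x N faces are stacked along "lat", so lon = N, lat = 6N.
    expected = {"lon": run_n, "lat": 6 * run_n}
    for name, want in expected.items():
        got = restart_dims.get(name)
        if got != want:
            raise ValueError(
                f"restart dimension {name!r} is {got}, but a C{run_n} run "
                f"expects {want}; regrid the restart or change the run grid"
            )


# A C48 restart passes for a C48 run and is rejected for a C24 run.
check_restart_resolution({"lon": 48, "lat": 288}, 48)
try:
    check_restart_resolution({"lon": 48, "lat": 288}, 24)
except ValueError as err:
    print(err)
```

Checking before any data is read means a mismatch fails in both directions, instead of relying on the read itself to run out of data.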

mathomp4 commented 3 years ago

> Thanks for the follow-up @mathomp4 ! That example you showed was.. wow. I wonder what happens if you run (say) C48 with a C24 restart? In any case, I'd be very much in favor of a runtime, in-code verification rather than a script-based guard if at all possible!

@sdeastham We seem to be better there. As @bena-nasa was telling me, we are using pure netCDF to do some of this. It turns out netCDF is not happy when you ask for c48 amount of data and only c24 worth exists on the files:

 Using parallel NetCDF for file: fvcore_internal_rst
pe=00000 FAIL at line=00037    NetCDF4_get_var.H                        <status=-57>
 Error reading variable          -57
 NetCDF: Start+count exceeds dimension bound

pe=00000 FAIL at line=07253    MAPL_IO.F90                              <status=-57>
pe=00000 FAIL at line=05426    MAPL_IO.F90                              <status=-57>
pe=00000 FAIL at line=01242    MAPL_IO.F90                              <status=-57>
pe=00000 FAIL at line=07392    MAPL_IO.F90                              <status=-57>
pe=00000 FAIL at line=07657    MAPL_IO.F90                              <status=-57>
pe=00000 FAIL at line=05681    MAPL_Generic.F90                         <status=-57>

But in the other case, I guess we just read the first c24 worth of data from a c48 restart and it satisfies netCDF because the buffer was filled. I'm more amazed our model didn't go crazy and crash. The world does not look right in the case in #643.
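The asymmetry between the two cases follows from the rule named in the error message ("Start+count exceeds dimension bound"). The sketch below models only that rule in Python; it is not the netCDF implementation, `hyperslab_fits` is a hypothetical helper, and the 6N row count per file is an assumed layout.

```python
def hyperslab_fits(start, count, dim_len):
    """Return True if indices [start, start + count) lie inside a dimension."""
    return start >= 0 and count >= 0 and start + count <= dim_len


# ASSUMPTION: the stacked-face dimension has 6 * N rows for a CN grid.
c24_rows, c48_rows = 6 * 24, 6 * 48

# C48 run, C24 file: 288 rows requested, 144 available, so the read is rejected.
print(hyperslab_fits(0, c48_rows, c24_rows))

# C24 run, C48 file: 144 rows requested, 288 available, so the read succeeds
# even though the data belong to a different grid.
print(hyperslab_fits(0, c24_rows, c48_rows))
```

A request that is too small for the file is still a valid request, which is why only an explicit comparison of file and run dimensions catches the second case.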

lizziel commented 3 years ago

I'm just getting up to speed on this issue now. I am intrigued by the idea of using an Import to set Internal via ExtData. This would solve a few things for GCHP:

  1. We could output an instantaneous Restart collection via History, as we do for GEOS-Chem Classic, which may be more intuitive to users.
  2. Restarts would automatically be timestamped in the filename, avoiding the need to add it as post-processing in the run script.
  3. The restarts would have the same file format as the History diagnostics, making data analysis tools simpler and viewing the restart with Panoply possible.
  4. Restarts would not be vertically inverted relative to History diagnostics as they are now, a feature that tends to cause confusion.
  5. Beyond being able to use cubed-sphere restarts at different resolutions, users would be able to reuse lat-lon restarts generated by GEOS-Chem Classic without any pre-processing (I'm not saying we would recommend this, just that it would be possible).

The main downside I see is it feels wrong to skip using the Internal state checkpoints. I'll think more about this to try to come up with reasons not to do it.

@sdeastham, what are your thoughts on this?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

mathomp4 commented 2 years ago

Labeling as long term. This probably will take a bit of thought and work by @bena-nasa , et al.

bena-nasa commented 2 years ago

As I've said many times before, MAPL_IO has gotten to the point that it needs to be cleaned up and refactored before we add any more capabilities. The code is out of control, and adding new capabilities is getting to the point where it is exceedingly difficult. The binary branch just needs to be pulled out for historical support, and a rewrite of the netCDF routines needs to happen, hopefully leveraging some commonality with the other IO layers used.

lizziel commented 2 years ago

Thanks @bena-nasa. To reiterate our wish list, it would be fantastic if the new design would give the following behavior:

  1. Automatically regrid input restart to run resolution
  2. Indicate the file specs of the input restart in the log, and any pertinent information on the regridding, perhaps through pFlogger
  3. Use same grid format in both History and checkpoints

@sdeastham, am I missing anything? @LiamBindle, any special items to add for stretched grid?

LiamBindle commented 2 years ago

IMO, I would prefer if the grid was validated by checking that a few of the grid coordinates are correct. E.g., check that the latitude and longitude of the [0,0] grid-box of each face is what we expect. This way, we don't introduce custom attributes that need to be present for a simulation to run, and it works for stretched-grids and cubed-sphere.
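A rough Python sketch of this coordinate spot-check, under stated assumptions: `grid_matches` is a hypothetical helper, the caller supplies one (lat, lon) pair in degrees per face for both the file and the run grid, and the tolerance is an arbitrary choice. The coordinates in the example are made up for illustration.

```python
def grid_matches(file_corners, expected_corners, tol_deg=1e-4):
    """Compare the (lat, lon) of each face's first grid box with expectations."""
    if len(file_corners) != len(expected_corners):
        return False
    for (lat_f, lon_f), (lat_e, lon_e) in zip(file_corners, expected_corners):
        # Wrap the longitude difference into [-180, 180) so that -55 and 305
        # are treated as the same meridian.
        dlon = abs((lon_f - lon_e + 180.0) % 360.0 - 180.0)
        if abs(lat_f - lat_e) > tol_deg or dlon > tol_deg:
            return False
    return True


# Hypothetical corner coordinates for two faces (not taken from a real grid).
expected = [(-35.0, 305.0), (-35.0, 35.0)]
print(grid_matches([(-35.0, -55.0), (-35.0, 35.0)], expected))
print(grid_matches([(-34.0, 305.0), (-35.0, 35.0)], expected))
```

Because the check uses only coordinates stored in the file, it needs no custom attributes and applies equally to stretched and unstretched grids, as long as the expected coordinates come from the run grid itself.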

On the GCHP side, I think it's useful to include attributes like (a) details about the simulation that generated the restart file, (b) stretching parameters if applicable, but I don't think these attributes should be used to enforce anything.