Open sdeastham opened 3 years ago
As an intermediate step, are the error conditions for using incorrect resolution in the restart at least reasonably informative? This seems that it would at least improve the situation for those that do "not realize they are suing restarts from the wrong version".
While it would be possible to add this ability to the IO layer (MAPL_IO) we are really getting to the point that if we want to start adding new capabilities like this, it really needs a refactoring before embarking on adding new functionality to MAPL_IO. It as become too unwieldy. (This echos what I said for this request: https://github.com/GEOS-ESM/MAPL/issues/635)
What about an intermediate solution. I already distribute a tool for regridding History output as a MAPL utility as a part of MAPL. It as gotten to the point I need to do some updates. I could extend it to handle at least the atmospheric restart files and just be another step in your workflow the user runs and at the scripting level just regrid the C48 restart to whatever is desired before running the GCHP executable. This would only take a minute or two to run.
Having restarts generally go through ExtData is not realistic at this time. However, for a special case like GCHP, you certainly could trick ExtData into doing this. I'm not sure how GCHP works, but certainly you can declare any import you want at the component level and have ExtData fill it as long as it is a valid file. You could imagine if you have internal state variables you want filled in a component, declaring identical imports, have extdata fill those on it's first run and specify it only fills those once, then copy those to the internal state variables, then destroy the fields in the import state you no longer need. You could do this in lieu of having a restart file for the component that is read by the existing IO machinery or restarts.
Thanks for the swift response @tclune and @bena-nasa !
That having been said, I'll wait until @lizziel is back (I think a few months from now) and has had a chance to comment before diving further into this.
That first item should definitely be addressed asap. Hopefully just a missing _VERIFY()
or _RETURN()
somewhere.
That first item should definitely be addressed asap. Hopefully just a missing
_VERIFY()
or_RETURN()
somewhere.
Interesting. It looks like we are "spared" this in GEOS because of the saltwater checker. If I comment that out, indeed, we can run a little while with C48 restarts until the model falls apart. I'll make an issue.
ETA: As seen in the issue linked below, you can indeed run GEOS for a day at C24 with C48 restarts with a few hacks to scripting. Amazing.
Thanks for the follow-up @mathomp4 ! That example you showed was.. wow. I wonder what happens if you run (say) C48 with a C24 restart? In any case, I'd be very much in favor of a runtime, in-code verification rather than a script-based guard if at all possible!
Thanks for the follow-up @mathomp4 ! That example you showed was.. wow. I wonder what happens if you run (say) C48 with a C24 restart? In any case, I'd be very much in favor of a runtime, in-code verification rather than a script-based guard if at all possible!
@sdeastham We seem to be better there. As @bena-nasa was telling me, we are using pure netCDF to do some of this. It turns out netCDF is not happy when you ask for c48 amount of data and only c24 worth exists on the files:
Using parallel NetCDF for file: fvcore_internal_rst
pe=00000 FAIL at line=00037 NetCDF4_get_var.H <status=-57>
Error reading variable -57
NetCDF: Start+count exceeds dimension bound
pe=00000 FAIL at line=07253 MAPL_IO.F90 <status=-57>
pe=00000 FAIL at line=05426 MAPL_IO.F90 <status=-57>
pe=00000 FAIL at line=01242 MAPL_IO.F90 <status=-57>
pe=00000 FAIL at line=07392 MAPL_IO.F90 <status=-57>
pe=00000 FAIL at line=07657 MAPL_IO.F90 <status=-57>
pe=00000 FAIL at line=05681 MAPL_Generic.F90 <status=-57>
But in the other case, I guess we just read the first c24 worth of data from a c48 restart and it satisfies netCDF because the buffer was filed. I'm more amazed our model didn't go crazy and crash. The world does not look right in the case in #643
I'm just getting up to speed on this issue now. I am intrigued by the idea of using an Import to set Internal via ExtData. This would solve a few things for GCHP:
The main downside I see is it feels wrong to skip using the Internal state checkpoints. I'll think more about this to try to come up with reasons not to do it.
@sdeastham, what are your thoughts on this?
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.
Labeling as long term. This probably will take a bit of thought and work by @bena-nasa , et al.
As I've said many times before, MAPL_IO has gotten to the point that it needs to be cleaned up and refactored before we add any more capabilities. The code is out of control and adding new capabilities is just getting to the point where is is exceedingly difficult. The binary branch just needs to be pulled out for historical support and a re-write of the netcdf routines needs to happen and hopefully leverage some commonality with the other IO layers used.
Thanks @bena-nasa. To reiterate our wish list, it would be fantastic if the new design would give the following behavior:
@sdeastham, am I missing anything? @LiamBindle, any special items to add for stretched grid?
IMO, I would prefer if the grid was validated by checking that a few of the grid coordinates are correct. E.g., check that the latitude and longitude of the [0,0] grid-box of each face is what we expect. This way, we don't introduce custom attributes that need to be present for a simulation to run, and it works for stretched-grids and cubed-sphere.
On the GCHP side, I think it's useful to include attributes like (a) details about the simulation that generated the restart file, (b) stretching parameters if applicable, but I don't think these attributes should be used to enforce anything.
The GEOS-Chem community would benefit greatly from being able to use a single restart (e.g. C24) for simulations even when at different resolutions (e.g. C48, C180). Although the overhead of regridding the restart is relatively minor, it introduces a failure mode for our community. We currently regrid a single, spun up, C48 restart to all of the "likely" simulation resolutions for each major update to the model, and these restarts are made available to the community. This means that:
We would therefore greatly appreciate it if the restart read-in procedure could include regridding of the data, perhaps through ExtData. In order to avoid disrupting standard operation of GEOS, this behavior could perhaps be permitted only if restart regridding is explicitly enabled in a relevant resource file.
Tagging @tclune and @lizziel following recent discussions.