geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io
Other
21 stars 25 forks source link

[BUG/ISSUE] Unable to restart stretched-grid simulations #287

Closed kilicomu closed 1 year ago

kilicomu commented 1 year ago

What institution are you from?

Wolfson Atmospheric Chemistry Laboratories.

Description of the problem

Even with the new regridding process covered in https://github.com/geoschem/gcpy/pull/190, it is not possible to complete a multi-segment GCHP stretched-grid run.

A regridded restart can have the correct stretched grid parameters, e.g.

// global attributes:
                :history = "20221123 141751.613: Generated by GCHP regridding tool v0.4" ;
                :title = "Regridded output data" ;
                :format = "netCDF-3" ;
                :STRETCH_FACTOR = 6.f ;
                :TARGET_LAT = 64.f ;
                :TARGET_LON = -32.f ;

but checkpoint files will be produced with stretched grid parameters that will cause the Factories not equal error (#273) on restart, e.g.

// global attributes:
                :STRETCH_FACTOR = 6.f ;
                :TARGET_LAT = 0.5639159f ;
                :TARGET_LON = 5.153085f ;

Whilst resetting these attributes back to the original values will let the simulation continue, I don't know enough about the internals of the stretched-grid to know whether or not this will have a negative effect on the simulation.

GEOS-Chem version

14.1.0

Description of code modifications

None.

Log files

Software versions

kilicomu commented 1 year ago

I'm going to look into the stretched grid in detail to try and make sense of the problem. If anybody has insight into the internal grid of MAPL, I'd be happy to hear about it!

sdeastham commented 1 year ago

I felt certain that this had to be something to do with translation between radians and degrees but that doesn't quite seem to explain it, unless the translation is also off by a factor of not exactly 2... which seems unlikely. @atrayano or @bena-nasa may have some insight?

lizziel commented 1 year ago

Regardless of the strange math that would result in those values, I think the fix would be to retrieve the target values from the config file rather than the grid object when putting together the data to write to file in NCIO.F90.

Instead of this...

       call ESMF_AttributeGet(arrdes%grid,name="TARGET_LON",value=target_lon,rc=status)
       call ESMF_AttributeGet(arrdes%grid,name="TARGET_LAT",value=target_lat,rc=status)

...it should be something like this...

      call ESMF_ConfigGetAttribute(config, target_lon, label=prefix//'TARGET_LON:', default=MAPL_UNDEFINED_REAL)                                                                                              
      call ESMF_ConfigGetAttribute(config, target_lat, label=prefix//'TARGET_LAT:', default=MAPL_UNDEFINED_REAL)

Or the grid attributes need to be expanded to include the value in the config file in addition to radians. That would be in MAPL_CubedSphereGridFactory.F90.

kilicomu commented 1 year ago

@lizziel I have tried to patch something in along the lines of your first suggestion, so far to no avail. It's not obvious to me which data structure would house the config at the point in NCIO.F90 where the grid attributes are being queried! I tried passing the STATE object through from the calling context but haven't yet figured out what its members are, so I'm unsure as to whether or not it contains config information.

I've also tried patching in some new _RESTART attributes in MAPL_CubedSphereGridFactory.F90 to then try and retrieve those at writeout time, instead of the grid attributes. The way I implemented it didn't seem to work.

I think I'm missing something more fundamental, though - I can't see where MAPL_StateVarWriteNCPar, the caller of MAPL_BundleWriteNCPar, is called? I tried to drop some debugging write statements in to get a sense for it but didn't get anywhere, and ran a find -exec grep over the whole GCHP src directory to only find three references to it:

./MAPL/base/NCIO.F90:59:     module procedure MAPL_StateVarWriteNCPar
./MAPL/base/NCIO.F90:3927:  subroutine MAPL_StateVarWriteNCPar(filename, STATE, ARRDES, CLOCK, NAME, forceWriteNoRestart, oClients, RC)
./MAPL/base/NCIO.F90:4087:  end subroutine MAPL_StateVarWriteNCPar

What am I not getting? 😄

lizziel commented 1 year ago

Hmm, I'm going to take a look at this and try something. We are going to release 14.1.1 very soon and I want to fix this in that version. Stay tuned...

lizziel commented 1 year ago

@kilicomu, I reported this to GMAO at a meeting earlier today and at https://github.com/GEOS-ESM/MAPL/issues/1977. Thanks @bena-nasa for looking into it.

lizziel commented 1 year ago

I wrote a quick fix for GCHP. It's a hack and I will replace it with a better way from GMAO once there is one. See https://github.com/geoschem/MAPL/commit/48ad6c9606e9016960a2b67ef802befc7080b55f.

lizziel commented 1 year ago

Even better, use this: https://github.com/GEOS-ESM/MAPL/pull/1736. Apparently this issue was known only to @LiamBindle and the follow-up was after he left. I'll add the fix to 14.1.1. Actually, the GEOS-ESM/MAPL fix does not work since the values are very slightly different. My geoschem/MAPL fix above does work if you need an immediate work-around.

kilicomu commented 1 year ago

@lizziel Great news that it's a known thing and a straightforward fix. I'll take a look at some of the options above and try your workaround. I think our group is ok using more resources and doing single segment runs for now, but if we need to we can also ncatted the checkpoint files to restore the original stretched grid attributes to save people all needing patched versions of the code.

(I am also still v. interested to know what the call path for those subroutines is - if you have a sec sometime can you explain it to me?)

lizziel commented 1 year ago

The call path is a bit cryptic because of the use of procedures and interfaces, some of which are in derived types. Here's what the call traceback is all the way to GCHPctm.F90. There are many layers.

  1. Writing the checkpoint global attributes is in NCIO.F90 subroutine MAPL_BundleWriteNCPar
  2. Calling MAPL_BundleWriteNCPar is in NCIO.F90 subroutine MAPL_StateVarWriteNCPar
  3. MAPL_StateVarWriteNCParis a procedure within NCIO.F90 interface MAPL_VarWriteNCPar, meaning it is invoked given a certain set of arguments passed to MAPL_VarWriteNCPar
  4. Interface MAPL_VarWriteNCPar is called within MAPL_Generic.F90 subroutine MAPL_ESMFStateWriteToFile
  5. MAPL_ESMFStateWriteToFile is called within several MAPL_Generic.F90 subroutines: MAPL_GenericFinalize, MAPL_StateRecord, and MAPL_GenericStateSave. Only MAPL_GenericFinalize is actually ever used.
  6. MAPL_GenericFinalize is called within the Finalize subroutine of every gridded component. This makes sense since a different checkpoint is written for each gridded component. We only have one since we only have one gridded component with internal state fields (GEOS-Chem). The FVdycore and GCHPctmEnv gridded components have no declared internal state fields.
  7. The Finalize subroutines are all invoked within a chain of hierarchical gridded component Finalize calls controlled by ESMF_Finalize, called within MAPL_Cap.F90 subroutine run_model within the MAPL_Cap derived type. This can ultimately be traced to the call cap%run(_RC) in GCHPctm.F90, the driver of the whole model, through mid-level calls to run_member and run_ensemble.
lizziel commented 1 year ago

@kilicomu, is this issue good to close?

kilicomu commented 1 year ago

@lizziel I think so, if you've tested the radians => degrees change and are happy. I'll reopen if it recurs!

On Wed, 22 Mar 2023 at 14:04, Lizzie Lundgren @.***> wrote:

@kilicomu https://github.com/kilicomu, is this issue good to close?

— Reply to this email directly, view it on GitHub https://github.com/geoschem/GCHP/issues/287#issuecomment-1479625677, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACG5SE7LEQF2SUUT6SDCG23W5MBG3ANCNFSM6AAAAAAU2FGKCE . You are receiving this because you were mentioned.Message ID: @.***>

-- Killian Murphy Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories University of York Heslington York YO10 5DD +44 (0)1904 32 3634

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm

lizziel commented 1 year ago

See also my new fix for this which is much simpler. It is going into 14.2. All of the deleted lines in the PR were added in my last fix. https://github.com/geoschem/MAPL/pull/28

kilicomu commented 1 year ago

Oh yeah, nice!

On Wed, 22 Mar 2023 at 16:32, Lizzie Lundgren @.***> wrote:

See also my new fix for this which is much simpler. It is going into 14.2. All of the deleted lines in the PR were added in my last fix. geoschem/MAPL#28 https://github.com/geoschem/MAPL/pull/28

— Reply to this email directly, view it on GitHub https://github.com/geoschem/GCHP/issues/287#issuecomment-1479897274, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACG5SEYCFQKDSN7BLNIPJNLW5MSQRANCNFSM6AAAAAAU2FGKCE . You are receiving this because you were mentioned.Message ID: @.***>

-- Killian Murphy Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories University of York Heslington York YO10 5DD +44 (0)1904 32 3634

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm