GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0
27 stars 18 forks source link

AddChildFromDSO Refactoring has broken MOM6 runs with MAPL3 #1678

Closed mathomp4 closed 1 year ago

mathomp4 commented 2 years ago

Last night's (2022-Sep-15) nightly run of MOM6 with MAPL3 showed failure. The first error was:

h-point: mean=   0.0000000000000000E+00 min=   0.0000000000000000E+00 max=   0.0000000000000000E+00 Post extract_sfc frazil
h-point: c=         0 Post extract_sfc frazil
  Surface velocity stagger set in ocean model: (MOM6) CGRID_NE.
pe=00000 FAIL at line=00076    GetPointer.H                             <status=13>
MOM6Initialize                                 873
pe=00000 FAIL at line=01645    MAPL_Generic.F90                         <status=13>
MOM6Initialize                                 873

But that apparently isn't full trapped up the tree, so it was eventually followed by:

 Real*4 Resource Parameter: TAUCRIT:.100000
ORADRun                                        462
OGCMRun                                       2102
Run_Ocean                                     2099
GCMRun                                        1909
pe=00000 FAIL at line=00076    GetPointer.H                             <status=13>
pe=00000 FAIL at line=01645    MAPL_Generic.F90                         <status=13>
pe=00000 FAIL at line=01765    MAPL_Generic.F90                         <status=13>
pe=00000 FAIL at line=01645    MAPL_Generic.F90                         <status=13>
pe=00000 FAIL at line=01645    MAPL_Generic.F90                         <status=13>

The night before it worked with f9a99b79, so therefore the difference in MAPL is https://github.com/GEOS-ESM/MAPL/compare/f9a99b79...fcd7a312 (as fcd7a312 is the hash of MAPL3 as of this writing).

I have looked at the complete source code between two night's ago and last night and the only differences are in MAPL and they are the differences above.

As a test, I did a clone of MAPL3 GEOS and checked out f9a99b79 and built that and it does indeed work.

The thing is...this is a very boring chance to MAPL_Generic. It is just what is in #1672 which is...just removing duplicated code! I took a look with @bena-nasa and we do not see any mistake in that change (in develop or release/MAPL-v3.

I am...baffled.

I'll mention @tclune as well as @sanAkel just because it is MOM6 related...ish.

mathomp4 commented 2 years ago

Also, at the suggestion of @bena-nasa I ran with ESMF logging turned on and we see:

20220916 092028.285 ERROR            PET0 ESMF_LocalArrayGet.F90:2484 ESMF_LocalArrayGetData Arguments are incompatible  - farrayPtr rank does not match LocalArray rank
20220916 092028.285 ERROR            PET0 ESMF_ArrayGet.F90:1888 ESMF_ArrayGetFPtr Arguments are incompatible  - Internal subroutine call returned Error
20220916 092028.285 ERROR            PET0 ESMF_FieldGet.F90:2513 ESMF_FieldGetDataPtr Arguments are incompatible  - Internal subroutine call returned Error

I also did a run with Debugging enabled and we still see:

  Surface velocity stagger set in ocean model: (MOM6) CGRID_NE.
MOM6Initialize                                 873
pe=00003 FAIL at line=00076    GetPointer.H                             <status=13>
pe=00003 FAIL at line=01645    MAPL_Generic.F90                         <status=13>

but also:

 Character Resource Parameter: OCEAN_INTERNAL_RESTART_FILE:ocean_internal_rst
 Bootstrapping ocean_internal_rst
forrtl: severe (408): fort: (33): Shape mismatch: The extent of dimension 2 of array MASKO is 18 and the corresponding extent of array MASK is 12

Image              PC                Routine            Line        Source
GEOSgcm.x          000000000F5659EF  Unknown               Unknown  Unknown
GEOSgcm.x          000000000F4FCC73  geos_oceangridcom         821  GEOS_OceanGridComp.F90
libesmf.so         00002AD9F520C554  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         00002AD9F52100CF  ESMCI_FTableCallE     Unknown  Unknown

It's like the memory state got messed up, but messed up carefully enough that a different array is there...but not the right one???

sanAkel commented 2 years ago

I am afraid I can't look into this today, but may help if try running with MOM5- the DSO "should" start "acting" if that's the cause??

mathomp4 commented 2 years ago

I am afraid I can't look into this today, but may help if try running with MOM5- the DSO "should" start "acting" if that's the cause??

@sanAkel No need for you to test, this was obviously "caused" by MAPL. Why I have no idea, but we are looking.

sanAkel commented 2 years ago

Thank you @mathomp4

Maybe sometime soon we should discuss whether we want to simply build a specific ocean model and be done with this (DSO) altogether! It served us well (so far)!

mathomp4 commented 2 years ago

More updates. I made a version of the MAPL_Generic.F90 from release/MAPL-v3 but I brought back the changes from AddChildFromDSO in f9a99b79 and just shoved them in by hand and that runs.

sanAkel commented 2 years ago

Glad that it runs! Can it also switch (toggle) between MOM5/MOM6?

Also what about land "stuff"? Isn't something there (land) also using DSO?

mathomp4 commented 2 years ago

Glad that it runs! Can it also switch (toggle) between MOM5/MOM6?

Also what about land "stuff"? Isn't something there (land) also using DSO?

Well, it runs with a bad workaround. @bena-nasa is looking at what's happening in deeper detail. For now I'll keep the code around and we might wait for @tclune to return and talk with him. This kludge shouldn't be necessary!

bena-nasa commented 2 years ago

Well, I figured out the what, of what is going wrong. I'll detail is here. As to the why, this maybe require some brainstorming with @tclune and @atrayano Here's the deal. As Matt found and I've reproduced, in the MOM6_Plug, there's a GetPointer that fails and ESMF claims the pointer rank is different the field rank (which I did confirm). Furthermore we both found that when you turn on the debug build up in the GEOS_OgcmGridComp.F90 you fail on an array shape mismatch.

The initial symptoms seemed to be pointing to the fields not being created correctly.

I first put prints in GetPointer and I added a new export in the Export state of the plug that I called BENBEN. Since I know it can't be connected to anything when I call GetPointer to get it with the other GetPointers in initialize that started this, it should not be allocated and would have to create it. Well the first thing that made no sense was that GetPointer said BENBEN was already created. This just should not be the case since it is involved in no connectivities and not requested by History. So I start putting more prints in the places where we create the fields for the state in the first place.

After putting a lot prints in I've traced it to this: During MAPL_GenericInitialize the import/internal/export states get created ultimately via a call in MAPL_Generic.F90 to MAPL_StateCreateFromSpecNew which basically translates the VarSpec to the actual state.

It just loops over the individual varspecs for the state, gets the short_name, long_name, units, etc. One of the items obtained from the varspec is an ESMF_Field which a local variable spec_field. Inside the VarSpecType is a pointer to an ESMF_Field, the contents of which depend on the connectivity etc during the wiring. Now for the crucial part. Around like 6370 there's this logic where I put in some prints in the loop over the var specs:

 6369          isCreated = ESMF_FieldIsCreated(SPEC_FIELD, rc=status)
 6370          _VERIFY(status)
 6371
 6372          if (mapl_am_i_root()) write(*,*)"bmaa is created ",isCreated,' ',trim(short_name)
 6373          if (isCreated) then
 6374             block
 6375                character(len=ESMF_MAXSTR) :: lets_get_fname
 6376                call ESMF_FIeldGet(spec_field,name=lets_get_fname,_RC)
 6377                if (mapl_am_i_root()) write(*,*)"bmaa found created ",trim(short_name),' ',trim(lets_get_fname),'       ',trim(long_name),' ',trim(units)
 6378             end block
 6379             call MAPL_AllocateCoupling( SPEC_FIELD, RC=status ) ! if 'DEFER' this allocates the data
 6380             _VERIFY(status)

Basically it checks this spec_field and asks is it created and goes from there. As you can see I print: is the spec_field created. Then if it is in that if statement I get the name from the spec field and print that along with the short_name, long_name, and Units from the varspec. And here's a sample of what I see corresponding to items in the export state of the plug, notice what happens when you get to the variable TW:

 bmaa is created  T  UW
 bmaa found created UW UW surface_Agrid_eastward_velocity m s-1
 bmaa is created  T  VW
 bmaa found created VW VW surface_Agrid_northward_velocity m s-1
 bmaa is created  T  UWB
 bmaa found created UWB UWB surface_Bgrid_X_velocity m s-1
 bmaa is created  T  VWB
 bmaa found created VWB VWB surface_Bgrid_Y_velocity m s-1
 bmaa is created  T  TW
 bmaa found created TW OH surface_temperature K
 bmaa is created  T  SW
 bmaa found created SW CO_BIOMASS_VOC surface_salinity psu
 bmaa is created  T  MOM_2D_MASK
 bmaa found created MOM_2D_MASK CO_BF_VOC MOM_ocean_mask_at_t-points 1
 bmaa is created  T  AREA
 bmaa found created AREA CO_FS_VOC MOM_ocean_area_at_t-points m+2
 bmaa is created  T  SLV
 bmaa found created SLV SLV sea_level_with_ice_loading_and_invBaro m
 bmaa is created  T  FRAZIL
 bmaa found created FRAZIL CO_CH4 heating_from_frazil_formation W m-2
 bmaa is created  T  MELT_POT
 bmaa found created MELT_POT CO_BF heat_that_can_be_used_to_melt_sea_ice W m-2
 bmaa is created  T  FRZMLT
 bmaa found created FRZMLT CO_FS freeze_melt_potential W m-2
 bmaa is created  T  DUM1
 bmaa found created DUM1 CO_ISOP dummy_export1 1
 bmaa is created  T  DUM2
 bmaa found created DUM2 CO_NVOC dummy_export2 1
 bmaa is created  T  BENBEN
 bmaa found created BENBEN CO_TERP dummy_export2 1

Somehow the spec_field pulled out of the varspec is pointing to a field from GOCART for some of the varspecs! This explains why the fields were either the wrong dimension or the shape of the atmospheric fields. These end up getting used in the MOM6 plug! That's as far as I've gotten.

Now I also looked at Matt's changes to the addchildFromDSO, it really didn't change anything as far as the call sequence so this is concerning. It's like the field pointer in the varspec (but not other things) has been corrupted in the data structure. Even more confusing, we have no automated connectivities between the ocean and atmosphere. There should not be a chance to get a miswiring like this.

sanAkel commented 2 years ago

This is all very detailed @bena-nasa. Thank you!

Basically it checks this spec_field and asks is it created and goes from there. As you can see I print: is the spec_field created. Then if it is in that if statement I get the name from the spec field and print that along with the short_name, long_name, and Units from the varspec. And here's a sample of what I see corresponding to items in the export state of the plug, notice what happens when you get to the variable TW:

 bmaa is created  T  UW
 bmaa found created UW UW surface_Agrid_eastward_velocity m s-1
 bmaa is created  T  VW
 bmaa found created VW VW surface_Agrid_northward_velocity m s-1
 bmaa is created  T  UWB
 bmaa found created UWB UWB surface_Bgrid_X_velocity m s-1
 bmaa is created  T  VWB
 bmaa found created VWB VWB surface_Bgrid_Y_velocity m s-1
 bmaa is created  T  TW
 bmaa found created TW OH surface_temperature K
 bmaa is created  T  SW
 bmaa found created SW CO_BIOMASS_VOC surface_salinity psu
 bmaa is created  T  MOM_2D_MASK
 bmaa found created MOM_2D_MASK CO_BF_VOC MOM_ocean_mask_at_t-points 1
 bmaa is created  T  AREA
 bmaa found created AREA CO_FS_VOC MOM_ocean_area_at_t-points m+2
 bmaa is created  T  SLV
 bmaa found created SLV SLV sea_level_with_ice_loading_and_invBaro m
 bmaa is created  T  FRAZIL
 bmaa found created FRAZIL CO_CH4 heating_from_frazil_formation W m-2
 bmaa is created  T  MELT_POT
 bmaa found created MELT_POT CO_BF heat_that_can_be_used_to_melt_sea_ice W m-2
 bmaa is created  T  FRZMLT
 bmaa found created FRZMLT CO_FS freeze_melt_potential W m-2
 bmaa is created  T  DUM1
 bmaa found created DUM1 CO_ISOP dummy_export1 1
 bmaa is created  T  DUM2
 bmaa found created DUM2 CO_NVOC dummy_export2 1
 bmaa is created  T  BENBEN
 bmaa found created BENBEN CO_TERP dummy_export2 1

Above does not make sense! Things like OH, CO_CH4 are meaningless "in my (ocean) world"!

Even more confusing, we have no automated connectivities between the ocean and atmosphere. There should not be a chance to get a miswiring like this.

Agree! There are no direct, "automated connections" they have conditions and must go through GCM (via A2O and/or O2A transforms).

bena-nasa commented 2 years ago

A few more data points. It appears the same corruption is observed with GNU and with the Intel release build. This is bizarre, what are the chances that that we screw up memory and have to still point to the same good memory in all these cases :(

sanAkel commented 2 years ago

Not sure why our component (ocean) gets into such mess! 😢

tclune commented 1 year ago

@bena-nasa Yeah - can't be typical memory corruption in this case. At best it is an out-of-bounds array reference that only affects some isolated index or something like that.

atrayano commented 1 year ago

I am suspecting a bug in the ESMF implementation. This would be consistent with Ben's results from Friday (i.e. the code runs when using the static library), and with my experiments. I coded our own DSO implementation, which at the moment I put in the GEOS_OceanGridComp (simply to prototype, debug, etc.). This code runs successfully. Once we review it, we could move it MAPL, and clean the ocean component. My branch is bugfix/atrayano/dso_workaround, of the @GEOS_OceanGridComp repo. Again, this is only a placeholder branch, not a PR.

tclune commented 1 year ago

Putting the permalink to the workaround here so that @rsdunlapiv can maybe comment on what might be different in ESMF under the hood.

https://github.com/GEOS-ESM/GEOS_OceanGridComp/blob/626adbaa1d2415da6c8c523d2954d4f2e9836a0b/GEOS_OceanGridComp.F90#L198-L207

Hmm. Permalink won't show. Maybe because it is a different repo?

mathomp4 commented 1 year ago

Putting the permalink to the workaround here so that @rsdunlapiv can maybe comment on what might be different in ESMF under the hood.

https://github.com/GEOS-ESM/GEOS_OceanGridComp/blob/626adbaa1d2415da6c8c523d2954d4f2e9836a0b/GEOS_OceanGridComp.F90#L198-L207

Hmm. Permalink won't show. Maybe because it is a different repo?

Yeah. GitHub only does the "nice display" in the same repo. But when you click it, it's useful!

mathomp4 commented 1 year ago

I am suspecting a bug in the ESMF implementation. This would be consistent with Ben's results from Friday (i.e. the code runs when using the static library), and with my experiments. I coded our own DSO implementation, which at the moment I put in the GEOS_OceanGridComp (simply to prototype, debug, etc.). This code runs successfully. Once we review it, we could move it MAPL, and clean the ocean component. My branch is bugfix/atrayano/dso_workaround, of the @GEOS_OceanGridComp repo. Again, this is only a placeholder branch, not a PR.

@atrayano If you do port this to MAPL, we might need to be careful with macOS. I think it will work as I see pages on the web about dlopen on mac being a thing, but we'd at least need to keep the logic that fixes up the .so/.dylib business.

Of course, then you see posts about dlopen on macOS being weird 7 years ago...

tclune commented 1 year ago

Indeed, the right bit of code to do this is: https://github.com/GEOS-ESM/MAPL/blob/f3ddd913badac0c17a53e6cd63e9e6e28966cebc/shared/DSO_Utilities.F90#L49-L55

If it is not right, then hopefully it can also be fixed in that layer.

mathomp4 commented 1 year ago

Just to let @sanAkel know, this should now be fixed. @atrayano found the fix and it's been applied to MAPL.

sanAkel commented 1 year ago

Thank you @mathomp4, @atrayano