Closed mathomp4 closed 1 year ago
Also, at the suggestion of @bena-nasa I ran with ESMF logging turned on and we see:
20220916 092028.285 ERROR PET0 ESMF_LocalArrayGet.F90:2484 ESMF_LocalArrayGetData Arguments are incompatible - farrayPtr rank does not match LocalArray rank
20220916 092028.285 ERROR PET0 ESMF_ArrayGet.F90:1888 ESMF_ArrayGetFPtr Arguments are incompatible - Internal subroutine call returned Error
20220916 092028.285 ERROR PET0 ESMF_FieldGet.F90:2513 ESMF_FieldGetDataPtr Arguments are incompatible - Internal subroutine call returned Error
I also did a run with Debugging enabled and we still see:
Surface velocity stagger set in ocean model: (MOM6) CGRID_NE.
MOM6Initialize 873
pe=00003 FAIL at line=00076 GetPointer.H <status=13>
pe=00003 FAIL at line=01645 MAPL_Generic.F90 <status=13>
but also:
Character Resource Parameter: OCEAN_INTERNAL_RESTART_FILE:ocean_internal_rst
Bootstrapping ocean_internal_rst
forrtl: severe (408): fort: (33): Shape mismatch: The extent of dimension 2 of array MASKO is 18 and the corresponding extent of array MASK is 12
Image PC Routine Line Source
GEOSgcm.x 000000000F5659EF Unknown Unknown Unknown
GEOSgcm.x 000000000F4FCC73 geos_oceangridcom 821 GEOS_OceanGridComp.F90
libesmf.so 00002AD9F520C554 _ZN5ESMCI6FTable1 Unknown Unknown
libesmf.so 00002AD9F52100CF ESMCI_FTableCallE Unknown Unknown
It's like the memory state got messed up, but messed up carefully enough that a different array is there...but not the right one???
I am afraid I can't look into this today, but may help if try running with MOM5- the DSO "should" start "acting" if that's the cause??
I am afraid I can't look into this today, but may help if try running with MOM5- the DSO "should" start "acting" if that's the cause??
@sanAkel No need for you to test, this was obviously "caused" by MAPL. Why I have no idea, but we are looking.
Thank you @mathomp4
Maybe sometime soon we should discuss whether we want to simply build a specific ocean model and be done with this (DSO) altogether! It served us well (so far)!
More updates. I made a version of the MAPL_Generic.F90
from release/MAPL-v3
but I brought back the changes from AddChildFromDSO
in f9a99b79 and just shoved them in by hand and that runs.
Glad that it runs! Can it also switch (toggle) between MOM5/MOM6?
Also what about land "stuff"? Isn't something there (land) also using DSO?
Glad that it runs! Can it also switch (toggle) between MOM5/MOM6?
Also what about land "stuff"? Isn't something there (land) also using DSO?
Well, it runs with a bad workaround. @bena-nasa is looking at what's happening in deeper detail. For now I'll keep the code around and we might wait for @tclune to return and talk with him. This kludge shouldn't be necessary!
Well, I figured out the what, of what is going wrong. I'll detail is here. As to the why, this maybe require some brainstorming with @tclune and @atrayano Here's the deal. As Matt found and I've reproduced, in the MOM6_Plug, there's a GetPointer that fails and ESMF claims the pointer rank is different the field rank (which I did confirm). Furthermore we both found that when you turn on the debug build up in the GEOS_OgcmGridComp.F90 you fail on an array shape mismatch.
The initial symptoms seemed to be pointing to the fields not being created correctly.
I first put prints in GetPointer and I added a new export in the Export state of the plug that I called BENBEN. Since I know it can't be connected to anything when I call GetPointer to get it with the other GetPointers in initialize that started this, it should not be allocated and would have to create it. Well the first thing that made no sense was that GetPointer said BENBEN was already created. This just should not be the case since it is involved in no connectivities and not requested by History. So I start putting more prints in the places where we create the fields for the state in the first place.
After putting a lot prints in I've traced it to this: During MAPL_GenericInitialize the import/internal/export states get created ultimately via a call in MAPL_Generic.F90 to MAPL_StateCreateFromSpecNew which basically translates the VarSpec to the actual state.
It just loops over the individual varspecs for the state, gets the short_name, long_name, units, etc. One of the items obtained from the varspec is an ESMF_Field which a local variable spec_field. Inside the VarSpecType is a pointer to an ESMF_Field, the contents of which depend on the connectivity etc during the wiring. Now for the crucial part. Around like 6370 there's this logic where I put in some prints in the loop over the var specs:
6369 isCreated = ESMF_FieldIsCreated(SPEC_FIELD, rc=status)
6370 _VERIFY(status)
6371
6372 if (mapl_am_i_root()) write(*,*)"bmaa is created ",isCreated,' ',trim(short_name)
6373 if (isCreated) then
6374 block
6375 character(len=ESMF_MAXSTR) :: lets_get_fname
6376 call ESMF_FIeldGet(spec_field,name=lets_get_fname,_RC)
6377 if (mapl_am_i_root()) write(*,*)"bmaa found created ",trim(short_name),' ',trim(lets_get_fname),' ',trim(long_name),' ',trim(units)
6378 end block
6379 call MAPL_AllocateCoupling( SPEC_FIELD, RC=status ) ! if 'DEFER' this allocates the data
6380 _VERIFY(status)
Basically it checks this spec_field and asks is it created and goes from there. As you can see I print: is the spec_field created. Then if it is in that if statement I get the name from the spec field and print that along with the short_name, long_name, and Units from the varspec. And here's a sample of what I see corresponding to items in the export state of the plug, notice what happens when you get to the variable TW:
bmaa is created T UW
bmaa found created UW UW surface_Agrid_eastward_velocity m s-1
bmaa is created T VW
bmaa found created VW VW surface_Agrid_northward_velocity m s-1
bmaa is created T UWB
bmaa found created UWB UWB surface_Bgrid_X_velocity m s-1
bmaa is created T VWB
bmaa found created VWB VWB surface_Bgrid_Y_velocity m s-1
bmaa is created T TW
bmaa found created TW OH surface_temperature K
bmaa is created T SW
bmaa found created SW CO_BIOMASS_VOC surface_salinity psu
bmaa is created T MOM_2D_MASK
bmaa found created MOM_2D_MASK CO_BF_VOC MOM_ocean_mask_at_t-points 1
bmaa is created T AREA
bmaa found created AREA CO_FS_VOC MOM_ocean_area_at_t-points m+2
bmaa is created T SLV
bmaa found created SLV SLV sea_level_with_ice_loading_and_invBaro m
bmaa is created T FRAZIL
bmaa found created FRAZIL CO_CH4 heating_from_frazil_formation W m-2
bmaa is created T MELT_POT
bmaa found created MELT_POT CO_BF heat_that_can_be_used_to_melt_sea_ice W m-2
bmaa is created T FRZMLT
bmaa found created FRZMLT CO_FS freeze_melt_potential W m-2
bmaa is created T DUM1
bmaa found created DUM1 CO_ISOP dummy_export1 1
bmaa is created T DUM2
bmaa found created DUM2 CO_NVOC dummy_export2 1
bmaa is created T BENBEN
bmaa found created BENBEN CO_TERP dummy_export2 1
Somehow the spec_field pulled out of the varspec is pointing to a field from GOCART for some of the varspecs! This explains why the fields were either the wrong dimension or the shape of the atmospheric fields. These end up getting used in the MOM6 plug! That's as far as I've gotten.
Now I also looked at Matt's changes to the addchildFromDSO, it really didn't change anything as far as the call sequence so this is concerning. It's like the field pointer in the varspec (but not other things) has been corrupted in the data structure. Even more confusing, we have no automated connectivities between the ocean and atmosphere. There should not be a chance to get a miswiring like this.
This is all very detailed @bena-nasa. Thank you!
Basically it checks this spec_field and asks is it created and goes from there. As you can see I print: is the spec_field created. Then if it is in that if statement I get the name from the spec field and print that along with the short_name, long_name, and Units from the varspec. And here's a sample of what I see corresponding to items in the export state of the plug, notice what happens when you get to the variable TW:
bmaa is created T UW bmaa found created UW UW surface_Agrid_eastward_velocity m s-1 bmaa is created T VW bmaa found created VW VW surface_Agrid_northward_velocity m s-1 bmaa is created T UWB bmaa found created UWB UWB surface_Bgrid_X_velocity m s-1 bmaa is created T VWB bmaa found created VWB VWB surface_Bgrid_Y_velocity m s-1 bmaa is created T TW bmaa found created TW OH surface_temperature K bmaa is created T SW bmaa found created SW CO_BIOMASS_VOC surface_salinity psu bmaa is created T MOM_2D_MASK bmaa found created MOM_2D_MASK CO_BF_VOC MOM_ocean_mask_at_t-points 1 bmaa is created T AREA bmaa found created AREA CO_FS_VOC MOM_ocean_area_at_t-points m+2 bmaa is created T SLV bmaa found created SLV SLV sea_level_with_ice_loading_and_invBaro m bmaa is created T FRAZIL bmaa found created FRAZIL CO_CH4 heating_from_frazil_formation W m-2 bmaa is created T MELT_POT bmaa found created MELT_POT CO_BF heat_that_can_be_used_to_melt_sea_ice W m-2 bmaa is created T FRZMLT bmaa found created FRZMLT CO_FS freeze_melt_potential W m-2 bmaa is created T DUM1 bmaa found created DUM1 CO_ISOP dummy_export1 1 bmaa is created T DUM2 bmaa found created DUM2 CO_NVOC dummy_export2 1 bmaa is created T BENBEN bmaa found created BENBEN CO_TERP dummy_export2 1
Above does not make sense! Things like OH
, CO_CH4
are meaningless "in my (ocean) world"!
Even more confusing, we have no automated connectivities between the ocean and atmosphere. There should not be a chance to get a miswiring like this.
Agree! There are no direct, "automated connections" they have conditions and must go through GCM
(via A2O
and/or O2A
transforms).
A few more data points. It appears the same corruption is observed with GNU and with the Intel release build. This is bizarre, what are the chances that that we screw up memory and have to still point to the same good memory in all these cases :(
Not sure why our component (ocean) gets into such mess! 😢
@bena-nasa Yeah - can't be typical memory corruption in this case. At best it is an out-of-bounds array reference that only affects some isolated index or something like that.
I am suspecting a bug in the ESMF implementation. This would be consistent with Ben's results from Friday (i.e. the code runs when using the static library), and with my experiments. I coded our own DSO implementation, which at the moment I put in the GEOS_OceanGridComp (simply to prototype, debug, etc.). This code runs successfully. Once we review it, we could move it MAPL, and clean the ocean component. My branch is bugfix/atrayano/dso_workaround, of the @GEOS_OceanGridComp repo. Again, this is only a placeholder branch, not a PR.
Putting the permalink to the workaround here so that @rsdunlapiv can maybe comment on what might be different in ESMF under the hood.
Hmm. Permalink won't show. Maybe because it is a different repo?
Putting the permalink to the workaround here so that @rsdunlapiv can maybe comment on what might be different in ESMF under the hood.
Hmm. Permalink won't show. Maybe because it is a different repo?
Yeah. GitHub only does the "nice display" in the same repo. But when you click it, it's useful!
I am suspecting a bug in the ESMF implementation. This would be consistent with Ben's results from Friday (i.e. the code runs when using the static library), and with my experiments. I coded our own DSO implementation, which at the moment I put in the GEOS_OceanGridComp (simply to prototype, debug, etc.). This code runs successfully. Once we review it, we could move it MAPL, and clean the ocean component. My branch is bugfix/atrayano/dso_workaround, of the @GEOS_OceanGridComp repo. Again, this is only a placeholder branch, not a PR.
@atrayano If you do port this to MAPL, we might need to be careful with macOS. I think it will work as I see pages on the web about dlopen
on mac being a thing, but we'd at least need to keep the logic that fixes up the .so
/.dylib
business.
Of course, then you see posts about dlopen
on macOS being weird 7 years ago...
Indeed, the right bit of code to do this is: https://github.com/GEOS-ESM/MAPL/blob/f3ddd913badac0c17a53e6cd63e9e6e28966cebc/shared/DSO_Utilities.F90#L49-L55
If it is not right, then hopefully it can also be fixed in that layer.
Just to let @sanAkel know, this should now be fixed. @atrayano found the fix and it's been applied to MAPL.
Thank you @mathomp4, @atrayano
Last night's (2022-Sep-15) nightly run of MOM6 with MAPL3 showed failure. The first error was:
But that apparently isn't full trapped up the tree, so it was eventually followed by:
The night before it worked with f9a99b79, so therefore the difference in MAPL is https://github.com/GEOS-ESM/MAPL/compare/f9a99b79...fcd7a312 (as fcd7a312 is the hash of MAPL3 as of this writing).
I have looked at the complete source code between two night's ago and last night and the only differences are in MAPL and they are the differences above.
As a test, I did a clone of MAPL3 GEOS and checked out f9a99b79 and built that and it does indeed work.
The thing is...this is a very boring chance to MAPL_Generic. It is just what is in #1672 which is...just removing duplicated code! I took a look with @bena-nasa and we do not see any mistake in that change (in
develop
orrelease/MAPL-v3
.I am...baffled.
I'll mention @tclune as well as @sanAkel just because it is MOM6 related...ish.