GEOS-ESM / GEOSgcm

GEOS Earth System Model GEOSgcm Fixture
Apache License 2.0

Error with MOM5 with current main #352

Closed mathomp4 closed 2 years ago

mathomp4 commented 3 years ago

The current GEOSgcm main (as of today) seems to have an issue running MOM5. But as near as I can tell nothing has fundamentally changed with MOM5! There were some changes to GEOSgcm_App from Ricardo, but I tested those in a MOM5 run yesterday and it seemed to work. I also did an xxdiff between a working run from last night and the current test, and pretty much all the differences are in whitespace!

For example, in this file (NOTE: the SLURM file name will change every night):

/discover/nobackup/mathomp4/SystemTests/runs/AGCM_MOM5/c90_MOM5_GOCART/CURRENT/run/1day/slurm-47377862.out

the error is:

 Integer*4 Resource Parameter: CICE_NDYN_DT:1
                                  Memuse(MB) at SEAICEMAPL_GenericInitialize=  3.498E+02  3.498E+02  3.318E+02  3.375E+02  0.000E+00
                                                           Mem/Swap Used (MB) at SEAICEMAPL_GenericInitialize=  1.548E+04  0.000E+00
                                                CommitLimit/Committed_AS (MB) at SEAICEMAPL_GenericInitialize=  1.801E+05  4.323E+04
NOTE from PE     0: diag_manager_mod::diag_manager_init: prepend_date only supported when diag_manager_init is called with time_init present.
MOMInitialize                                  877
MOMInitialize                                  877

Looking in GEOSgcm_GridComp:

! Check local sizes of two horizontal dimensions
!-----------------------------------------------

    call mom4_get_dimensions(isc, iec, jsc, jec, nk_out=LM)
    call MAPL_GridGet(GRID, localCellCountPerDim=counts, RC=status)
    VERIFY_(STATUS)

    IM=iec-isc+1
    JM=jec-jsc+1

    ASSERT_(counts(1)==IM)
    ASSERT_(counts(2)==JM)

the line in question (877) is:

    ASSERT_(counts(1)==IM)

This error did not happen in MAPL develop tests last night, so I can't blame MAPL.

I will probably be consulting @yvikhlya and @sanAkel about this.

mathomp4 commented 3 years ago

Well, I ran MOM5 at NAS on this "day of no NCCS" and it's duplicable.

To try and debug, first, I added some prints in that chunk of code above:

    write(*,*) "counts(1): ", counts(1)
    write(*,*) "counts(2): ", counts(2)
    write(*,*) "iec: ", iec, "isc: ", isc
    write(*,*) "jec: ", jec, "jsc: ", jsc  ! the runs logged below printed iec/isc here by mistake
    write(*,*) "IM :", IM
    write(*,*) "JM :", JM

Here counts is from MAPL and iec, jec, etc. come from MOM5 (via FMS I guess?). In the good run (v10.19.4):

 counts(1):           10
 counts(2):           20
 iec:          290 isc:          281
 jec:          290 jsc:          281
 IM :          10
 JM :          20

and on a bad run (main):

 counts(1):           10
 counts(2):           20
 iec:            0 isc:            0
 jec:            0 jsc:            0
 IM :           1
 JM :           1

So for some reason, MOM/FMS is returning...no grid? Or something.

And again...we did not touch MOM5. Or FMS.

And Ricardo's App changes are pretty boring. I did a test where I made an experiment in main but used the GEOSgcm.x from v10.19.4 and that is happy, so I can't see there being an issue in the gcm_setup phase.

yvikhlya commented 3 years ago

Looks like a problem with decomposition. isc, jsc, iec, jec are the start and end indexes of the compute domain. IM and JM are the sizes of the domain.
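
For concreteness, the failing check can be worked through by hand. A minimal shell sketch using only the i-direction values from the two print-outs above; local_extent is a made-up helper name, not something in the model code:

```shell
# Hypothetical helper: size of a compute domain from its inclusive
# start and end indices, mirroring IM = iec - isc + 1 in the quoted snippet.
local_extent() {
  echo $(( $2 - $1 + 1 ))
}

counts1=10                       # counts(1) reported by MAPL in both runs

good_im=$(local_extent 281 290)  # isc=281, iec=290 (v10.19.4)
bad_im=$(local_extent 0 0)       # isc=0,   iec=0   (main)

for im in "$good_im" "$bad_im"; do
  if [ "$im" -eq "$counts1" ]; then
    echo "IM=$im matches counts(1)=$counts1"
  else
    echo "IM=$im differs from counts(1)=$counts1, so ASSERT_ fails"
  fi
done
```

With zeroed indices the computed extent collapses to 1, which is why the assertion on counts(1)==IM trips even though MAPL's counts are unchanged.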

mathomp4 commented 3 years ago

> Looks like a problem with decomposition. isc, jsc, iec, jec are the start and end indexes of the compute domain. IM and JM are the sizes of the domain.

@yvikhlya I agree but...nothing changed. My hope was that input.nml or something that helps control MOM got screwed up, but nope. Exactly the same!

My current fear is that this is one of those "We changed the memory state and things are running differently" things. As in, we can add a print statement in MOM5 somewhere and all will work again!

sanAkel commented 3 years ago

@mathomp4 I can try to look into it with you on Monday next week, in a debug build.

sanAkel commented 2 years ago

We will get back to this issue at a later date. @mathomp4 closing it for now, please feel free to reopen if needed. Thanks!

yvikhlya commented 2 years ago

Seems like ocean_model_init did not run properly and grid dimensions which are returned by mom4_get_dimensions are junk. Now, why is that?

sanAkel commented 2 years ago

> Seems like ocean_model_init did not run properly and grid dimensions which are returned by mom4_get_dimensions are junk. Now, why is that?

It may be easier or faster to know why using 1-deg resolution that @mathomp4 says has the same problem.

yvikhlya commented 2 years ago

I already have a 0.25-degree setup and an interactive session, so I don't see a need to switch to 1 degree.

yvikhlya commented 2 years ago

The last successful run with MOM5 I did was with v10.14.1 about 2 years ago. Something got broken since then.

yvikhlya commented 2 years ago

@mathomp4 Do you have any suggestions for how to debug this? I can't think of anything better than putting printouts inside of ocean_model_init.

mathomp4 commented 2 years ago

@yvikhlya Not really. When this happened I was just confused. It just sort of "happened" one night and the only changes I could see were whitespace changes! It was like all of a sudden the system decided to do this.

I suppose one possible thought is to try a run with GNU? Maybe it will show a different error? I am not sure.

yvikhlya commented 2 years ago

@mathomp4 Unrelated issue, but I can't push stuff to GitHub today. I have my SSH RSA key uploaded to GitHub and I was always able to push without a password, but today it asks me for a password and then says that I need an access token. How do you use GitHub these days?

mathomp4 commented 2 years ago

> @mathomp4 Unrelated issue, but I can't push stuff to GitHub today. I have my SSH RSA key uploaded to GitHub and I was always able to push without a password, but today it asks me for a password and then says that I need an access token. How do you use GitHub these days?

If you are seeing "access token", you might have cloned the https URL instead of the SSH. You can run git remote -v to see what you have in that repo to confirm.

If that happened, you can switch your remote url with:

git remote set-url origin git@github.com:GEOS-ESM/GEOSgcm.git

where you change that to whichever repo you are in.

Now, if you are like me and never want an HTTPS url from github ever again, you can run:

git config --global url."git@github.com:".insteadOf "https://github.com/"

and from now on, git will always clone with SSH from github even if you accidentally pass it an HTTPS one!
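
The insteadOf setting amounts to a plain prefix substitution on the URL. A minimal sketch of the same rewrite done by hand with sed, where to_ssh_url is a made-up helper name rather than a git command:

```shell
# Hypothetical helper: rewrite a GitHub HTTPS URL to its SSH form,
# the same prefix substitution that the insteadOf setting asks git to do.
to_ssh_url() {
  printf '%s\n' "$1" | sed 's#^https://github\.com/#git@github.com:#'
}

to_ssh_url "https://github.com/GEOS-ESM/GEOSgcm.git"

# Inside a clone, the rewritten URL could be applied like this
# (commented out so the sketch has no side effects):
#   git remote set-url origin "$(to_ssh_url "$(git remote get-url origin)")"
```

A URL that is already in SSH form does not match the pattern and passes through unchanged, so the helper is safe to run on any remote.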

yvikhlya commented 2 years ago

@mathomp4 Thanks! That was it.

sanAkel commented 2 years ago

> @mathomp4 Do you have any suggestions for how to debug this? I can't think of anything better than putting printouts inside of ocean_model_init.

Well, 2 suggestions:

  1. Use the debugger; the debug build worked for me a few months ago.
  2. Again, please use the 1-deg version. That will be easier to work with.

yvikhlya commented 2 years ago

@mathomp4 There is something wrong here. A printout from MOM5 run:

NOTE from PE     0: callTree: ---> ocean_model_init(), ocean_model_MOM.F90

ocean_model_MOM.F90 is part of MOM6, not MOM5. MOM5 should find ocean_model_init() in ocean_model.F90. There is a name collision here.

P.S. Just verified that it runs MOM_GEOS5PlugMod.F90 (MOM5), but ocean_model_init from MOM6, not from MOM5.
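
One way to check for such a collision is to list which shared libraries export a matching symbol. A rough sketch, assuming GNU nm is available; definers_of is a hypothetical helper, and the mangled symbol names in the sample input are invented for illustration (real names depend on the compiler, and the sample does not show what the actual libraries contain):

```shell
# Hypothetical helper: read `nm -A` style lines ("file:address type name")
# on stdin and list each file that defines, in its text section (type T or t),
# a symbol whose lower-cased name matches the pattern given as $1.
definers_of() {
  awk -v pat="$1" '
    $2 ~ /^[Tt]$/ && tolower($3) ~ pat {
      sub(/:[0-9A-Fa-f]*$/, "", $1)   # strip ":address", keep the file name
      print $1
    }' | LC_ALL=C sort -u
}

# Made-up sample input standing in for real nm output.
definers_of ocean_model_init <<'EOF'
libmom.so:0000000000001000 T ocean_model_mod_mp_ocean_model_init_
libmom6.so:0000000000002000 T ocean_model_mod_mp_ocean_model_init_
libmom6.so:0000000000003000 T some_other_routine_
EOF

# Against a real install, the input would come from something like:
#   nm -A -D --defined-only "$GEOSDIR"/lib/libmom*.so | definers_of ocean_model_init
```

If more than one library is listed, the dynamic loader resolves the name to whichever definition it encounters first in its search order, which would be consistent with the MOM5 plug ending up in the MOM6 routine.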

sanAkel commented 2 years ago

> @mathomp4 There is something wrong here. A printout from MOM5 run:
>
> NOTE from PE     0: callTree: ---> ocean_model_init(), ocean_model_MOM.F90
>
> ocean_model_MOM.F90 is part of MOM6, not MOM5. MOM5 should find ocean_model_init() in ocean_model.F90. There is a name collision here.
>
> P.S. Just verified that it runs MOM_GEOS5PlugMod.F90 (MOM5), but ocean_model_init from MOM6, not from MOM5.

Hmm! Maybe that shared object library (DSO) stuff is hitting us again?

mathomp4 commented 2 years ago

We might need to add back LD_PRELOAD?

yvikhlya commented 2 years ago

@mathomp4 Could you remind me how to use LD_PRELOAD in csh? It works in bash for me:

$ LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
        /home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so (0x00002b81fdeef000)
        libmom6.so => /home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom6.so (0x00002b82129ee000)

But it gives an error in csh:

> LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so: Command not found.
> set LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
set: Variable name must contain alphanumeric characters.

mathomp4 commented 2 years ago

@yvikhlya You have to use env:

env LD_PRELOAD=${GEOSDIR}/lib/libmom5.so ...
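
The reason this works is that `VAR=value command` is Bourne-shell syntax, which csh does not parse, whereas env is an ordinary program and so behaves the same from any shell. A small sketch with a harmless, made-up variable name (DEMO_VAR) standing in for LD_PRELOAD:

```shell
# env sets the variable only in the environment of the command it launches.
env DEMO_VAR=preloaded sh -c 'echo "child sees: $DEMO_VAR"'

# The calling shell's own environment is untouched
# (assuming DEMO_VAR was not already set there).
echo "parent sees: ${DEMO_VAR:-<unset>}"
```

In csh the other option is setenv, which keeps the variable set for everything that follows in the script rather than for a single command.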

yvikhlya commented 2 years ago

LD_PRELOAD works! MOM5 initialized correctly. If this is a solution we are going to use, we need to update gcm_run.j and submit a PR (I can do it).

The model crashed in the land component, though, with this error:

<CATCH_INTERNAL_RST is NOT consistent with VEGDYN Data>

This is a whole separate issue, something is wrong with restarts which we generated with @sanAkel last week. I am investigating this issue.

mathomp4 commented 2 years ago

> LD_PRELOAD works! MOM5 initialized correctly. If this is a solution we are going to use, we need to update gcm_run.j and submit a PR (I can do it).

Nice! I suppose a simple "If MOM5, add LD_PRELOAD" can work.
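
A sketch of that logic, written in POSIX sh only to show the idea; gcm_run.j itself is csh, so this would need translating, and launch_prefix, its arguments, and the example path are all hypothetical. The library file name is passed in because the thread shows both libmom.so (in the ldd output) and libmom5.so (in the env example):

```shell
# Hypothetical helper: print the command prefix for launching the model.
# $1 = ocean model name, $2 = install directory, $3 = MOM5 library file name.
launch_prefix() {
  if [ "$1" = "MOM5" ]; then
    printf 'env LD_PRELOAD=%s/lib/%s\n' "$2" "$3"
  fi
}

launch_prefix MOM5 /path/to/install libmom.so
launch_prefix MOM6 /path/to/install libmom.so   # prints nothing
```

Keeping the preload conditional means runs with any other ocean model launch exactly as before.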

> The model crashed in the land component, though, with this error:
>
> <CATCH_INTERNAL_RST is NOT consistent with VEGDYN Data>

Ouch. Yeah. That's when I start asking around!

sanAkel commented 2 years ago

I can confirm that works for both: