Closed mathomp4 closed 2 years ago
Well, I ran MOM5 at NAS on this "day of no NCCS" and it's duplicable.
To try and debug, first, I added some prints in that chunk of code above:
write(*,*) "counts(1): ", counts(1)
write(*,*) "counts(2): ", counts(2)
write(*,*) "iec: ", iec, "isc: ", isc
write(*,*) "jec: ", iec, "jsc: ", isc
write(*,*) "IM :", IM
write(*,*) "JM :", JM
Here counts is from MAPL and iec, jec, etc. come from MOM5 (via FMS I guess?). In the good run (v10.19.4):
counts(1): 10
counts(2): 20
iec: 290 isc: 281
jec: 290 jsc: 281
IM : 10
JM : 20
and on a bad run (main):
counts(1): 10
counts(2): 20
iec: 0 isc: 0
jec: 0 jsc: 0
IM : 1
JM : 1
So for some reason, MOM/FMS is returning...no grid? Or something.
And again...we did not touch MOM5. Or FMS.
And Ricardo's App changes are pretty boring. I did a test where I made an experiment in main but used the GEOSgcm.x from v10.19.4 and that is happy, so I can't see there being an issue in the gcm_setup phase.
Looks like a problem with decomposition. isc, jsc, iec, jec are the start and end indexes of the compute domain. IM, JM is the size of the domain.
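As a rough check on the numbers printed earlier, here is a minimal sketch that assumes the local size is the inclusive index range, end minus start plus one. That relation is an assumption (the thread does not show where IM and JM are computed), and `extent` is a hypothetical helper. It is consistent with the i-direction values in both runs: 281..290 gives 10 in the good run and 0..0 gives 1 in the bad run. The j direction cannot be checked from the printed output because the jec/jsc print statement, as posted, passes iec and isc.

```shell
#!/bin/sh
# Number of points in an inclusive index range [start, end].
# $1: start index, $2: end index
extent() {
    echo $(( $2 - $1 + 1 ))
}

extent 281 290   # good run, i direction (isc=281, iec=290)
extent 0 0       # bad run, i direction (isc=0, iec=0)
```

So an IM of 1 together with zeroed indexes suggests the decomposition was never filled in, not that it was filled in with a wrong layout.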
@yvikhlya I agree but...nothing changed. My hope was that input.nml or something that helps control MOM got screwed up, but nope. Exactly the same!
My current fear is that this is one of those "We changed the memory state and things are running differently" things. As in, we can add a print statement in MOM5 somewhere and all will work again!
@mathomp4 I can try to look into it with you on Monday next week - in a debug build.
We will get back to this issue at a later date. @mathomp4 closing it for now, please feel free to reopen if needed. Thanks!
Seems like ocean_model_init did not run properly and grid dimensions which are returned by mom4_get_dimensions are junk. Now, why is that?
It may be easier or faster to find out why using the 1-degree resolution, which @mathomp4 says has the same problem.
I already have the 0.25-degree case set up and an interactive session, so I don't see a need to switch to 1 degree.
The last successful run with MOM5 I did was with v10.14.1 about 2 years ago. Something got broken since then.
@mathomp4 Do you have any suggestion how to debug this? I can't think of anything better than putting printouts inside of ocean_model_init.
@yvikhlya Not really. When this happened I was just confused. It just sort of "happened" one night and the only changes I could see were whitespace changes! It was like all of a sudden the system decided to do this.
I suppose one possible thought is to try a run with GNU? Maybe it will show a different error? I am not sure.
@mathomp4 Unrelated issue, but I can't push stuff to GitHub today. I have my ssh rsa key uploaded to GitHub and I was always able to push without a password, but today it asks me for a password and then says that I need an access token. How do you use GitHub these days?
If you are seeing "access token", you might have cloned the https URL instead of the SSH. You can run git remote -v to see what you have in that repo to confirm.
If that happened, you can switch your remote url with:
git remote set-url origin git@github.com:GEOS-ESM/GEOSgcm.git
where you change that to whichever repo you are in.
Now, if you are like me and never want an HTTPS url from github ever again, you can run:
git config --global url."git@github.com:".insteadOf "https://github.com/"
and from now on, git will always clone with SSH from github even if you accidentally pass it an HTTPS one!
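For a one-off fix without retyping the repo name, the HTTPS-to-SSH rewrite above can be sketched as a small helper. `to_ssh_url` is a hypothetical function name, and the sketch only handles URLs that start with https://github.com/; anything else is passed through unchanged.

```shell
#!/bin/sh
# Convert a GitHub HTTPS remote URL to its SSH form.
# URLs that do not start with https://github.com/ are printed unchanged.
to_ssh_url() {
    case $1 in
        https://github.com/*)
            printf 'git@github.com:%s\n' "${1#https://github.com/}" ;;
        *)
            printf '%s\n' "$1" ;;
    esac
}

# Example: show the SSH form of an HTTPS URL.
to_ssh_url "https://github.com/GEOS-ESM/GEOSgcm.git"
```

Inside a clone, the result could then be fed to git remote set-url origin, using git remote get-url origin to read the current value.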
@mathomp4 Thanks! That was it.
@mathomp4 Do you have any suggestion how to debug this? I can't think of anything better than putting printouts inside of ocean_model_init.
Well, 2 suggestions:
@mathomp4 There is something wrong here. A printout from the MOM5 run:
NOTE from PE 0: callTree: ---> ocean_model_init(), ocean_model_MOM.F90
ocean_model_MOM.F90 is a part of MOM6, not MOM5. MOM5 should search for ocean_model_init() in ocean_model.F90. There is a name collision here.
P.S. Just verified that it runs MOM_GEOS5PlugMod.F90 (MOM5), but ocean_model_init from MOM6, not from MOM5.
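One way to confirm a collision like this is to list which shared libraries export a symbol containing ocean_model_init. The following is a minimal sketch, assuming GNU binutils nm is available and that the libraries live in the install lib directory. `find_symbol_providers` is a hypothetical helper. Fortran module symbols are mangled differently by each compiler, so it does a case-insensitive substring match instead of looking for an exact name.

```shell
#!/bin/sh
# Print every shared library in a directory that defines a dynamic symbol
# whose name contains the given pattern (case-insensitive).
# $1: directory holding *.so files, $2: symbol name pattern
find_symbol_providers() {
    for lib in "$1"/*.so; do
        [ -e "$lib" ] || continue          # skip when the glob matched nothing
        if nm -D --defined-only "$lib" 2>/dev/null | grep -qi -- "$2"; then
            echo "$lib"
        fi
    done
}

# Example: GEOSDIR is the install directory; falls back to the current directory.
find_symbol_providers "${GEOSDIR:-.}/lib" ocean_model_init
```

If more than one library is listed, the dynamic linker resolves the name to whichever one it loads first, which would explain MOM5 ending up in the MOM6 routine.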
Hmm! Maybe that shared object lib/DSO stuff is hitting us again?
We might need to add back LD_PRELOAD?
@mathomp4 Could you remind me how to use LD_PRELOAD in csh? It works in bash for me:
$ LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so (0x00002b81fdeef000)
libmom6.so => /home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom6.so (0x00002b82129ee000)
But gives error in csh:
> LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so: Command not found.
> set LD_PRELOAD=/home/yvikhlia/aogcm/coupled/S2Sv4/update030622/GEOSgcm/install-Debug/lib/libmom.so ldd GEOSgcm.x | grep libmom
set: Variable name must contain alphanumeric characters.
@yvikhlya You have to use env:
env LD_PRELOAD=${GEOSDIR}/lib/libmom5.so ...
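The reason env helps: the VAR=value command prefix is Bourne-shell syntax that csh does not understand, and csh's set defines a shell variable, not an environment variable. env is an ordinary program, so it works from any shell and sets the variable for that one command only. A minimal sketch, using a harmless stand-in variable (DEMO_VAR is hypothetical) in place of LD_PRELOAD:

```shell
#!/bin/sh
# env sets DEMO_VAR for this single command only; the calling shell is unaffected.
# The same line works unchanged when typed at a csh or tcsh prompt.
env DEMO_VAR=hello sh -c 'echo "DEMO_VAR is $DEMO_VAR"'
```

In csh, setenv would also work, but it keeps the variable set for every later command in the session, which is usually not wanted for LD_PRELOAD.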
LD_PRELOAD works! MOM5 initialized correctly. If this is a solution we are going to use, we need to update gcm_run.j and submit a PR (I can do it).
The model crashed in land component though with error:
<CATCH_INTERNAL_RST is NOT consistent with VEGDYN Data>
This is a whole separate issue, something is wrong with restarts which we generated with @sanAkel last week. I am investigating this issue.
LD_PRELOAD works! MOM5 initialized correctly. If this is a solution we are going to use, we need to update gcm_run.j and submit a PR (I can do it).
Nice! I suppose a simple "If MOM5, add LD_PRELOAD" can work.
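The "If MOM5, add LD_PRELOAD" idea could look roughly like the sketch below. It is written in POSIX sh for illustration only; judging by the csh question above, gcm_run.j itself would need the csh equivalent. `preload_prefix` and the model names passed to it are hypothetical, and the library path is taken as an argument because the thread mentions both libmom.so and libmom5.so.

```shell
#!/bin/sh
# Print an "env LD_PRELOAD=..." command prefix only when the ocean model is MOM5.
# $1: ocean model name (hypothetical values: MOM5, MOM6)
# $2: full path to the MOM5 shared library to preload
preload_prefix() {
    if [ "$1" = "MOM5" ]; then
        printf 'env LD_PRELOAD=%s\n' "$2"
    fi
}

# Example: GEOSDIR is the install directory; the fallback path is a placeholder.
preload_prefix MOM5 "${GEOSDIR:-/path/to/install}/lib/libmom.so"
```

For any other model the function prints nothing, so the run command is left untouched.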
The model crashed in land component though with error:
<CATCH_INTERNAL_RST is NOT consistent with VEGDYN Data>
Ouch. Yeah. That's when I start asking around!
The current GEOSgcm main (as of today) seems to have an issue running MOM5. But as near as I can tell nothing has fundamentally changed with MOM5! There were some changes to GEOSgcm_App from Ricardo, but I tested those in a MOM5 run yesterday and it seemed to work. I also did an xxdiff between a working run from last night and the current test and pretty much all the differences are in whitespace!
To wit, as an example, in this file (NOTE the SLURM file name will change every night):
the error is:
Looking in GEOSgcm_GridComp:
the line in question (877) is:
This error did not happen in MAPL develop tests last night, so I can't blame MAPL.
I will probably be consulting @yvikhlya and @sanAkel about this.