GEOS-ESM / GEOSgcm

GEOS Earth System Model GEOSgcm Fixture
Apache License 2.0

GEOSgcm Coupled Model Failing at NAS #766

Open mathomp4 opened 9 months ago

mathomp4 commented 9 months ago

After trying to fix up issues with the nightly tests at NAS and getting them working again, I've now found that the C12 MOM6 run at NAS is failing with:

MPT ERROR: Cannot create more than 2048 RMA windows.

As far as I can see from the logs, it was working on the 21st of February, so that would imply that v11.5.1 worked. I'll test to make sure.

That said, as far as I can remember, I don't think we've changed much in GEOS regarding the coupled model. There is a new MOM6 from @sanAkel, but I don't see any one-sided MPI in MOM6 proper before or after the update.

Now, one suspicious part is that it is failing at 21z, when a big HISTORY write occurs: these are the time-averaged collections with a ref_time of 21z. So it's roughly the same set of collections as in an AMIP run (the time-averaged ocean collections, I think, have a ref_time of 0z -- or rather use the default).

I'll consult with @bena-nasa and @atrayano on this.

sanAkel commented 9 months ago

@mathomp4

  1. Comment out writes via history and see what happens.
  2. For the low-res case, since we run it for at most 1 day, I only write 3-hourly prog and sfc collections.

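For anyone following along: disabling a HISTORY write in GEOS amounts to commenting a collection out of the COLLECTIONS list in HISTORY.rc. A sketch of that edit — the collection names here are illustrative, not the actual C12 MOM6 set:

```
COLLECTIONS: 'geosgcm_prog'
             'geosgcm_surf'
#            'geosgcm_ocn3d'
           ::
```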
mathomp4 commented 9 months ago

@sanAkel I just tried the first (my first thought) and yep, it's fine. So that points to History aka MAPL. One of the changes in v11.5.2 was moving to MAPL 2.44. I don't recall any big one-sided changes in that, but then the innards of MAPL are a mysterious black box to me.

As for the second, when I run MOM6 nightly, I run it like I do the AMIP runs: turn on all the history (i.e., back to the old ways before monthly-by-default collections).

mathomp4 commented 9 months ago

I'm doing a test now of the current GEOSgcm with MAPL 2.43.2 to see if MAPL 2.44 caused this. But back on Feb 21, MOM6 + MAPL develop worked, so if it is MAPL, it must be something added to MAPL in the last few weeks.

mathomp4 commented 9 months ago

I might invoke @marshallward here as the MOM6 guru I know of. Mainly: was there a change in MOM6 such that it now uses more RMA via FMS? https://github.com/mom-ocean/MOM6/pull/1616 looks "benign" to me in terms of MPI (heck, MOM6 doesn't do much MPI at all), but maybe something in there now does more halo updates in FMS, and HISTORY just adds enough extra RMA on top to trigger the MPT limit? 🤷🏼

atrayano commented 9 months ago

I had come across a similar issue. I solved mine by changing cmake/compiler/flags/Intel_Fortran.cmake, effectively doing:

set (COREAVX2_FLAG "")
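In context, the edit replaces the AVX2 option with an empty string so that flag is never passed to the compiler. A sketch of the file, where the original value shown is an assumption (Intel compilers typically use something like `-axCORE-AVX2`), not the literal line from ESMA_cmake:

```cmake
# cmake/compiler/flags/Intel_Fortran.cmake (sketch)
# set (COREAVX2_FLAG "-axCORE-AVX2")   # original (assumed value)
set (COREAVX2_FLAG "")                 # workaround: drop the AVX2 flag
```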

mathomp4 commented 9 months ago

Update: if you build GEOSgcm but with MAPL 2.43.2 instead, it doesn't crash.

I'm now going to try GEOSgcm with MAPL 2.44.0 but with the older Ocean/MOM6 before https://github.com/GEOS-ESM/GEOSgcm/pull/760 came in. That should narrow it down.

I mean, I can't think of anything else that could be relevant in recent updates.

marshallward commented 9 months ago

I'm not sure if I understand the problem, but there is no one-sided communication in FMS (which handles all of our MPI comms) and I doubt that the MOM6 communication burden has increased in any meaningful way. At most, there may be a change in the number of halo updates.

Maybe some of the default configurations have flipped from FMS1 to FMS2, but you may already be explicitly setting this to one or the other. Even then, there has been virtually no work on the MPI layer in FMS.

This looks like a very system-specific problem, but let me know if there is anything I can do to help.

mathomp4 commented 9 months ago

Okay. I just tried GEOSgcm + MAPL 2.44 + MOM6 geos/v2.2.3 and it fails. So it looks like it is a MAPL 2.44 + MPT + Coupled thing. GFDL is being nice with MPT.

Time to run more tests.