GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

MAPL Develop runs crashing at NAS #2515

mathomp4 closed this issue 10 months ago

mathomp4 commented 10 months ago

As I come back from the break, my first task is all ready to go. All the MAPL develop (and thus MAPL3) runs are dying in my nightly tests at NAS.

Well, not all. The mom6 runs seem to run. So that provides a clue.

My guess is that this is related to our now running with MPI_THREAD_MULTIPLE by default (via @aoloso's PR). Should be fixable... I hope.

The crash comes right after the History rc files are read:


 Reading HISTORY RC Files:
 -------------------------
 NOT using buffer I/O for file: HISTORY.rc
 NOT using buffer I/O for file: geosgcm_prog.rcx
 NOT using buffer I/O for file: geosgcm_surf.rcx
 NOT using buffer I/O for file: geosgcm_ocn.rcx
 NOT using buffer I/O for file: geosgcm_moist.rcx
 NOT using buffer I/O for file: geosgcm_turb.rcx
 NOT using buffer I/O for file: geosgcm_gwd.rcx
 NOT using buffer I/O for file: geosgcm_tend.rcx
 NOT using buffer I/O for file: geosgcm_budi.rcx
 NOT using buffer I/O for file: geosgcm_buda.rcx
 NOT using buffer I/O for file: geosgcm_landice.rcx
 NOT using buffer I/O for file: geosgcm_meltwtr.rcx
 NOT using buffer I/O for file: geosgcm_snowlayer.rcx
 NOT using buffer I/O for file: geosgcm_tracer.rcx
 NOT using buffer I/O for file: tavg2d_aer_x.rcx
 NOT using buffer I/O for file: tavg3d_aer_p.rcx
 NOT using buffer I/O for file: HISTORY.rc
...
MPT ERROR: Could not register RMA window with the HCA. There may not be
    enough memory.
MPT ERROR: Assertion failed at xp.c:188: "att != (void *)-1"

mathomp4 commented 10 months ago

What we know: MAPL 2.42.3 works, MAPL develop (as of 2024-01-02) does not.

Tests to be done:

See where the failure first occurs.
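
One way to find where the failure first occurs is a binary search over the commits between the known-good and known-bad versions. The sketch below is only an illustration: the commit labels and the `is_bad` check are hypothetical stand-ins for building MAPL at a commit and running the failing test, and it assumes that every commit after the first bad one also fails.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and the failure persists once it
    has appeared.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # failure is already present at mid
        else:
            lo = mid  # failure was introduced after mid
    return commits[hi]


# Hypothetical history: only the two endpoints come from this thread.
history = ["2.42.3", "c1", "c2", "c3", "c4", "develop-2024-01-02"]

# Hypothetical check: pretend the failure was introduced at "c3".
culprit = first_bad(history, lambda c: history.index(c) >= 3)
print(culprit)
```

In practice `git bisect` performs the same search directly on the repository history, with each test run marking a commit good or bad.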

bena-nasa commented 10 months ago

I wonder if this is a single- vs. multiple-node issue. The model (v11.4.0) with MAPL develop at c24 on 2x12 ran just fine past History, but c24 on 3x24 gave the same error as reported in the first post.
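
A quick way to check the single- vs. multiple-node idea is to count the nodes each layout needs. This is a rough sketch under stated assumptions: it treats a layout such as 2x12 as NX x NY with one MPI rank per core, ignores any extra ranks (for example, I/O servers), and uses a hypothetical 40 cores per node, which is not taken from this thread.

```python
import math


def nodes_needed(nx: int, ny: int, cores_per_node: int) -> int:
    """Number of nodes required for an NX x NY layout, one rank per core."""
    ranks = nx * ny
    return math.ceil(ranks / cores_per_node)


# Assumption: 40 cores per node. Substitute the real value for the
# node type actually used in the runs.
CORES_PER_NODE = 40

for nx, ny in [(2, 12), (3, 24)]:
    n = nodes_needed(nx, ny, CORES_PER_NODE)
    print(f"{nx}x{ny}: {nx * ny} ranks on {n} node(s)")
```

Under that assumed core count, the 2x12 layout fits on one node while 3x24 spans two, which would be consistent with the failure appearing only in multi-node runs.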

On the other hand, I ran ExtDataDriver.x on multiple nodes and that ran, so I guess it is time to figure out what in the real History RC is triggering this.

tclune commented 10 months ago

One possibility is that something in the new ESMF support for SSI is breaking under MPT ...

mathomp4 commented 10 months ago

Well, @bena-nasa and I found a fix with MPT flags here:

https://github.com/GEOS-ESM/GEOSgcm_App/pull/553