GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

GNU MAPL3 Release Runtime Failure #2891

Open mathomp4 opened 4 months ago

mathomp4 commented 4 months ago

My nightly tests show that GEOSgcm running MAPL3, built with GNU in Release mode, crashes at the end of ExtData:

        EXTDATA: INFO: TR_regionMask updated L bracket with: ExtData/g5chem/sfc/RADON.region_mask.x540_y361.2001.nc at time index   1
        EXTDATA: INFO: TR_regionMask updated R bracket with: ExtData/g5chem/sfc/RADON.region_mask.x540_y361.2001.nc at time index   1
[borgi187:41259] *** An error occurred in MPI_Wait
[borgi187:41259] *** reported by process [2683437057,0]
[borgi187:41259] *** on communicator MPI COMMUNICATOR 22 CREATE FROM 21
[borgi187:41259] *** MPI_ERR_TRUNCATE: message truncated
[borgi187:41259] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borgi187:41259] ***    and potentially your MPI job)
>> Error << /discover/swdev/gmao_SIteam/MPI/openmpi/4.1.5/gcc-13.1.0/bin/mpirun  -np 96 /discover/nobackup/mathomp4/SystemTests/runs/AGCM_GNUMAPL3/c24_O1_GOCART/CURRENT/run/1day/scratch/GEOSgcm.x --logging_config logging.yaml: status = 15; at /gpfsm/dnb34/mathomp4/SystemTests/builds/AGCM_GNUMAPL3/CURRENT/GEOSgcm/install-Release/bin/esma_mpirun line 377.
GEOSgcm Run Status: -1

I looked back and this was working as of June 23, failing on June 24.

Not much has gone into MAPL3 since then, mainly stuff from @metdyn ... but I'm not exercising that!

mathomp4 commented 4 months ago

Confirmed that 27d47d4a3b4b9e2f426186a02a74cc4649432f43 works but 4486933fe6e7f0fcf7a122da7d65e11b650d87f0 does not. So something between causes it:

https://github.com/GEOS-ESM/MAPL/compare/27d47d4a3b4b9e2f426186a02a74cc4649432f43...4486933fe6e7f0fcf7a122da7d65e11b650d87f0

but that is (essentially) #2838 via #2888. And I'm not using any of the @metdyn code! Aaaa!
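Narrowing a known-good/known-bad pair like the one above is a binary search over the commits in between, which is what `git bisect` automates. Here is a minimal sketch of the idea in Python. The commit names and the `is_good` callback (standing in for "build with GNU Release and run the 1-day test") are hypothetical, and it assumes each commit fails or passes deterministically, which may not hold if this is a memory-corruption bug.

```python
def first_bad(commits, is_good):
    """Return the first failing commit.

    Assumes `commits` is ordered oldest to newest, commits[0] is known
    good, and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical history: placeholder names, failure introduced at "c4".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
culprit = first_bad(history, lambda c: int(c[1:]) < 4)
```

With only a handful of commits between the two hashes, this takes two or three build-and-run cycles instead of one per commit.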

mathomp4 commented 4 months ago

I'm adding @tclune to this because I am confused.

mathomp4 commented 4 months ago

Indeed, as @atrayano saw, I can run this code with History OFF and it still fails. And yet all the changes in 4486933fe6e7f0fcf7a122da7d65e11b650d87f0 were in History! Aaaaa!

mathomp4 commented 4 months ago

Note: if you turn off ExtData, it does run. So it seems like ExtData is the issue...but then this:

https://github.com/GEOS-ESM/MAPL/compare/27d47d4a3b4b9e2f426186a02a74cc4649432f43...4486933fe6e7f0fcf7a122da7d65e11b650d87f0

has no ExtData changes!

mathomp4 commented 4 months ago

New update! If I build MAPL3 GEOSgcm with GNU and use my Aggressive flags, it works! This is really looking like one of those "memory got mooshed around" sort of things (like GNU + MOM6 which randomly works then fails then works...)

mathomp4 commented 4 months ago

Well, I tried a GNU Release build with -O2 instead of -O3, but that still fails. So huh.

The regular release flags are (excluding flags common to Release and Aggressive):

Fortran_FLAGS = -O3 -march=znver2 -mtune=generic -funroll-loops -ffpe-trap=zero,overflow 

and the aggressive are:

Fortran_FLAGS = -O2 -march=native -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4  -mno-fma

So, I guess maybe I'll try a run without ffpe-trap?

ETA: Didn't help. 😞
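To see which flags are worth toggling one at a time, the two flag sets quoted above can be diffed mechanically. This is a rough sketch in Python: the flag strings are copied from this comment, and the helper names `split_flags` and `flag_diff` are made up for illustration.

```python
import shlex

RELEASE = "-O3 -march=znver2 -mtune=generic -funroll-loops -ffpe-trap=zero,overflow"
AGGRESSIVE = ("-O2 -march=native -ffast-math -ftree-vectorize -funroll-loops "
              "--param max-unroll-times=4 -mno-fma")


def split_flags(flags):
    """Tokenize a flag string, keeping '--param NAME=VALUE' as one flag."""
    tokens = shlex.split(flags)
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "--param" and i + 1 < len(tokens):
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def flag_diff(a, b):
    """Return (flags only in a, flags only in b), preserving order."""
    fa, fb = split_flags(a), split_flags(b)
    return [f for f in fa if f not in fb], [f for f in fb if f not in fa]


only_release, only_aggressive = flag_diff(RELEASE, AGGRESSIVE)
for flag in only_release:
    print("Release only:   ", flag)
for flag in only_aggressive:
    print("Aggressive only:", flag)
```

Each flag that appears on only one side is a candidate to swap into the failing build individually.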

bena-nasa commented 4 months ago

I tried running ExtDataDriver.x from the model build with release/MAPL-v3 in my "simulate gocart" mode, i.e., run with the same ExtData inputs the real model uses. It ran fine, so this definitely seems like a "memory got smooshed" issue.

mathomp4 commented 4 months ago

Dang. I might need to just fiddle with the flags in various places. I guess the ExtData gridcomp is the place to start.