
Crash in WMIOES due to tag number #711

Open benoitp-cmc opened 2 years ago

benoitp-cmc commented 2 years ago

Describe the bug

A crash was encountered in the WMIOES routine of wminiomd for same-rank grids when using a recent Intel compiler:

Abort(134850820) on node 76 (rank 76 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0xf95e880, count=1296, MPI_REAL, dest=10, tag=1500055, MPI_COMM_WORLD, request=0x4967900) failed
PMPI_Isend(95).: Invalid tag, value is 1500055

The MPI standard only guarantees tag values up to 32K (32767). Apparently, we use larger tags in this context.

To Reproduce

Steps to reproduce the behavior:

Probably not relevant.

Expected behavior

The model runs to completion.

Additional context

While the MPI standard only guarantees 32K for tags, Intel MPI can support larger tag values by setting the following variables to some combination of bit counts summing to 39:

export MPIR_CVAR_CH4_OFI_TAG_BITS=29
export MPIR_CVAR_CH4_OFI_RANK_BITS=10

With the above workaround, the model run completes as expected.
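As a sanity check for such settings, the effective limit can be queried at runtime through the standard MPI_TAG_UB attribute. A minimal sketch in plain MPI Fortran (not WW3 code):

PROGRAM QUERY_TAG_UB
  USE MPI
  IMPLICIT NONE
  INTEGER :: IERR
  INTEGER(KIND=MPI_ADDRESS_KIND) :: ATTR
  LOGICAL :: FLAG
  CALL MPI_INIT ( IERR )
  ! MPI_TAG_UB is a predefined attribute of MPI_COMM_WORLD;
  ! FLAG is .TRUE. when the implementation has set it.
  CALL MPI_COMM_GET_ATTR ( MPI_COMM_WORLD, MPI_TAG_UB, ATTR, FLAG, IERR )
  IF ( FLAG ) WRITE (*,*) 'Largest usable tag:', ATTR
  CALL MPI_FINALIZE ( IERR )
END PROGRAM QUERY_TAG_UB

Running this with and without the exports above should show whether the bound actually moved.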

phil-blain commented 2 years ago

Note that this is using the Intel compiler as well as Intel MPI from the oneAPI toolkit mentioned above.

aliabdolali commented 2 years ago

@ukmo-ccbunney @mickaelaccensi @thesser1 @JessicaMeixner-NOAA any idea/suggestion?

It seems the problem is the MTAG2 initial value of 1.5M (https://github.com/NOAA-EMC/WW3/blob/develop/model/src/wmmdatmd.F90/#L343), which ends up exceeding 1.5M in https://github.com/NOAA-EMC/WW3/blob/develop/model/src/wminiomd.F90/#L2712

ukmo-ccbunney commented 2 years ago

I had this issue many years ago on our Cray with version 3.14 of the model.

The MPI specification states that the MPI_TAG variable must have an upper bound of at least 32767 (essentially 15 bits, or a signed short). I believe this is because it forms part of a "message envelope" in conjunction with the source, destination and communicator values.

Many MPI implementations allow for a much larger TAG value than 32767 - the MPICH implementation on our Cray XC allows for 21 bits. However, this meant that the MTAG2=3000000 value in WW3 vn3.14 was too large. I reduced the TAG numbers to the following to get our 3-grid multi-grid model working:

!/MPI      INTEGER, PARAMETER      :: MTAG0 = 1000000
!/MPI      INTEGER, PARAMETER      :: MTAG1 = 1100000  ! ChrisB: Lowered range 2-Feb-15 (was 2000000)
!/MPI      INTEGER, PARAMETER      :: MTAG2 = 1200000  ! ChrisB: Lowered range 2-Feb-15 (was 3000000)

These values might be too low, though, for models with more grids; I am not sure.

I note that the current MTAG2 value at the HEAD of develop is 1500000, which is low enough to fit within the 2^21 MPI_TAG limit set by Cray MPICH (2^21 - 1 = 2,097,151). I believe it has been at this value since WW3 v4.x (which is why v4+ has worked fine for us on the Cray).

However, it would only take a reduction of the tag size by 1 bit (2^20 = 1,048,576) to break this.

Ideally, we should be staying within the upper limit defined by the standard (32767), but this would require a rethink of how the tags are used in the multigrid model driver. Perhaps we could get around this by using more MPI Communicators (a naïve suggestion as I don't know much about the multigrid MPI implementation).
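To make the communicator suggestion concrete: a hedged sketch (the communicator names below are hypothetical, not WW3 identifiers) in which each class of exchange duplicates MPI_COMM_WORLD, so each class gets its own independent message-matching context and only needs tags within the portable 0..32767 range:

PROGRAM TAG_SPACES
  USE MPI
  IMPLICIT NONE
  INTEGER :: IERR, COMM_BND, COMM_HGH, COMM_EQL
  CALL MPI_INIT ( IERR )
  ! Each duplicate carries its own matching context, so a send tagged 55
  ! on COMM_HGH can never match a receive tagged 55 on COMM_EQL.
  CALL MPI_COMM_DUP ( MPI_COMM_WORLD, COMM_BND, IERR )  ! boundary data
  CALL MPI_COMM_DUP ( MPI_COMM_WORLD, COMM_HGH, IERR )  ! high/low staging
  CALL MPI_COMM_DUP ( MPI_COMM_WORLD, COMM_EQL, IERR )  ! equal-rank staging
  CALL MPI_COMM_FREE ( COMM_BND, IERR )
  CALL MPI_COMM_FREE ( COMM_HGH, IERR )
  CALL MPI_COMM_FREE ( COMM_EQL, IERR )
  CALL MPI_FINALIZE ( IERR )
END PROGRAM TAG_SPACES

Each MTAG range would then map to its own communicator instead of a tag offset.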

aliabdolali commented 2 years ago

Going back to a similar problem in Sep 2016 for Cray (DIFF): model/src/wmmdatmd.F90#L340-L344

The MTAG values were changed from

!/MPI      INTEGER, PARAMETER      :: MTAGB =  900000
!/MPI      INTEGER, PARAMETER      :: MTAG0 = 1000000
!/MPI      INTEGER, PARAMETER      :: MTAG1 = 2000000
!/MPI      INTEGER, PARAMETER      :: MTAG2 = 3000000

to

      INTEGER, PARAMETER      :: MTAGB = 0   !< MTAGB
      INTEGER, PARAMETER      :: MTAG0 = 1000   !< MTAG0
      INTEGER, PARAMETER      :: MTAG1 = 100000   !< MTAG1
      INTEGER, PARAMETER      :: MTAG2 = 1500000   !< MTAG2
      INTEGER, PARAMETER      :: MTAG_UB = 2**21-1 !< MPI_TAG_UB on Cray XC40

Lowering MTAG2 to 500k solves the problem for the existing regtests plus the GFSv16 and GEFSv12 configurations (the itag number does not exceed 216k for the largest grids we have). However, the tag ranges would overlap for larger grids.

There are multiple locations in model/src/wminiomd.F90 where ITAG is defined and then checked to make sure it does not exceed the next upper limit:

ITAG   = MTAG0 + IMOD + (J-1)*NRGRD
IF ( ITAG .GT. MTAG1 ) THEN

IT0    = MTAG0 + NRGRD**2 + SUM(NBI2G(1:J-1,:)) + SUM(NBI2G(J,1:IMOD-1))
DO I=1, NBI2G(J,IMOD)
  DO IP=1, NMPROC
    ITAG   = IT0 + I
    IF ( ITAG .GT. MTAG1 ) THEN

ITAG   = MTAG0 + J + (IMOD-1)*NRGRD
IF ( ITAG .GT. MTAG1 ) THEN

IT0    = MTAG0 + NRGRD**2 + SUM(NBI2G(1:IMOD-1,:)) + SUM(NBI2G(IMOD,1:J-1))
DO I=1, NBI2G(IMOD,J)
  ITAG   = IT0 + I
  IF ( ITAG .GT. MTAG1 ) THEN

IT0    = MTAG1 + 1
ITAG   = HGSTGE(J,IMOD)%ISEND(I,5) + IT0
IF ( ITAG .GT. MTAG2 ) THEN

IT0    = MTAG1 + 1
ITAG   = HGSTGE(IMOD,J)%ITAG(I,ILOC) + IT0
IF ( ITAG .GT. MTAG2 ) THEN

IT0    = MTAG2 + 1
ITAG   = EQSTGE(J,IMOD)%STG(I) + IT0
IF ( ITAG .GT. MTAG_UB ) THEN

IT0    = MTAG2 + 1
ITAG   = EQSTGE(IMOD,J)%RTG(I,IA) + IT0
IF ( ITAG .GT. MTAG_UB ) THEN
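Since these checks compare against hard-coded constants (including MTAG_UB = 2**21-1, which only reflects the Cray XC40), an alternative would be to compare against the bound the MPI library actually advertises. A hedged sketch of such a guard (CHECK_TAG is a hypothetical helper, not WW3 code):

SUBROUTINE CHECK_TAG ( ITAG, IERR )
  USE MPI
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: ITAG
  INTEGER, INTENT(OUT) :: IERR
  INTEGER(KIND=MPI_ADDRESS_KIND) :: ATTR
  LOGICAL :: FLAG
  ! Query the implementation's true tag limit instead of assuming 2**21-1.
  CALL MPI_COMM_GET_ATTR ( MPI_COMM_WORLD, MPI_TAG_UB, ATTR, FLAG, IERR )
  IF ( FLAG .AND. ITAG .GT. INT(ATTR) ) THEN
    WRITE (*,*) 'Tag', ITAG, 'exceeds MPI_TAG_UB', INT(ATTR)
    CALL MPI_ABORT ( MPI_COMM_WORLD, 1, IERR )
  END IF
END SUBROUTINE CHECK_TAG

Calling such a guard before each ISEND would turn the Invalid tag abort into a clearer diagnostic.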
benoitp-cmc commented 2 years ago

Here is a more difficult test case including the 3rd grid in our setup: https://hpfx.collab.science.gc.ca/~bpo001/WW3/issue_711/same_rank_3_grids.tar.gz

With 10 CPUs this test case works, but with 80 CPUs it gives me:

Abort(1007266052) on node 27 (rank 27 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x13159680, count=1296, MPI_REAL, dest=7, tag=1500059, MPI_COMM_WORLD, request=0x48526e0) failed
PMPI_Isend(95).: Invalid tag, value is 1500059
Abort(201959684) on node 17 (rank 17 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x12658b60, count=1296, MPI_REAL, dest=7, tag=1500016, MPI_COMM_WORLD, request=0x3d556e0) failed
PMPI_Isend(95).: Invalid tag, value is 1500016
Abort(67741956) on node 71 (rank 71 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x120da0a0, count=1296, MPI_REAL, dest=3, tag=1500079, MPI_COMM_WORLD, request=0x387cbc0) failed
PMPI_Isend(95).: Invalid tag, value is 1500079
MatthewMasarik-NOAA commented 2 years ago

@benoitp-cmc, I've updated the PR #825 description to note that this issue is addressed, but not resolved, by it. Thank you for providing this updated test case. It will be useful to have as a good same-rank sample case when the opportunity or need arises to revisit this.