Open benoitp-cmc opened 2 years ago
Note that this is using the Intel compiler as well as Intel MPI from the oneAPI toolkit mentioned above.
The regtests with Intel 2022 and Intel MPI 2022 on multi-grid cases fail with tag=1500000+. Digging more, the problem appears to be https://github.com/NOAA-EMC/WW3/blob/develop/model/src/wminiomd.F90/#L2712 where IT0=MTAG2+1, and therefore ITAG > 1500000. The MTAG2 initial value of 1500000 is defined in https://github.com/NOAA-EMC/WW3/blob/develop/model/src/wmmdatmd.F90/#L343 We could lower the MTAG2 initial value of 1500000, but what number would be acceptable without overlapping other itags, not just for the small regtests but for large operational applications?
The other workaround, as suggested here, is:
export MPIR_CVAR_CH4_OFI_TAG_BITS=28
export MPIR_CVAR_CH4_OFI_RANK_BITS=11
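As a quick way to confirm which tag limit is actually in effect (with or without those environment variables), a small standalone check of the MPI_TAG_UB attribute can be run on the same system. This is just a sketch, not part of WW3:

program check_tag_ub
  use mpi
  implicit none
  integer :: ierr
  logical :: flag
  integer(kind=MPI_ADDRESS_KIND) :: tag_ub
  call MPI_Init(ierr)
  ! MPI_TAG_UB is a predefined attribute of MPI_COMM_WORLD giving the
  ! largest tag value this MPI implementation accepts
  call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, tag_ub, flag, ierr)
  if (flag) print *, 'MPI_TAG_UB = ', tag_ub
  call MPI_Finalize(ierr)
end program check_tag_ub

Running this under the same launcher shows directly whether the exported CVARs raised the limit above the ~1.5M tags that the multi-grid driver generates.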
@ukmo-ccbunney @mickaelaccensi @thesser1 @JessicaMeixner-NOAA any idea/suggestion?
It seems the problem is the MTAG2 initial value of 1.5M in https://github.com/NOAA-EMC/WW3/blob/develop/model/src/wmmdatmd.F90/#L343 which ends up producing tags above 1.5M in https://github.com/NOAA-EMC/WW3/blob/develop/model/src/wminiomd.F90/#L2712
I had this issue many years ago on our Cray with version 3.14 of the model.
The MPI specification states that the MPI_TAG value must have an upper bound of at least 32767 (essentially 15 bits, or a signed short). I believe this is because it forms part of a "message envelope" in conjunction with the source, destination and communicator values.
Many MPI implementations allow for a much larger TAG value than 32767 - the MPICH implementation on our Cray XC allows for 21 bits. However, this means that the MTAG2=3000000 value in WW3 vn3.14 was too large. I reduced the tag numbers to the following to get our 3-grid multi-grid model working:
!/MPI INTEGER, PARAMETER :: MTAG0 = 1000000
!/MPI INTEGER, PARAMETER :: MTAG1 = 1100000 ! ChrisB: Lowered range 2-Feb-15 (was 2000000)
!/MPI INTEGER, PARAMETER :: MTAG2 = 1200000 ! ChrisB: Lowered range 2-Feb-15 (was 3000000)
These values might be too low though for models with more grids? I am not sure.
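One rough way to sanity-check that, using the ITAG expressions quoted further down in this thread (IT0 = MTAG0 + NRGRD**2 + partial sums of NBI2G, with ITAG = IT0 + I), is to bound the largest tag the MTAG0..MTAG1 block can generate for a given configuration. This is only an illustrative estimate derived from those excerpts, not WW3 code:

! Illustrative bound on the largest tag in the MTAG0..MTAG1 block.
! nrgrd = number of grids, nbi2g = grid-to-grid boundary point counts,
! named after the NRGRD/NBI2G arrays used in the quoted expressions.
integer function max_tag_mtag0_block(mtag0, nrgrd, nbi2g)
  implicit none
  integer, intent(in) :: mtag0, nrgrd
  integer, intent(in) :: nbi2g(nrgrd, nrgrd)
  ! The partial sums plus the loop index I can never exceed the full
  ! sum of NBI2G, so this is an upper bound on ITAG in that block.
  max_tag_mtag0_block = mtag0 + nrgrd**2 + sum(nbi2g)
end function max_tag_mtag0_block

If that bound approaches MTAG1 for a configuration with many grids, the first tag range is too small.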
I note that the current MTAG2 value in the HEAD of develop is 1500000, which is low enough to fit within the 2^21 MPI_TAG limit set by Cray MPICH. I believe it has been at this value since WW3 v4.x (which is why v4+ has worked fine for us on the Cray).
However, it would only take a reduction of the tag size by 1 bit (2^20 = 1,048,576) to break this.
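For concreteness (the usable maxima being 2^n - 1):

2^21 - 1 = 2,097,151  >  MTAG2 = 1,500,000   (21 tag bits: the current tags fit)
2^20 - 1 = 1,048,575  <  MTAG2 = 1,500,000   (20 tag bits: anything above 1,048,575, including all tags based on MTAG2+1, is rejected)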
Ideally, we should stay within the upper limit defined by the standard (32767), but this would require a rethink of how the tags are used in the multi-grid model driver. Perhaps we could get around this by using more MPI communicators (a naïve suggestion, as I don't know much about the multi-grid MPI implementation).
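To make that communicator idea slightly more concrete, here is a minimal sketch (purely illustrative, not a proposal for the actual WW3 routines) of duplicating MPI_COMM_WORLD once per grid, so each grid's point-to-point traffic gets its own independent tag space and the tags themselves could stay below 32767:

program comm_per_grid_sketch
  use mpi
  implicit none
  integer, parameter :: nrgrd = 3          ! assumed number of grids
  integer :: grid_comm(nrgrd)              ! one communicator per grid
  integer :: ierr, j
  call MPI_Init(ierr)
  do j = 1, nrgrd
    ! Tags only need to be unique within a communicator, so each
    ! duplicated communicator provides its own tag space
    call MPI_Comm_dup(MPI_COMM_WORLD, grid_comm(j), ierr)
  end do
  ! ... sends/receives for grid j would use grid_comm(j) with small tags ...
  do j = 1, nrgrd
    call MPI_Comm_free(grid_comm(j), ierr)
  end do
  call MPI_Finalize(ierr)
end program comm_per_grid_sketch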
Going back to a similar problem in Sep 2016 for Cray (DIFF): model/src/wmmdatmd.F90#L340-L344
The MTAG parameters were changed from
!/MPI INTEGER, PARAMETER :: MTAGB = 900000
!/MPI INTEGER, PARAMETER :: MTAG0 = 1000000
!/MPI INTEGER, PARAMETER :: MTAG1 = 2000000
!/MPI INTEGER, PARAMETER :: MTAG2 = 3000000
to
INTEGER, PARAMETER :: MTAGB = 0 !< MTAGB
INTEGER, PARAMETER :: MTAG0 = 1000 !< MTAG0
INTEGER, PARAMETER :: MTAG1 = 100000 !< MTAG1
INTEGER, PARAMETER :: MTAG2 = 1500000 !< MTAG2
INTEGER, PARAMETER :: MTAG_UB = 2**21-1 !< MPI_TAG_UB on Cray XC40
Lowering MTAG2 to 500k solves the problem for the existing regtests plus the GFSv16 and GEFSv12 configurations (the itag number does not exceed 216k for the largest grids we have). However, the tag numbers would overlap for larger grids.
There are multiple locations in model/src/wminiomd.F90 where ITAG is defined and then checked to make sure it does not exceed the next upper limit:
ITAG = MTAG0 + IMOD + (J-1)*NRGRD
IF ( ITAG .GT. MTAG1 ) THEN
IT0 = MTAG0 + NRGRD**2 + SUM(NBI2G(1:J-1,:)) + &
SUM(NBI2G(J,1:IMOD-1))
DO I=1, NBI2G(J,IMOD)
DO IP=1, NMPROC
ITAG = IT0 + I
IF ( ITAG .GT. MTAG1 ) THEN
ITAG = MTAG0 + J + (IMOD-1)*NRGRD
IF ( ITAG .GT. MTAG1 ) THEN
IT0 = MTAG0 + NRGRD**2 + SUM(NBI2G(1:IMOD-1,:)) &
+ SUM(NBI2G(IMOD,1:J-1))
DO I=1, NBI2G(IMOD,J)
ITAG = IT0 + I
IF ( ITAG .GT. MTAG1 ) THEN
IT0 = MTAG1 + 1
ITAG = HGSTGE(J,IMOD)%ISEND(I,5) + IT0
IF ( ITAG .GT. MTAG2 ) THEN
IT0 = MTAG1 + 1
ITAG = HGSTGE(IMOD,J)%ITAG(I,ILOC) + IT0
IF ( ITAG .GT. MTAG2) THEN
IT0 = MTAG2 + 1
ITAG = EQSTGE(J,IMOD)%STG(I) + IT0
IF ( ITAG .GT. MTAG_UB ) THEN
IT0 = MTAG2 + 1
ITAG = EQSTGE(IMOD,J)%RTG(I,IA) + IT0
IF ( ITAG .GT. MTAG_UB ) THEN
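Whatever values are chosen, it might also be worth failing with a clear message before the Isend rather than inside MPI. A minimal sketch of such a guard (a hypothetical helper, not existing WW3 code; the model would presumably route the abort through its own error handling) could look like:

! Hypothetical helper: validate a computed tag against the runtime
! MPI_TAG_UB attribute before posting the send
subroutine check_itag(itag)
  use mpi
  implicit none
  integer, intent(in) :: itag
  integer :: ierr
  logical :: flag
  integer(kind=MPI_ADDRESS_KIND) :: tag_ub
  call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, tag_ub, flag, ierr)
  if (flag .and. itag > tag_ub) then
    write (*,*) 'ITAG ', itag, ' exceeds MPI_TAG_UB ', tag_ub
    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if
end subroutine check_itag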
Here is a more difficult test case including the 3rd grid in our setup: https://hpfx.collab.science.gc.ca/~bpo001/WW3/issue_711/same_rank_3_grids.tar.gz
With 10 CPUs this test case works, but with 80 CPUs it gives me:
Abort(1007266052) on node 27 (rank 27 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x13159680, count=1296, MPI_REAL, dest=7, tag=1500059, MPI_COMM_WORLD, request=0x48526e0) failed
PMPI_Isend(95).: Invalid tag, value is 1500059
Abort(201959684) on node 17 (rank 17 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x12658b60, count=1296, MPI_REAL, dest=7, tag=1500016, MPI_COMM_WORLD, request=0x3d556e0) failed
PMPI_Isend(95).: Invalid tag, value is 1500016
Abort(67741956) on node 71 (rank 71 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x120da0a0, count=1296, MPI_REAL, dest=3, tag=1500079, MPI_COMM_WORLD, request=0x387cbc0) failed
PMPI_Isend(95).: Invalid tag, value is 1500079
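Incidentally, the rejected values narrow down the effective limit on this build: 1500016 is refused, so MPI_TAG_UB here must be below 1,500,016; if it has the usual 2^n - 1 form, that is at most 20 tag bits:

2^20 - 1 = 1,048,575 < 1,500,016  (rejected above)
2^21 - 1 = 2,097,151 > 1,500,079  (would have been accepted)

which is consistent with the tag-bits discussion above and with the MPIR_CVAR_CH4_OFI_TAG_BITS=28 workaround making the run pass.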
@benoitp-cmc, I've updated the PR #825 description to note this issue is addressed but not resolved by it. Thank you for providing this updated test case. It will be useful to have as a good same rank sample case when opportunity or need arises to revisit this.
Describe the bug
Crash encountered in the WMIOES routine of wminiomd for same rank grids when using a recent Intel compiler. The MPI standard only guarantees 32K for tags; apparently, we use more in this context.
To Reproduce
Steps to reproduce the behavior:
Probably not relevant:
Expected behavior
The model to run to completion.
Additional context
While the MPI standard only guarantees 32K for tags, we can get more tags by setting the following variables to a combination of values summing to 39:
export MPIR_CVAR_CH4_OFI_TAG_BITS=28
export MPIR_CVAR_CH4_OFI_RANK_BITS=11
With the above workaround, the model run completes as expected.