LLNL / SAMRAI

Structured Adaptive Mesh Refinement Application Infrastructure - a scalable C++ framework for block-structured AMR application development
https://computing.llnl.gov/projects/samrai

MPI_ERR_TAG: invalid tag error #206

Closed · nicolasaunai closed this issue 1 year ago

nicolasaunai commented 2 years ago

Hi,

We're currently getting all of our runs failing with this error:

MPI error : "An error occurred in MPI_Irecv"  ; "MPI_ERR_TAG: invalid tag" 
"PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198"

The code works for thousands of time steps on a tagged simulation with 3 levels, then the runs all crash with this error message at about the same time. These test models currently run on 20 cores. Not sure to what extent this is related, but we use the TreeLoadBalancer and BergerRigoutsos at the moment.

I have seen in SAMRAI's code base that some tags are computed, like

https://github.com/LLNL/SAMRAI/blob/aaa3853342a35581becb3268b842aa6e3de4e34e/source/SAMRAI/mesh/BalanceUtilities.cpp#L2301-L2302

A priori, to be valid, tags should be between 0 and MPI_TAG_UB (I don't know what that value is, but it is presumably larger than any tag one would want to compute).
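Just to illustrate what I mean, here is the kind of sanity check I would expect to hold for any computed tag (a rough sketch of mine, not SAMRAI code; the function name and arguments are made up):

#include <mpi.h>

// Rough sketch (not SAMRAI code): check that a computed tag falls in the
// valid range [0, MPI_TAG_UB] of the communicator before using it.
bool tag_is_valid(MPI_Comm comm, int computed_tag)
{
    void* attr_val = nullptr;
    int found = 0;
    MPI_Comm_get_attr(comm, MPI_TAG_UB, &attr_val, &found);
    // The MPI standard guarantees MPI_TAG_UB is at least 32767.
    const int tag_ub = found ? *static_cast<int*>(attr_val) : 32767;
    return computed_tag >= 0 && computed_tag <= tag_ub;
}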

Have you already experienced such an issue, and/or do you have an idea of what could go wrong or what would be worth investigating?

Thanks

nselliott commented 2 years ago

Are you able to identify which MPI_Irecv call is giving you this error? I haven't seen tag values go beyond the allowed MPI upper bound, but I do see some cases where the tag value is unexpectedly large during the execution of BergerRigoutsos, so I'm taking a look at that.

nicolasaunai commented 2 years ago

We have not yet identified the specific call. On fewer occasions, the invalid tag error also mentions Isend. We have also noticed that the same error happens when using TileClustering, so it is probably not BergerRigoutsos. It never happens if the maximum number of levels is 1. Another (quite weird) hint is that jobs always seem to crash around the same time (in our units :) t=85ish: if restarted from t=1 they re-crash around that time, but if restarted at t=80 they go past this crash time up to t=165 (roughly twice the first crash time) and crash again. This makes me think of a memory leak, but we have not found any so far, and I don't see how that could cause an invalid tag error anyway... maybe something that grows over time.

nselliott commented 2 years ago

I found a past report from a user who was running out of valid tags for BergerRigoutsos; it turned out to be a result of their MPI installation having an unusually small value for MPI_TAG_UB. We didn't change anything in SAMRAI in that case, as their solution was to use a better MPI installation. Other symptoms in their case don't match what you report, so I doubt this is what is causing your error, but you can check your MPI_TAG_UB value with a call to MPI_Attr_get(). In their case the error happened immediately rather than emerging over time, and TileClustering worked even with the small MPI_TAG_UB value.

I will keep checking to see if I can find anywhere that our calculated values for the MPI tag would tend to grow over time.

nicolasaunai commented 2 years ago

ok, I ran this:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int NoOfProcess;
    int ProcessNo = 0;
    int* tagUB;  // MPI owns the attribute storage; we receive a pointer to its int value
    int err;     // actually the "flag" output of MPI_Comm_get_attr: 1 means the attribute was found

    MPI_Comm_rank(MPI_COMM_WORLD, &ProcessNo);
    MPI_Comm_size(MPI_COMM_WORLD, &NoOfProcess);

    // Query the largest tag value allowed on MPI_COMM_WORLD.
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tagUB, &err);

    std::cout << ProcessNo << " among " << NoOfProcess
              << " and tagUb = " << *tagUB << " err = " << err << "\n";

    MPI_Finalize();
}

On my machine with Open MPI 4.1.1:

 mpirun -n 10 ./tag.exe

and got :

5 among 10 and tagUb = 2147483647 err = 1
0 among 10 and tagUb = 2147483647 err = 1
1 among 10 and tagUb = 2147483647 err = 1
2 among 10 and tagUb = 2147483647 err = 1
3 among 10 and tagUb = 2147483647 err = 1
4 among 10 and tagUb = 2147483647 err = 1
6 among 10 and tagUb = 2147483647 err = 1
7 among 10 and tagUb = 2147483647 err = 1
9 among 10 and tagUb = 2147483647 err = 1
8 among 10 and tagUb = 2147483647 err = 1

I did the same on the local cluster where I run the tests mentioned in this issue, with Open MPI 4.1.0, and got:

0 among 10 and tagUb = 8388607 err = 1
1 among 10 and tagUb = 8388607 err = 1
2 among 10 and tagUb = 8388607 err = 1
3 among 10 and tagUb = 8388607 err = 1
4 among 10 and tagUb = 8388607 err = 1
5 among 10 and tagUb = 8388607 err = 1
6 among 10 and tagUb = 8388607 err = 1
7 among 10 and tagUb = 8388607 err = 1
9 among 10 and tagUb = 8388607 err = 1
8 among 10 and tagUb = 8388607 err = 1

With a different MPI version they provide (Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923), I get:

-bash-4.2$ mpirun -n 10 ./tag.exe
3 among 10 and tagUb = 1048575 err = 1
2 among 10 and tagUb = 1048575 err = 1
9 among 10 and tagUb = 1048575 err = 1
7 among 10 and tagUb = 1048575 err = 1
0 among 10 and tagUb = 1048575 err = 1
8 among 10 and tagUb = 1048575 err = 1
4 among 10 and tagUb = 1048575 err = 1
5 among 10 and tagUb = 1048575 err = 1
6 among 10 and tagUb = 1048575 err = 1
1 among 10 and tagUb = 1048575 err = 1

So it turns out that the tag upper bound is indeed significantly lower on the cluster we perform tests on... I'll check with the admins why that is. My machine is a Fedora workstation with a packaged Open MPI install, nothing fancy done there, so I'm a bit puzzled why the cluster versions would have much lower tag limits.

In any case, it's still a bit odd that tags only become invalid after a large number of steps.

One thing I should perhaps mention: in my current test runs I observe that SAMRAI seems to tag and regrid the hierarchy every coarsest time step, or even more often for some levels. That seems excessive and is probably not the canonical way it is supposed to be used (typically I would think one wants to tag at a pace that depends on how fast the solution evolves, and I think the times at which tagging occurs should be specified in the inputs of the StandardTagAndInitialize instance?). This probably results in many calls to the load balancer (since the clustering does not seem to be the source of the problem, I blame the balancer!) and ends up increasing some tags more than they would in "normal" usage...

nselliott commented 2 years ago

I found a place where SAMRAI computes strictly increasing tag values, and I think reusing constant values there will be entirely safe. It is possible that these values eventually reach the upper bound on your systems with smaller MPI_TAG_UB values. #209 has a preliminary fix, if you would like to try it.
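To sketch the general idea (this is only an illustration, not the actual change in #209): rather than letting a tag grow monotonically across regrids, one can derive it from a fixed base plus a wrapped counter so it always stays below MPI_TAG_UB.

#include <mpi.h>

// Illustration only (not the actual #209 change): build a tag from a fixed
// base and a wrapped counter so it never exceeds the communicator's MPI_TAG_UB.
int bounded_tag(MPI_Comm comm, int base_tag, long counter)
{
    void* attr_val = nullptr;
    int found = 0;
    MPI_Comm_get_attr(comm, MPI_TAG_UB, &attr_val, &found);
    const int tag_ub = found ? *static_cast<int*>(attr_val) : 32767;  // 32767 is the guaranteed minimum
    const int range = tag_ub - base_tag + 1;  // assumes 0 <= base_tag <= tag_ub
    return base_tag + static_cast<int>(counter % range);
}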

nicolasaunai commented 2 years ago

No more crashes with the proposed fix.