QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

For sane uniform error handling we need an MPI Barrier with timeout. #3760

Open PDoakORNL opened 2 years ago

PDoakORNL commented 2 years ago

This can be accomplished using MPI_Ibarrier. It's probably not too hard to add support for this to mpi3, although it's clearly complicated by mpi3's communication modes. Alfredo, I'm not really grasping how the synchronous and asynchronous MPI calls and the various apparently sync and async communication modes interact.

Our use case is after a fatal error has occurred, one which is likely to occur on all ranks and therefore frequently preempts reporting by the head node. So we could just spin on an MPI_Probe for a specified timeout; the process wasting CPU by spinning is already "dead" anyway.

Still, all we need is an ibarrier call that returns a request plus a probe command, and we can implement the timeout and spin at the application level.
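
For concreteness, here is a minimal sketch at the raw MPI level of the pattern I have in mind (the helper name and parameters are just for illustration): MPI_Ibarrier hands back a request, and MPI_Test plays the role of the probe.

#include <mpi.h>
#include <cstdio>

void barrier_abort_with_timeout(MPI_Comm comm, double timeout_seconds, const char* msg) {
    MPI_Request req;
    MPI_Ibarrier(comm, &req);  // every rank that reaches the error posts this
    int done = 0;
    double const t0 = MPI_Wtime();
    while (!done && MPI_Wtime() - t0 < timeout_seconds)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // spin on the request instead of blocking in MPI_Wait
    std::fprintf(stderr, "%s\n", msg);  // the spinning process is already "dead", so the busy-wait costs nothing
    MPI_Abort(comm, 1);  // abort whether or not the barrier completed in time
}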

PDoakORNL commented 2 years ago

This would allow us to use UniformCommunicateError as intended and stop swallowing output the user should be getting.

prckent commented 2 years ago

The most frequent use case for this is likely in initialization/startup. We don't know the order in which different ranks will reach problems (user input errors, memory availability). While problems can occur later on, we need the user to see what went wrong, e.g. in their whole-machine run with 1M ranks.

ye-luo commented 2 years ago

I'm not following what the issue to solve is or what the advantage of the proposed method is. Could you elaborate a bit more?

correaa commented 2 years ago

yes, can you elaborate? i am very interested in implementing something like this, or even something at a (slightly) higher level, if i can see the bigger picture.

if i understand correctly, you want to report an error and issue an ibarrier at that point, while succeeding processes will just issue an ibarrier on success?

an example with pseudocode would also help me understand the pattern, if you have one in mind.

PDoakORNL commented 2 years ago

We throw UniformCommunicateError when we have an exception state that we know every rank will experience. We need a timeout to deal with the edge case of a rank not making it to that error case. Otherwise, as @ye-luo points out, this can hang until the job times out.

Hanging and wasting huge amounts of HPC time is a more severe failure than losing error output.

void barrierAndAbortWithTimeOut(const std::string& error, int timeout)
{
  auto req_barrier = mpi3_comm.ibarrier();
  auto wait_start = time();
  while (time() - wait_start < timeout)
  {
    if (mpi3_comm.probe(req_barrier) == EVERYONE_HAS_ENTERED)
    {
      mpi3_comm.wait(req_barrier); // seems like we should still clear the request
      if (mpi3_comm.rank() == 0)
        mpi_abort(error);
      return; // other ranks just wait to be taken down by rank 0's abort
    }
  }
  mpi_abort("timeout error"); // at this point we just need to make sure the abort occurs
}

try
{
  validateXML(cur);
}
catch(const UniformCommunicateError& uce)
{
  barrierAndAbortWithTimeOut(uce.what(), input_.get_error_timeout());
}

PDoakORNL commented 2 years ago

It's completely possible this would be better accomplished with regular async messages, but I think there is a way to use the ibarrier request to avoid some of this work.

williamfgc commented 2 years ago

Question out of curiosity, as I might be missing something in @PDoakORNL's requirements: shouldn't MPI_Abort be sufficient? It's not clean, but it can be put inside a catch block. It should still do the job of not wasting core-hours, at the expense of reporting only the first error rather than all of them. Or is reporting every single rank that fails the desired outcome?

correaa commented 2 years ago

That's a good question: what is gained by delaying the MPI_Abort? Is it that the messages are clearer?

I see two main situations:

PDoakORNL commented 2 years ago

The first case is the concern here.

In the first case, not all ranks fail at that uniform error. The first rank that fails calls MPI_Abort and everything else goes down because some other process aborted.

correaa commented 2 years ago

Yes, I understand. Please let me review the logic.

So the motivation is that we are at a point where, if anything fails (even locally), the whole MPI program must terminate.

The idea is that there must be a Barrier just before the abort so all ranks do what they need to do (e.g. even just printing the same message before aborting). If we are sure all processes will eventually fail (e.g. a file that all ranks read is missing), a normal Barrier should be enough for this...

... However, there is a small risk (e.g. 1 in 100) that we are not completely sure (locally) that all ranks really failed, so putting a barrier is too strong and risks wasting 12 hours of HPC time. So we do something intermediate: we are "almost" sure that all will fail, so we issue the Abort only after a Barrier has been "tried" for a while (e.g. 2 minutes).

Is that correct?

(We can propose IAbort to MPI 5.0 ;) )

correaa commented 2 years ago

Ok, I have been doing some systematic exploration.

This is what I think this is useful for. We have to recognize that this is an optimization on top of a failure case. In general, optimizing for a failure case is something we shouldn't do, but with HPC, life is different. It seems that failing fast is part of the trade.

Let's say one has 12 hours of HPC time, and there is a failure, which might be global or not; we think it is global, but there is a small chance that it is not. In any case, we want to abort the whole program and not continue any further locally.

If we issue Abort unconditionally we will miss the opportunity for other processes to confirm that the error is global, or give their own details, or perform their shutdown operations. Even having more than one error message in the log from different processes can help us determine if the error was global or not.

The alternative to Abort is to spin and give the other processes their whole 12 hours to do this; we know that is a waste, so we introduce a timeout, let's say 5 minutes.

At this point there is an option that has not been mentioned so far, which is to just put a sleep before Abort.

using namespace std::chrono_literals;
std::this_thread::sleep_for(5min);
ABORT_MPI_HERE;

(note that there is no barrier)

This already fits the bill IMO. If there is a failure, we don't waste 12h, and we give the other processes some time (5 minutes) to shut down.

So far so good. The problem with this handling is that it will always take 5 minutes to shut down after the first error in any process is found.

So we want to "optimize" for this case: if all ranks are failing quickly, we want a way to cut those 5 minutes short.

With IBarrier this can be implemented this way:

        auto rbarrier = comm.ibarrier();
        auto const t0 = mpi3::wall_time();
        using namespace std::chrono_literals;
        while(not rbarrier.completed() and (mpi3::wall_time() - t0) < 5min) {}
        comm.abort();

This way we save not only the 12h minus 5min of runtime (assuming the error is at the start) but almost the whole 12h if the failure really was uniform.

I coded all the possible cases for uniform_fail and nonuniform_fail and the different ways to handle them that I could imagine. See the code below. The interesting case is the last block. Note that exceptions have nothing to do with the different strategies to handle this. (It was easier to code with errors as exceptions, but it could be error codes as well.)

There are some outstanding problems I am trying to understand:

0) IBarrier is not fundamental to implementing this; if there is something to communicate, like an error code, IBarrier can be replaced by an IReduce, and if it succeeds one can take advantage of that.
1) If the communicator is not world, are we gaining something by this? Yes, we can print all "local" errors inside a subgroup, but processes in other communicators will not shut down in any way, so the gain is smaller.
2) If the communicator IS world (or congruent with it), and we know the ibarrier was not timed out, why call MPI_Abort at all? We could call terminate (since all processes are calling), or exit with an error code, or throw (the same exception), which, if not caught, will shut down the environment without calling abort.
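
For point 0, a rough sketch of the IReduce variant, with raw MPI and illustrative names only: a completed nonblocking all-reduce doubles as the barrier and also tells every rank the worst error code seen.

#include <mpi.h>

void reduce_abort_with_timeout(MPI_Comm comm, int local_err, double timeout_seconds) {
    int max_err = 0;
    MPI_Request req;
    // reduce per-rank error codes instead of issuing a bare ibarrier
    MPI_Iallreduce(&local_err, &max_err, 1, MPI_INT, MPI_MAX, comm, &req);

    int done = 0;
    double const t0 = MPI_Wtime();
    while (!done && MPI_Wtime() - t0 < timeout_seconds)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // poll with a timeout, as before

    // if done, every rank arrived, and max_err holds the globally worst error code
    MPI_Abort(comm, done ? max_err : 1);
}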

the code:

#include "../../mpi3/main.hpp"
#include "../../mpi3/communicator.hpp"

#include <chrono>
#include <stdexcept>  // std::runtime_error
#include <thread>

namespace mpi3 = boost::mpi3;

// failures

void uniform_fail(mpi3::communicator& comm) {
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(comm.rank() * 1s);

    std::cout<< "uniform_fail in n = "<< comm.rank() <<" is about to fail" <<std::endl;
    throw std::logic_error{"global but unsynchronized error"};
}

void nonuniform_fail(mpi3::communicator& comm) {
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(comm.rank() * 1s);

    if(comm.rank() > 2){
        std::cout<< "nonuniform_fail in n = "<< comm.rank() <<" is about to fail" <<std::endl;
        throw std::logic_error{"nonglobal error"};
    }
}

// handlers

void unconditional_abort(mpi3::communicator& comm) {
    std::cout<< "not essential message: aborting from rank "<< comm.rank() <<std::endl;
    comm.abort();
}

void barriered_abort(mpi3::communicator& comm) {
    comm.barrier();
    std::cout<< "not essential message: aborting from rank "<< comm.rank() <<std::endl;
    comm.abort();
}

template<class Duration>
void abort_after(mpi3::communicator& comm, Duration d) {
    auto const t0 = mpi3::wall_time();
    while((mpi3::wall_time() - t0) < d) {}
    std::cout<< "not essential message: aborting from rank "<< comm.rank() <<" after others join"<<std::endl;
    comm.abort();
}

template<class Duration>
void timedout_abort(mpi3::communicator& comm, Duration d) {
    auto rbarrier = comm.ibarrier();
    auto const t0 = mpi3::wall_time();
    while(not rbarrier.completed() and (mpi3::wall_time() - t0) < d) {}

    if(not rbarrier.completed()) {
        std::cout<< "non essential message: aborting from rank "<< comm.rank() <<" after timeout"<<std::endl;
    } else {
        std::cout<< "not essential message: aborting from rank "<< comm.rank() <<" after others join"<<std::endl;
    }

    comm.abort();
}

auto mpi3::main(int /*argc*/, char** /*argv*/, mpi3::communicator world) -> int try {

// unconditional abort
#if 0
    // (-) prints only one message, (+) program terminates immediately
    try {
        uniform_fail(world);
    } catch(std::logic_error&) {
        unconditional_abort(world);
    }
#endif

#if 0
    // (-) prints only one message, (+) program terminates immediately
    try {
        nonuniform_fail(world);  // non-uniform error
    } catch(std::logic_error& e) {
        unconditional_abort(world);
    }
#endif

// barriered abort
#if 0
    // (-) prints all available messages, (+) program terminates immediately
    try {
        uniform_fail(world);
    } catch(std::logic_error& e) {
        barriered_abort(world);
    }
#endif

#if 0
    // (+) prints all available messages, (-) it DEADLOCKS (here or later)
    try {
        nonuniform_fail(world);
    } catch(std::logic_error& e) {
        barriered_abort(world);
    }
#endif

// abort after hard sleep
#if 0
    // (+) prints all available messages, (~) program terminates after hard timeout
    try {
        uniform_fail(world);
    } catch(std::logic_error&) {
        using namespace std::chrono_literals;
        abort_after(world, 20s);
    }
#endif

#if 0
    // (+) prints all available messages, (~) program terminates after hard timeout
    try {
        nonuniform_fail(world);  // non-uniform error
    } catch(std::logic_error&) {
        using namespace std::chrono_literals;
        abort_after(world, 20s);
    }
#endif

// timedout_abort
#if 1
    // (+) prints all available messages, (+) program terminates very quickly
    try {
        uniform_fail(world);
    } catch(std::logic_error&) {
        using namespace std::chrono_literals;
        timedout_abort(world, 20s);
    }
#endif

#if 0
    // (+) prints all available messages, (~) program terminates after timeout
    try {
        nonuniform_fail(world);
    } catch(std::logic_error&) {
        using namespace std::chrono_literals;
        timedout_abort(world, 20s);
    }
#endif

    // I am putting a collective here to produce a deadlock if some abort strategy leaks processes
    {
        int n = 1;
        int total;
        world.all_reduce_n(&n, 1, &total);
        assert(total == world.size());
    }

    return 0;
}

PDoakORNL commented 2 years ago

I believe 1 is only relevant for the "ensemble" feature, which I feel @prckent can speak to better. This basically allows one to run multiple input files at once with one invocation.

The application splits the communicator and gives each MPI group one input file. Each group does its own logging and would be subject to uniform errors within the group. So a barrier just across that group's comm is what makes sense here.
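
Roughly, the ensemble layout is something like this (a sketch only: num_inputs, inputs, and run_input are illustrative names, and I'm assuming mpi3's split interface):

// each MPI group gets one input file; errors are "uniform" only within the group
int const color = world.rank() % num_inputs;
mpi3::communicator group = world.split(color, world.rank());
try {
  run_input(group, inputs[color]);
} catch (const UniformCommunicateError& uce) {
  barrierAndAbortWithTimeOut(uce.what(), timeout);  // the barrier here should be across group, not world
}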

As far as 2 goes, I thought the "contract" is that you either call MPI_Finalize on all ranks or call MPI_Abort from some rank. Sure, I guess you could call terminate if you complete the ibarrier on "world", since then there won't be an MPI_Finalize that hangs. MPI_Finalize is collective, so it would hang, right? But if you complete the ibarrier on a "split" comm and terminate, that is going to hang the other ranks when they hit finalize, right? I can see how you might want to just throw again, but I think that needs to not be the UniformCommunicateError, since it won't be above that context.

correaa commented 2 years ago

But if you complete the ibarrier on a "split" comm and terminate, that is going to hang the other ranks when they hit finalize, right?

yes, that is why I infer that if any of the "handler" functions knows about the communicator, and the caller can guess it by any means, then the advantage is exploited better.

A default could be world:

template<class Duration>
void timedout_handler(Duration d, mpi3::communicator& comm = mpi3::environment::get_world_instance() ) {

I can see how you might want to just throw again, but I think that needs to not be the UniformCommunicateError, since it won't be above that context.

yes, that is why I experimentally wrote mpi3_timedout_throw, which keeps rethrowing in the hope that something above (a larger communicator group) can also catch and wait.

What might be interesting is that, in the case of timedout_throw, it is possible that the best default is MPI_COMM_SELF.

template<class Duration>
void timedout_throw(Duration d, mpi3::communicator& comm = mpi3::environment::get_self_instance() ) {

and catching one by one in larger contexts (communicators), as in the sketch below.
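
Here is a hedged sketch of the shape I have in mind (the body of mpi3_timedout_throw isn't shown above, so this is only a guess; run_task and split_comm are illustrative names):

template<class Duration>
void timedout_throw(Duration d, mpi3::communicator& comm = mpi3::environment::get_self_instance()) {
    // call only from inside a catch block: wait (bounded) for the others in this
    // communicator, then rethrow the active exception to the next layer up
    auto rbarrier = comm.ibarrier();
    auto const t0 = mpi3::wall_time();
    while(not rbarrier.completed() and (mpi3::wall_time() - t0) < d) {}
    throw;
}

using namespace std::chrono_literals;
try {
    try {
        run_task(split_comm);            // per-group work
    } catch(std::exception const&) {
        timedout_throw(5s, split_comm);  // wait for the group, then escalate
    }
} catch(std::exception const&) {
    timedout_throw(5s, world);           // wait at the world level, then escape main
}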

If every layer waits for 5 seconds, for example, that is not so bad, unless you have divided the communicator many, many times (unlikely). The number of layers is about log(WORLD_SIZE), and as we all know, log ~ 1. And if nobody is catching, well, one gets a termination because some or all processes escaped main.

I don't know. I must admit that I have a hard time thinking about it any further without experimenting more. This is mind-boggling to me.

prckent commented 2 years ago

The ensemble feature is a useful one that we intend to keep and might expand, e.g. to run multiple twists simultaneously "internally". Ideally we would have a choice in how aborts are handled: whether to keep the rest of the ensemble running or to fully abort the entire application.

correaa commented 2 years ago

yes, exactly. That rules out any solution that involves calling MPI_Abort explicitly, but maybe not something else that is still synchronized, like a timed-out "rethrowing", which I find elegant for some reason.

Perhaps the solution is as simple as surrounding the explicit task of a split communicator with a try/catch block, if one decides to tolerate the failure of one element of the "ensemble".

That is why I developed the wrapper in the first place: to really try different high-level patterns that are very hard to express with a C-style interface and C-style usage.

correaa commented 2 years ago

In more boring news, ibarrier() is implemented now.

It is used like this:

    mpi3::request req = world.ibarrier();

    using namespace std::literals::chrono_literals;
    std::this_thread::sleep_for(2s);

    req.wait();
    assert( req.completed() );

Note that the request doesn't need to be destroyed explicitly; the resource is handled automatically. This is a model for all asynchronous calls in bmpi3, in case you need others.

For what it's worth, this in turn allows implementing @PDoakORNL's function. The only change I made is that I think the function has the potential to be more useful if the communicator is passed or guessed.

template<class Duration>
void timedout_abort(Duration d, mpi3::communicator& comm = boost::mpi3::environment::get_world_instance() ) {
    auto rbarrier = comm.ibarrier();

    auto const t0 = mpi3::wall_time();
    while(not rbarrier.completed() and (mpi3::wall_time() - t0) < d) {}

    comm.abort();
}

I was about to add this feature to the convenience mpi3::main function, but I am not sure it covers all the needs, even if I fix the duration to something like 5 seconds.
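
For reference, one hypothetical shape it could take (not committed code, just the idea of the library-side wrapper catching whatever escapes the user's mpi3::main):

#include <iostream>

int main(int argc, char** argv) try {
    mpi3::environment env{argc, argv};
    return mpi3::main(argc, argv, env.world());
} catch(std::exception const& e) {
    std::cerr << e.what() << '\n';  // report the local error before the timed-out abort
    using namespace std::chrono_literals;
    timedout_abort(5s);  // defaults to the world communicator, as defined above
    return 1;
}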

Please let me know if you need something else.

correaa commented 2 years ago

As a final observation, I think a lot of headaches regarding unsynchronized or synchronized errors could be handled if barriers were seen as resources.

class barrier_guard {
   mpi3::communicator& comm_;
public:
   explicit barrier_guard(mpi3::communicator& comm) : comm_{comm} {}
   barrier_guard(barrier_guard const&) = delete;
   ~barrier_guard() { comm_.barrier(); }
};

{
   barrier_guard guard{comm};  // must be a named object; a bare temporary would barrier immediately

   // ... failing or non-failing code will end in a barrier, but the barrier is never skipped

   // barrier happens here, normally OR on exception unwinding
}

If this is correct, the logic of barriers might have been inverted since MPI 1.0, and a good barrier is a property of a scope.

Interestingly enough, maybe not using barriers at all is better than using them in the wrong place (in the context of errors). Barrier seems to be an RAII-hostile function.