ecmwf / atlas

A library for numerical weather prediction and climate modelling
https://sites.ecmwf.int/docs/atlas
Apache License 2.0
112 stars 42 forks source link

Outermost mpi::Scope does not restore prior communicator #186

Open fmahebert opened 6 months ago

fmahebert commented 6 months ago

What happened?

The outermost atlas mpi::Scope does not restore the prior default eckit MPI communicator when it goes out of scope.

See code snippet below.

I suspect the problem is this:

Perhaps one solution would be to push the current eckit default communicator when creating an outermost Scope object? Basically, if size_ == 0, then push the current default THEN push the new name?

What are the steps to reproduce the bug?

This code...

eckit::mpi::addComm("comm1", 123);  // as if from fortran
eckit::mpi::addComm("comm2", 124);  // as if from fortran
eckit::mpi::setCommDefault("comm1");
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
{
  atlas::mpi::Scope scope("comm2");
  std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
}
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;

prints out...

default eckit comm = comm1
default eckit comm = comm2
default eckit comm = world

Whereas I would expect the last output to be default eckit comm = comm1.

Version

0.36

Platform (OS and architecture)

Linux x86_64

Relevant log output

No response

Accompanying data

No response

Organisation

JCSDA

fmahebert commented 6 months ago

Looping in also @MayeulDestouches who brought this to my attention.

wdeconinck commented 6 months ago

Thanks for bringing this to my attention. I will try to reproduce, create a test and fix asap.

wdeconinck commented 6 months ago

Following reproducer doesn't reproduce it for me... Could you double check?


#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);
    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };
    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");
    eckit::mpi::setCommDefault("comm1");

    print_default_comm_name();
    {
        atlas::mpi::Scope scope("comm2");
        print_default_comm_name();
    }
    print_default_comm_name();
    return 0;
}

Output:

default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1
fmahebert commented 6 months ago

@wdeconinck Thanks for taking a look. I confirm I see the same (= correct) output from your code snippet. Now to figure out the difference between this case and what JEDI is doing... I'll post an update when I know more.

MayeulDestouches commented 6 months ago

Thanks @fmahebert and @wdeconinck for investigating this. I've tried playing with Willem's code snippet, but I'm unable to reproduce the issue either. Apparently we've missed something on what triggers the error...

fmahebert commented 6 months ago

This code snippet more closely mirrors what's being done in the particular JEDI scenario that @MayeulDestouches and I were testing:

#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/grid/Grid.h"
#include "atlas/grid/Partitioner.h"
#include "atlas/functionspace/StructuredColumns.h"
#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);

    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };

    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");

    auto make_structured_columns = [] {
        const atlas::Grid grid("F15");
        const atlas::grid::Partitioner partitioner("equal_regions");
        const atlas::functionspace::StructuredColumns sc(grid, partitioner);
    };

    eckit::mpi::setCommDefault("comm1");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    eckit::mpi::setCommDefault("comm2");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    return 0;
}

And the output for this is,

default eckit comm = comm1
default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1

(where the last line was expected to be comm2...)

Something about constructing the StructuredColumns appears to be resetting the default MPI comm. We had guessed it was the mpi::Scope in the StructuredColumns constructor, but that may have been too hasty of a guess.

@wdeconinck Can you confirm whether you see this same output and/or whether that matches your expectation?

fmahebert commented 6 months ago

I'm wondering if for our use-case we might be better off using the mpi_comm config option for constructing the atlas objects, and not using the setCommDefault. @wdeconinck do you have a recommendation?

wdeconinck commented 6 months ago

Thank you for the new reproducer. I could simplify it a bit:

#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);
    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };
    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");

    auto make_structured_columns = [&] {
       atlas::mpi::Scope scope("comm2");
   };

    eckit::mpi::setCommDefault("comm1");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    eckit::mpi::setCommDefault("comm2");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    return 0;
}

This is definitely unintended behaviour which I want to fix.