Open fmahebert opened 6 months ago
Looping in also @MayeulDestouches who brought this to my attention.
Thanks for bringing this to my attention. I will try to reproduce, create a test and fix asap.
Following reproducer doesn't reproduce it for me... Could you double check?
#include <iostream>
#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"
#include "atlas/parallel/mpi/mpi.h"
int main(int argc, char* argv[]) {
eckit::Main::initialise(argc,argv);
int irank = eckit::mpi::comm().rank();
auto print_default_comm_name = [&] {
if (irank == 0 ) {
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
}
};
eckit::mpi::comm().split(irank, "comm1");
eckit::mpi::comm().split(irank, "comm2");
eckit::mpi::setCommDefault("comm1");
print_default_comm_name();
{
atlas::mpi::Scope scope("comm2");
print_default_comm_name();
}
print_default_comm_name();
return 0;
}
Output:
default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1
@wdeconinck Thanks for taking a look. I confirm I see the same (= correct) output from your code snippet. Now to figure out the difference between this case and what JEDI is doing... I'll post an update when I know more.
Thanks @fmahebert and @wdeconinck for investigating this. I've tried playing with Willem's code snippet, but I'm unable to reproduce the issue either. Apparently we've missed something on what triggers the error...
This code snippet more closely mirrors what's being done in the particular JEDI scenario that @MayeulDestouches and I were testing:
#include <iostream>
#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"
#include "atlas/grid/Grid.h"
#include "atlas/grid/Partitioner.h"
#include "atlas/functionspace/StructuredColumns.h"
#include "atlas/parallel/mpi/mpi.h"
int main(int argc, char* argv[]) {
eckit::Main::initialise(argc,argv);
int irank = eckit::mpi::comm().rank();
auto print_default_comm_name = [&] {
if (irank == 0 ) {
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
}
};
eckit::mpi::comm().split(irank, "comm1");
eckit::mpi::comm().split(irank, "comm2");
auto make_structured_columns = [] {
const atlas::Grid grid("F15");
const atlas::grid::Partitioner partitioner("equal_regions");
const atlas::functionspace::StructuredColumns sc(grid, partitioner);
};
eckit::mpi::setCommDefault("comm1");
print_default_comm_name();
make_structured_columns();
print_default_comm_name();
eckit::mpi::setCommDefault("comm2");
print_default_comm_name();
make_structured_columns();
print_default_comm_name();
return 0;
}
And the output for this is,
default eckit comm = comm1
default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1
(where the last line was expected to be comm2...)
Something about constructing the StructuredColumns appears to be resetting the default MPI comm. We had guessed it was the mpi::Scope
in the StructuredColumns constructor, but that may have been too hasty of a guess.
@wdeconinck Can you confirm whether you see this same output and/or whether that matches your expectation?
I'm wondering if for our use-case we might be better off using the mpi_comm
config option for constructing the atlas objects, and not using the setCommDefault
. @wdeconinck do you have a recommendation?
Thank you for the new reproducer. I could simplify it a bit:
#include <iostream>
#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"
#include "atlas/parallel/mpi/mpi.h"
int main(int argc, char* argv[]) {
eckit::Main::initialise(argc,argv);
int irank = eckit::mpi::comm().rank();
auto print_default_comm_name = [&] {
if (irank == 0 ) {
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
}
};
eckit::mpi::comm().split(irank, "comm1");
eckit::mpi::comm().split(irank, "comm2");
auto make_structured_columns = [&] {
atlas::mpi::Scope scope("comm2");
};
eckit::mpi::setCommDefault("comm1");
print_default_comm_name();
make_structured_columns();
print_default_comm_name();
eckit::mpi::setCommDefault("comm2");
print_default_comm_name();
make_structured_columns();
print_default_comm_name();
return 0;
}
This is definitely unintended behaviour which I want to fix.
What happened?
The outermost atlas
mpi::Scope
does not restore the prior default eckit MPI communicator when it goes out of scope.See code snippet below.
I suspect the problem is this:
Scope
destructor callsCommStack::pop
pop
decrementssize_
but then usesname()
, which evaluatesstack_[size_-1];
Scope
, this is no longer a valid indexingPerhaps one solution would be to
push
the current eckit default communicator when creating an outermostScope
object? Basically, ifsize_ == 0
, then push the current default THEN push the new name?What are the steps to reproduce the bug?
This code...
prints out...
Whereas I would expect the last output to be
default eckit comm = comm1
.Version
0.36
Platform (OS and architecture)
Linux x86_64
Relevant log output
No response
Accompanying data
No response
Organisation
JCSDA