My guess is that current_view_data_->logical_cartesian_size_ in CpGrid.hpp is faulty in that case. Either the pointer to current_view_data_ is not set correctly or the logical_cartesian_size_ is incorrect.
Maybe this issue should be moved to ewoms as the stack trace indicates that the segfault happens in CartesianIndexMapper? Or did I miss some clue why the error is supposed to be in opm-grid?
BTW: Are you using Zoltan? I never got into a situation where the cells had to be moved manually after the partitioner did its work. Not sure how to replicate that.
Maybe this issue should be moved to ewoms [..]
maybe, but it rather seems to be a grid issue. (CartesianIndexMapper.hpp is not a file that is located in eWoms.)
BTW: Are you using Zoltan? I never got into a situation where the cells had to be moved manually after the partitioner did its work.
I had to compile trilinos manually because I'm using openSuse tumbleweed on that machine and this distribution does not seem to ship a suitable package. I used the trilinos master from yesterday for that.
That said, I possibly screwed up the compile options, since HAVE_ZOLTAN does not get defined in config.h. Anyway, I think this should not happen even without ZOLTAN being available; if it does, that would mean we just got "lucky" with ZOLTAN so far. (At least it should produce a meaningful error message instead of a segmentation fault.)
I just recompiled with flags that cause HAVE_ZOLTAN to be defined and the same thing happens with 16 processes:
and@inferius:~/src/opm-simulators/build-cmake > grep HAVE_ZOLTAN config.h
#define HAVE_ZOLTAN 1
and@inferius:~/src/opm-simulators/build-cmake > mpirun -np 16 ./bin/flow ~/src/opm-data/norne/NORNE_ATW2013 --output-dir=.
**********************************************************************
* *
* This is flow 2018.10-pre *
* *
* Flow is a simulator for fully implicit three-phase black-oil flow, *
* including solvent and polymer capabilities. *
* For more information, see https://opm-project.org *
* *
**********************************************************************
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 41899 on node inferius exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I used the trilinos master from yesterday for that.
maybe that is the problem and you should use a released version? I doubt that we have the manpower to support the git master of dependencies. (I also doubt that we support all released Zoltan versions, though.) Anyway, if this is reproducible with a released version, I might find the time to look into it.
Tried to reproduce this, but failed. At least with Zoltan v3.6 (compiled with scotch support) it works for me on stable Debian 8.
same on my ubuntu 16.04, using a slightly older zoltan i guess (trilinos 12.4.2 from debian science). i tried various process counts between 9 and 64 with no issues.
@andlaus the packaging for rhel i use is an altered version of opensuse packaging so there should be a spec out there on the interwebs. probably not for latest but we don't need the most shiny version here.
hmmpf: I checked out trilinos' trilinos-release-12-12-branch branch and it still ran into the same problem. Actually, if zoltan is enabled, flow now always crashes for parallel runs when using the alternative build system, and the default build system does not even manage to detect MPI on this system (the same flags work fine with the default BS on my system with an older distribution).
I have no idea how you could reproduce this, but if you send me your SSH pubkey, I'll give you access to the machine.
This segmentation fault seems to happen very early in the simulation. I would assume that a debugger would show some weird place for it. Maybe this is a hiccup concerning incompatible ABIs. Whenever I had something like this at such an early stage, it was caused by something like some libs being built with a different compiler or incompatible options, etc. But this is pure guesswork.
I also am puzzled by this. I doubt that it is due to incompatible ABIs because all of dune, OPM and (supposedly) the system's libraries have been compiled using GCC 8.2.1. Also, since this only happens with CpGrid (ALUGrid, YaspGrid, etc. are fine for the simple test problems which I have available), I'm pretty confident that the problem is caused by it. Note: I also tried this with libstdc++'s debug mode, but this did not work even in sequential mode:
> gdb --args ./bin/flow --output-dir=. ~/src/opm-data/norne/NORNE_ATW2013.DATA
[...]
(gdb) r
[...]
Thread 1 "flow" received signal SIGSEGV, Segmentation fault.
0x00007ffff4fb68c4 in std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > >::_M_fill_insert(__gnu_cxx::__normal_iterator<boost::sub_match<char const*>*, std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > > >, unsigned long, boost::sub_match<char const*> const&) ()
from /usr/lib64/libboost_regex.so.1.68.0
(gdb) bt
#0 0x00007ffff4fb68c4 in std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > >::_M_fill_insert(__gnu_cxx::__normal_iterator<boost::sub_match<char const*>*, std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > > >, unsigned long, boost::sub_match<char const*> const&)
() from /usr/lib64/libboost_regex.so.1.68.0
#1 0x00007ffff4fc4f02 in boost::re_detail_106800::perl_matcher<char const*, std::allocator<boost::sub_match<char const*> >, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::match_imp() () from /usr/lib64/libboost_regex.so.1.68.0
#2 0x0000000002126fd9 in boost::regex_match<char const*, std::allocator<boost::sub_match<char const*> >, char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >
(first=0x7fffffffd310 "FIPNUM", last=0x7fffffffd316 "", m=..., e=..., flags=boost::regex_constants::match_any) at /usr/include/boost/regex/v4/regex_match.hpp:50
#3 0x0000000002125d3f in boost::regex_match<char const*, char, boost::regex_traits<char, boost::cpp_regex_traits<char> > > (first=0x7fffffffd310 "FIPNUM",
last=0x7fffffffd316 "", e=..., flags=boost::regex_constants::match_default) at /usr/include/boost/regex/v4/regex_match.hpp:58
#4 0x000000000212214d in Opm::ParserKeyword::matches (this=0x411fb70, name=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/ParserKeyword.cpp:564
#5 0x00000000020fdd0d in Opm::Parser::matchingKeyword (this=0x7fffffffdb00, name=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:691
#6 0x00000000020fde31 in Opm::Parser::isRecognizedKeyword (this=0x7fffffffdb00, name=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:708
#7 0x00000000020fbfd5 in Opm::(anonymous namespace)::createRawKeyword (kw=..., parserState=..., parser=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:417
#8 0x00000000020fc859 in Opm::(anonymous namespace)::tryParseKeyword (parserState=..., parser=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:503
#9 0x00000000020fcbaa in Opm::(anonymous namespace)::parseState (parserState=..., parser=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:550
#10 0x00000000020fdac0 in Opm::Parser::parseFile (this=0x7fffffffdb00, dataFileName="/home/guest/src/opm-data/norne/NORNE_ATW2013.DATA", parseContext=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:669
#11 0x0000000001e7fecf in main (argc=3, argv=0x7fffffffe0f8) at /home/guest/src/opm-simulators/build-cmake/fake-src/examples/flow.cpp:154
(gdb)
I guess this is because I simply used the distribution package of boost, i.e., I did not go through the nightmare of recompiling boost in libstdc++'s debug mode myself.
I somehow missed what system you are using.
There seems to be a rather unfortunate bug when using certain versions of OpenMPI in combination with Zoltan. For example, this is the case for Ubuntu LTS 18.04 (OpenMPI 2.1.1-8). The problem is not there in Debian, which uses other versions (stable: 2.0.2-2, testing: 3.1.2-6).
So you might want to try switching to MPICH (that worked for me on Ubuntu LTS 18.04); you need to compile your own versions of all the libraries that use MPI (DUNE, zoltan, etc.).
Here is the sketch of what needs to be done:
Deinstall all DUNE packages (as they and their dependencies are linked with OpenMPI). This is needed to prevent mixing. Of course one could make sure that those are not used by building them in the source tree. But that seemed rather complicated and fragile.
Download and compile Zoltan with mpich (I used version 2.83 from their download page):
./configure MPI_CC=/usr/bin/mpicc.mpich MPI_CXX=/usr/bin/mpicxx.mpich MPI_FC=/usr/bin/mpifort.mpich CC=/usr/bin/mpicc.mpich CXX=/usr/bin/mpicxx.mpich FC=/usr/bin/mpifort.mpich --prefix=$HOME/opt/zoltan-2.83-mpich --enable-mpi
make everything
make install
Compile DUNE and OPM by explicitly requesting MPICH and our version of Zoltan. Here is my CMake options file:
set(USE_MPI ON CACHE STRING "Use mpi")
set(BUILD_TESTING OFF CACHE BOOL "Build tests")
set(CMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY 1 CACHE BOOL "" FORCE)
set(BUILD_ECL_SUMMARY ON CACHE BOOL "Build summary.x")
set(BUILD_APPLICATIONS OFF CACHE BOOL "Build applications")
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type to use")
set(CMAKE_INSTALL_PREFIX "$HOME/opt/opm/" CACHE PATH "installation directory")
set(ZOLTAN_ROOT "$HOME/opt/zoltan-3.83-mpich" CACHE STRING "Path to ZOLTAN")
set(MPI_C_COMPILER /usr/bin/mpicc.mpich CACHE STRING "gcc")
set(MPI_CXX_COMPILER /usr/bin/mpicxx.mpich CACHE STRING "gcc")
set(MPI_Fortran_COMPILER /usr/bin/mpifort.mpich CACHE STRING "gcc")
Use mpich to run: mpirun.mpich instead of mpirun.
You can make this easier by switching the MPI default to mpich using update-alternatives --config mpi. Then you do not need to set the compilers explicitly.
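To verify at runtime which MPI implementation a binary actually ended up using, a tiny check like the following can help (just a sketch; MPI_Get_library_version is an MPI-3 call, so it is available in MPICH 3.x and recent OpenMPI):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    // Reports e.g. "MPICH Version: 3.2.1 ..." or "Open MPI v2.1.1 ..."
    MPI_Get_library_version(version, &len);
    std::printf("%s\n", version);
    MPI_Finalize();
    return 0;
}

Compile it with the same wrapper used for flow (mpicxx.mpich in this setup) and launch it with mpirun.mpich to confirm that the switch took effect.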
the mpi version used there is:
> mpirun --version
mpirun (Open MPI) 1.10.7.0.5e373bf1fd
Report bugs to http://www.open-mpi.org/community/help/
(OpenMPI 2.x somehow did not work, but I forgot why.) ZOLTAN is self-compiled from the trilinos master at about the time this issue was opened:
~/src/trilinos|master > git log --oneline | head -n1
802b9e46b0 Merge Pull Request #3434 from prwolfe/Trilinos/add_RIG
Anyway, if this bug affects a common configuration (Ubuntu >= 18.04), I think that it needs a work-around even if it is not our fault.
Would you please try with MPICH and see whether this really fixes your problem?
that's quite a bit of effort because I would need to recompile everything and the kitchen sink. I think the best way out is to call ZOLTAN without MPI awareness. Also, the stuff in ZoltanGraphFunctions.hpp seems to be suspicious?!
BTW That seems like a rather old version (nearly as old as the one on my Debian jessie machine which uses 1.6.5). But that version at least works. Again: what system is this?
Please clarify what you think is suspicious in ZoltanGraphFunctions.hpp?
the contents of the file opm/grid/common/ZoltanGraphFunctions.hpp: at least it messes around with the HAVE_MPI macro.
BTW That seems like a rather old version (nearly as old as the one on my Debian jessie machine which uses 1.6.5). But that version at least works. Again: what system is this?
do you mean openMPI? tumbleweed provides openMPI 1.10.7, 2.1.4 and 3.1.1. IIRC there were some compilation issues with Dune or OPM for 2.1 and 3.1, but maybe I did something wrong. (And before you ask: there is currently only one version installed on the system.)
Maybe we should both reread your backtrace in gdb above and notice that the segmentation fault happens in the parser. So this might be totally unrelated to zoltan.
IIRC there were some compilation issues with Dune or OPM for 2.1 and 3.1
Maybe this got fixed in newer versions of DUNE/OPM? Might be worth a try.
Concerning the duplicate defines, maybe we could use:
#pragma push_macro("HAVE_MPI")
#undef HAVE_MPI                 // drop OPM/DUNE's definition before pulling in the header
#include "zoltan.h"             // zoltan.h defines HAVE_MPI in its own, special way
#pragma pop_macro("HAVE_MPI")   // restore the original definition afterwards
Is that less messy?
this is not really what I meant: the point is rather that undefing HAVE_MPI before including a header is rather dangerous and might lead to unexpected results: the actual library might still use MPI (because it is using the traditional library/header approach), but the corresponding header thinks it is not available.
okay, with mpich-3.2.1 it does not work either, but the error message seems to be more useful:
mpirun -np 4 ./bin/flow --output-dir=. ../../opm-data/norne/NORNE_ATW2013.DATA
Reading deck file '../../opm-data/norne/NORNE_ATW2013.DATA'
**********************************************************************
* *
* This is flow 2019.04-pre *
* *
* Flow is a simulator for fully implicit three-phase black-oil flow, *
* including solvent and polymer capabilities. *
* For more information, see https://opm-project.org *
* *
**********************************************************************
Reading deck file '../../opm-data/norne/NORNE_ATW2013.DATAReading deck file '../../opm-data/norne/NORNE_ATW2013.DATA'
'
Reading deck file '../../opm-data/norne/NORNE_ATW2013.DATA'
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7ffe0ab6065c, rbuf=0x7ffe0ab60660, count=1, INVALID DATATYPE, op=0x31, comm=0x84000004) failed
MPI_Allreduce(852): Invalid datatype
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7ffc68a88f7c, rbuf=0x7ffc68a88f80, count=1, INVALID DATATYPE, op=0x31, comm=0x84000002) failed
MPI_Allreduce(852): Invalid datatype
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7ffddf47f20c, rbuf=0x7ffddf47f210, count=1, INVALID DATATYPE, op=0x31, comm=0x84000002) failed
MPI_Allreduce(852): Invalid datatype
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7fffba0738ec, rbuf=0x7fffba0738f0, count=1, INVALID DATATYPE, op=0x31, comm=0x84000002) failed
MPI_Allreduce(852): Invalid datatype
the point is rather that undefing HAVE_MPI before including a header is rather dangerous and might lead to unexpected results
Well in this case the problem is that zoltan.h defines HAVE_MPI in its own special way which would interfere with OPM/DUNE. Maybe the source code comment is not clear enough.
Anyway, this could be moved to the *.cpp file in this case which is definitely safer.
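A minimal sketch of what that could look like, assuming the zoltan.h include can be moved from ZoltanGraphFunctions.hpp into the corresponding .cpp file (the file placement is an assumption here):

// ZoltanGraphFunctions.cpp -- keep the zoltan.h include out of the header so the
// HAVE_MPI juggling stays local to this translation unit.
#pragma push_macro("HAVE_MPI")
#undef HAVE_MPI                 // avoid a clash: zoltan.h sets HAVE_MPI its own way
#include <zoltan.h>
#pragma pop_macro("HAVE_MPI")   // restore the HAVE_MPI definition used by OPM/DUNE

That way the rest of the code base never sees zoltan.h's notion of HAVE_MPI.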
Thanks for testing. This is a great help.
Now the only thing missing is a backtrace. Maybe you could exchange the error handler with a custom throwing one? Like defined here and used here
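Roughly, such a throwing handler could look like this (only a sketch; the names are made up, and the "here" links above point to the actual implementation):

#include <mpi.h>
#include <stdexcept>
#include <string>

// Convert MPI errors into C++ exceptions so the failure can be inspected in a
// debugger with an intact stack instead of the process being killed.
void throwingMpiErrorHandler(MPI_Comm* /*comm*/, int* errCode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(*errCode, msg, &len);
    throw std::runtime_error("MPI error: " + std::string(msg, len));
}

// call this right after MPI_Init():
void installThrowingErrorHandler()
{
    MPI_Errhandler handler;
    MPI_Comm_create_errhandler(throwingMpiErrorHandler, &handler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
}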
okay, I tried to get a backtrace, but with MPICH this is not trivial because it kills all child processes once it encounters an error (openMPI sends SIGABRT). this means my debugger gets killed as soon as the error is encountered. I'll step into this manually, but this may take some time...
Maybe start valgrind in parallel to see if there is any memory violation.
we did this already (only with openmpi): nothing. is there a way to tell MPICH not to send SIGKILL to all its child processes on encountering an error?
okay, the error with mpich happens here: https://github.com/OPM/opm-grid/blob/master/opm/grid/common/ZoltanPartition.cpp#L48 . inside zoltan, the error occurs during a call to MPI_Comm_dup() at zz_struct.c:128. As far as I can see, the only obvious thing that can go wrong from the OPM side here is that the cc object is somehow screwed up.
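A stripped-down, stand-alone reproducer along these lines (hypothetical, not part of OPM) exercises the same call path and takes flow out of the equation:

#include <mpi.h>
#include <zoltan.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    float version = 0.0f;
    Zoltan_Initialize(argc, argv, &version);

    // Hand MPI_COMM_WORLD to Zoltan just like ZoltanPartition.cpp does; internally
    // this ends up in MPI_Comm_dup() in zz_struct.c. If Zoltan and this file
    // disagree about what an MPI_Comm is, this is where it should blow up.
    struct Zoltan_Struct* zz = Zoltan_Create(MPI_COMM_WORLD);
    std::printf("Zoltan_Create() returned %p (Zoltan version %.2f)\n",
                static_cast<void*>(zz), version);

    Zoltan_Destroy(&zz);
    MPI_Finalize();
    return 0;
}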
Well, that is a place that should just work. When calling this method, cc should be converted to an MPI_Comm, in particular to MPI_COMM_WORLD. Errors at this place are really beyond my imagination, segmentation faults in particular. I have no idea.
note that I don't get a segmentation fault with mpich, but rather the error above. I'm equally puzzled what the relation between the error message and MPI_Comm_dup() is, though.
Something really fishy is going on with ZOLTAN and MPI: if I add a line like mc = MPI_COMM_WORLD; in e.g. ZoltanPartition.cpp, I get:
(gdb) print mc
$1 = 1140850688
if I do the same inside zoltan's 'zz_struct.c', the result is
(gdb) print mc
$2 = -100
also, passing MPI_COMM_NULL directly as communicator to Zoltan_Create makes MPI really unhappy even though there are a few ifs within that function that explicitly check for this. maybe we should think about switching to a different graph partitioner.
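One way to pin down whether the two translation units really see different MPI definitions would be a two-file check like this (a hypothetical diagnostic, assuming zoltan.h pulls in an MPI_COMM_WORLD definition of its own when Zoltan was configured without the real MPI):

// comm_via_mpi.cpp
#include <mpi.h>
long long commWorldViaMpiHeader()
{ return (long long)MPI_COMM_WORLD; }   // integer handle for MPICH, pointer for openMPI

// comm_via_zoltan.cpp
#include <zoltan.h>
long long commWorldViaZoltanHeader()
{ return (long long)MPI_COMM_WORLD; }

// main.cpp
#include <cstdio>
long long commWorldViaMpiHeader();
long long commWorldViaZoltanHeader();

int main()
{
    // If the two values differ, Zoltan's headers were generated against a
    // different MPI (or against its bundled serial MPI stub).
    std::printf("MPI_COMM_WORLD via <mpi.h>:    %lld\n", commWorldViaMpiHeader());
    std::printf("MPI_COMM_WORLD via <zoltan.h>: %lld\n", commWorldViaZoltanHeader());
    return 0;
}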
I have three ideas:
Concerning 3: Maybe you could try moving the Zoltan_Initialize call from ZoltanPartition.cpp to FlowMainEbos and use it to initialize MPI.
- Zoltan is compiled without MPI support. There is a configure switch --enable-mpi and I am not sure what the default is
indeed: there is a TPL_ENABLE_MPI cmake flag in trilinos which defaults to OFF and I did not enable it so far. I recompiled trilinos with ON, and flow now seems to work fine with MPI. What I don't get is why Zoltan tries to call into MPI anyway if that flag is set to OFF. Also, it smells a lot like this is the problem with the Ubuntu 18.04 Zoltan packages.
I just checked debian/rules in the source package of Ubuntu and it seems to be configured with MPI as they are passing
-DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_Fortran_COMPILER=mpif90 -DTPL_ENABLE_MPI:BOOL=ON
to cmake.
Anyway, I am glad that this is at least sorted out for your platform now.
some more data points.
i have:
1) repacked the debs - same issue
2) built directly on my machine with the same flags - same issue
3) built the known-working 12.4 (ie xenial version) on my machine using the new deb build flags - same issue
4) built the known-working version with the known-working old deb flags - same issue
5) built 12.12 using vanilla flags (just enabling MPI and zoltan) - same issue
everything points to openmpi being broken in ubuntu 18.04 (i did not try mpich, but @blattms reported earlier that it works)
it's also possible that the trilinos debian package is built against mpich while flow uses openmpi. this would also explain why @blattms reported success with mpich. did you check this?
I did check it using ldd.
I did check it using ldd.
did you check that the zoltan debian package uses mpich or did you check that it does not use it? Also, ldd only lists the libraries which are dynamically loaded at runtime, but does not say anything about which implementation's header files were used!? (IOW, since the MPI library is linked using -lmpi, both implementations could be loaded but the headers could mismatch.)
to check which implementation the trilinos debian package uses you need to look at its specfile.
@akva2 did you try to compile the trilinos master in full-manual mode (without building a package)? the cmake command which I currently use for this is:
cmake -DBUILD_SHARED_LIBS=ON -DTPL_ENABLE_MPI=ON -DTPL_ENABLE_X11=OFF -DTPL_ENABLE_Matio=OFF -DTPL_ENABLE_LAPACK=OFF -DTrilinos_ENABLE_ALL_PACKAGES=ON -DTPL_ENABLE_BLAS=OFF -DCMAKE_INSTALL_PREFIX=$(pwd)/../trilinos-install ../trilinos
i build in pbuilder, i have full control of what's around. there is no mpich on there.
ok, how about the trilinos master?
no, have not tested master, i will but in the middle of stuff right now.
no change with trilinos master.
darn! does it work with a self-compiled openmpi?
openmpi in bionic is indeed what's broken.
DISCLAIMER/NOTE: i had to do some hackery to avoid the mpi-default-dev|bin usage since those would not be happy with my backported openmpi packages. so it's not a 100% pure rebuild as such.
yay, no! I've noticed that there are two openMPI packages for 18.04: libopenmpi and libopenmpi2. Maybe it works with the other one? (Even if it works, this probably won't be a real solution because trilinos depends on the wrong one :()
if I start Norne with flow on more than 8 processes, I get a segmentation fault on some ranks. the valgrind output for one of these is the following:
[...]
this seems to be a bug in the grid because (1) it works for e.g. 8 processes, and (2) the error occurs in CartesianIndexMapper.hpp. Debugging this with gdb is challenging because calling cartesianDimensions() does not work: