My guess is that current_view_data_->logical_cartesian_size_ in CpGrid.hpp is faulty in that case. Either the pointer to current_view_data_ is not set correctly or the logical_cartesian_size_ is incorrect.
Maybe this issue should be moved to ewoms as the stack trace indicates that the segfault happens in CartesianIndexMapper? Or did I miss some clue why the error is supposed to be in opm-grid?
BTW: Are you using Zoltan? I never got into a situation where the cells had to be moved manually after the partitioner did its work. Not sure how to replicate that.
Maybe this issue should be moved to ewoms [..]
maybe, but it rather seems to be a grid issue. (CartesianIndexMapper.hpp is not a file that is located in eWoms.)
BTW: Are you using Zoltan? I never got into a situation where the cells had to be moved manually after the partitioner did its work.
I had to compile trilinos manually because I'm using openSuse tumbleweed on that machine and this distribution does not seem to ship a suitable package. I used the trilinos master from yesterday for that.
That said, I possibly screwed up the compile options, since HAVE_ZOLTAN does not get defined in config.h. Anyway, I think this should not happen even without ZOLTAN being available; if it does, that would mean we just got "lucky" with ZOLTAN so far. (At least it should produce a meaningful error message instead of a segmentation fault.)
I just recompiled with flags that cause HAVE_ZOLTAN to be defined and the same thing happens with 16 processes:
and@inferius:~/src/opm-simulators/build-cmake > grep HAVE_ZOLTAN config.h
#define HAVE_ZOLTAN 1
and@inferius:~/src/opm-simulators/build-cmake > mpirun -np 16 ./bin/flow ~/src/opm-data/norne/NORNE_ATW2013 --output-dir=.
**********************************************************************
* *
* This is flow 2018.10-pre *
* *
* Flow is a simulator for fully implicit three-phase black-oil flow, *
* including solvent and polymer capabilities. *
* For more information, see https://opm-project.org *
* *
**********************************************************************
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 41899 on node inferius exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I used the trilinos master from yesterday for that.
maybe that is the problem and you should use a released version? I doubt that we have the manpower to support the git master of dependencies. (I also doubt that we support all released Zoltan versions, though.) Anyway, if this is reproducible with a released version, I might find the time to look into it.
Tried to reproduce this, but failed. At least with Zoltan v3.6 (compiled with scotch support) it works for me on stable Debian 8.
same on my ubuntu 16.04, using a slightly older zoltan i guess (trilinos 12.4.2 from debian science). i tried various process counts between 9 and 64 with no issues.
@andlaus the packaging for rhel i use is an altered version of opensuse packaging so there should be a spec out there on the interwebs. probably not for latest but we don't need the most shiny version here.
hmmpf: I checked out trilinos' trilinos-release-12-12-branch branch and it still ran into the same problem. Actually, if zoltan is enabled, flow now always crashes for parallel runs when using the alternative build system, and the default build system does not even manage to detect MPI on this system (the same flags work fine with the default BS on my system with an older distribution).
I have no idea how you could reproduce this, but if you send me your SSH pubkey, I'll give you access to the machine.
This segmentation fault seems to happen very early in the simulation. I would assume that a debugger would show some weird place for it. Maybe this is a hiccup concerning incompatible ABIs. Whenever I had something like this at such an early stage, it was caused by something like some libs being built with a different compiler or incompatible options, etc. But this is pure guesswork.
I also am puzzled by this. I doubt that it is due to incompatible ABIs because all of dune, OPM and (supposedly) the system's libraries have been compiled using GCC 8.2.1. Also, since this only happens with CpGrid (ALUGrid, YaspGrid, etc. are fine for the simple test problems which I have available), I'm pretty confident that the problem is caused by it. Note: I also tried this with libstdc++'s debug mode, but this did not work even in sequential mode:
> gdb --args ./bin/flow --output-dir=. ~/src/opm-data/norne/NORNE_ATW2013.DATA
[...]
(gdb) r
[...]
Thread 1 "flow" received signal SIGSEGV, Segmentation fault.
0x00007ffff4fb68c4 in std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > >::_M_fill_insert(__gnu_cxx::__normal_iterator<boost::sub_match<char const*>*, std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > > >, unsigned long, boost::sub_match<char const*> const&) ()
from /usr/lib64/libboost_regex.so.1.68.0
(gdb) bt
#0 0x00007ffff4fb68c4 in std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > >::_M_fill_insert(__gnu_cxx::__normal_iterator<boost::sub_match<char const*>*, std::vector<boost::sub_match<char const*>, std::allocator<boost::sub_match<char const*> > > >, unsigned long, boost::sub_match<char const*> const&)
() from /usr/lib64/libboost_regex.so.1.68.0
#1 0x00007ffff4fc4f02 in boost::re_detail_106800::perl_matcher<char const*, std::allocator<boost::sub_match<char const*> >, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::match_imp() () from /usr/lib64/libboost_regex.so.1.68.0
#2 0x0000000002126fd9 in boost::regex_match<char const*, std::allocator<boost::sub_match<char const*> >, char, boost::regex_traits<char, boost::cpp_regex_traits<char> > >
(first=0x7fffffffd310 "FIPNUM", last=0x7fffffffd316 "", m=..., e=..., flags=boost::regex_constants::match_any) at /usr/include/boost/regex/v4/regex_match.hpp:50
#3 0x0000000002125d3f in boost::regex_match<char const*, char, boost::regex_traits<char, boost::cpp_regex_traits<char> > > (first=0x7fffffffd310 "FIPNUM",
last=0x7fffffffd316 "", e=..., flags=boost::regex_constants::match_default) at /usr/include/boost/regex/v4/regex_match.hpp:58
#4 0x000000000212214d in Opm::ParserKeyword::matches (this=0x411fb70, name=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/ParserKeyword.cpp:564
#5 0x00000000020fdd0d in Opm::Parser::matchingKeyword (this=0x7fffffffdb00, name=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:691
#6 0x00000000020fde31 in Opm::Parser::isRecognizedKeyword (this=0x7fffffffdb00, name=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:708
#7 0x00000000020fbfd5 in Opm::(anonymous namespace)::createRawKeyword (kw=..., parserState=..., parser=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:417
#8 0x00000000020fc859 in Opm::(anonymous namespace)::tryParseKeyword (parserState=..., parser=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:503
#9 0x00000000020fcbaa in Opm::(anonymous namespace)::parseState (parserState=..., parser=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:550
#10 0x00000000020fdac0 in Opm::Parser::parseFile (this=0x7fffffffdb00, dataFileName="/home/guest/src/opm-data/norne/NORNE_ATW2013.DATA", parseContext=...)
at /home/guest/src/opm-common/build-cmake/fake-src/src/opm/parser/eclipse/Parser/Parser.cpp:669
#11 0x0000000001e7fecf in main (argc=3, argv=0x7fffffffe0f8) at /home/guest/src/opm-simulators/build-cmake/fake-src/examples/flow.cpp:154
(gdb)
I guess this is because I simply used the distribution package of boost, i.e., I did not go through the nightmare of recompiling boost in libstdc++'s debug mode myself.
I somehow missed what system you are using.
There seems to be a rather unfortunate bug when using certain versions of OpenMPI in combination with Zoltan. For example, this is the case for Ubuntu LTS 18.04 (OpenMPI 2.1.1-8). The problem is not there in Debian, which uses other versions (stable: 2.0.2-2, testing: 3.1.2-6).
So you might want to try switching to MPICH (that worked for me on Ubuntu LTS 18.04); you need to compile your own versions of all the libraries that use MPI (DUNE, zoltan, etc.).
Here is the sketch of what needs to be done:
Deinstall all DUNE packages (as they and their dependencies are linked with OpenMPI). This is needed to prevent mixing. Of course one could make sure that those are not used by building them in the source tree. But that seemed rather complicated and fragile.
Download and compile Zoltan with mpich (I used version 2.83 from their download page):
./configure MPI_CC=/usr/bin/mpicc.mpich MPI_CXX=/usr/bin/mpicxx.mpich MPI_FC=/usr/bin/mpifort.mpich CC=/usr/bin/mpicc.mpich CXX=/usr/bin/mpicxx.mpich FC=/usr/bin/mpifort.mpich --prefix=$HOME/opt/zoltan-2.83-mpich --enable-mpi
make everything
make install
Compile DUNE and OPM by explicitly requesting MPICH and our version of Zoltan. Here is my CMake options file:
set(USE_MPI ON CACHE STRING "Use mpi")
set(BUILD_TESTING OFF CACHE BOOL "Build tests")
set(CMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY 1 CACHE BOOL "" FORCE)
set(BUILD_ECL_SUMMARY ON CACHE BOOL "Build summary.x")
set(BUILD_APPLICATIONS OFF CACHE BOOL "Build applications")
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type to use")
set(CMAKE_INSTALL_PREFIX "$HOME/opt/opm/" CACHE PATH "installation directory")
set(ZOLTAN_ROOT "$HOME/opt/zoltan-3.83-mpich" CACHE STRING "Path to ZOLTAN")
set(MPI_C_COMPILER /usr/bin/mpicc.mpich CACHE STRING "gcc")
set(MPI_CXX_COMPILER /usr/bin/mpicxx.mpich CACHE STRING "gcc")
set(MPI_Fortran_COMPILER /usr/bin/mpifort.mpich CACHE STRING "gcc")
Use mpich to run: mpirun.mpich instead of mpirun.
You can make this easier by switching the MPI default to mpich using update-alternatives --config mpi. Then you do not need to set the compilers explicitly.
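To verify at runtime which MPI implementation a binary actually ended up using, a tiny check like the following can help (just a sketch; MPI_Get_library_version is an MPI-3 call, so it is available in MPICH 3.x and recent OpenMPI):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    // Reports e.g. "MPICH Version: 3.2.1 ..." or "Open MPI v2.1.1 ..."
    MPI_Get_library_version(version, &len);
    std::printf("%s\n", version);
    MPI_Finalize();
    return 0;
}

Compile it with the same wrapper used for flow (mpicxx.mpich in this setup) and launch it with mpirun.mpich to confirm that the switch took effect.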
the mpi version used there is:
> mpirun --version
mpirun (Open MPI) 1.10.7.0.5e373bf1fd
Report bugs to http://www.open-mpi.org/community/help/
(OpenMPI 2.x somehow did not work, but I forgot why.) ZOLTAN is self-compiled from the trilinos master at about the time this issue was opened:
~/src/trilinos|master > git log --oneline | head -n1
802b9e46b0 Merge Pull Request #3434 from prwolfe/Trilinos/add_RIG
Anyway, if this bug affects a common configuration (Ubuntu >= 18.04), I think that it needs a work-around even if it is not our fault.
Would you please try with MPICH and see whether this really fixes your problem?
that's quite a bit of effort because I would need to recompile everything and the kitchen sink. I think the best way out is to call ZOLTAN without MPI awareness. Also, the stuff in ZoltanGraphFunctions.hpp seems to be suspicious?!
BTW That seems like a rather old version (nearly as old as the one on my Debian jessie machine which uses 1.6.5). But that version at least works. Again: what system is this?
Please clarify what you think is suspicious in ZoltanGraphFunctions.hpp?
the contents of the file opm/grid/common/ZoltanGraphFunctions.hpp: at least it messes around with the HAVE_MPI macro.
BTW That seems like a rather old version (nearly as old as the one on my Debian jessie machine which uses 1.6.5). But that version at least works. Again: what system is this?
do you mean openMPI? tumbleweed provides openMPI 1.10.7, 2.1.4 and 3.1.1. IIRC there were some compilation issues with Dune or OPM for 2.1 and 3.1, but maybe I did something wrong. (And before you ask: there is currently only one version installed on the system.)
Maybe we should both reread your backtrace in gdb above and notice that the segmentation fault happens in the parser. So this might be totally unrelated to zoltan.
IIRC there were some compilation issues with Dune or OPM for 2.1 and 3.1
Maybe this got fixed in newer versions of DUNE/OPM? Might be worth a try.
Concerning the duplicate defines, maybe we could use:
#pragma push_macro("HAVE_MPI")
#undef HAVE_MPI                 // drop OPM/DUNE's definition before pulling in the header
#include "zoltan.h"             // zoltan.h defines HAVE_MPI in its own, special way
#pragma pop_macro("HAVE_MPI")   // restore the original definition afterwards
Is that less messy?
this is not really what I meant: the point is rather that undefing HAVE_MPI before including a header is rather dangerous and might lead to unexpected results: the actual library might still use MPI (because it is using the traditional library/header approach), but the corresponding header thinks it is not available.
okay, with mpich-3.2.1 it does not work either, but the error message seems to be more useful:
mpirun -np 4 ./bin/flow --output-dir=. ../../opm-data/norne/NORNE_ATW2013.DATA
Reading deck file '../../opm-data/norne/NORNE_ATW2013.DATA'
**********************************************************************
* *
* This is flow 2019.04-pre *
* *
* Flow is a simulator for fully implicit three-phase black-oil flow, *
* including solvent and polymer capabilities. *
* For more information, see https://opm-project.org *
* *
**********************************************************************
Reading deck file '../../opm-data/norne/NORNE_ATW2013.DATAReading deck file '../../opm-data/norne/NORNE_ATW2013.DATA'
'
Reading deck file '../../opm-data/norne/NORNE_ATW2013.DATA'
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7ffe0ab6065c, rbuf=0x7ffe0ab60660, count=1, INVALID DATATYPE, op=0x31, comm=0x84000004) failed
MPI_Allreduce(852): Invalid datatype
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7ffc68a88f7c, rbuf=0x7ffc68a88f80, count=1, INVALID DATATYPE, op=0x31, comm=0x84000002) failed
MPI_Allreduce(852): Invalid datatype
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7ffddf47f20c, rbuf=0x7ffddf47f210, count=1, INVALID DATATYPE, op=0x31, comm=0x84000002) failed
MPI_Allreduce(852): Invalid datatype
Fatal error in MPI_Allreduce: Invalid datatype, error stack:
MPI_Allreduce(907): MPI_Allreduce(sbuf=0x7fffba0738ec, rbuf=0x7fffba0738f0, count=1, INVALID DATATYPE, op=0x31, comm=0x84000002) failed
MPI_Allreduce(852): Invalid datatype
the point is rather that undefing HAVE_MPI before including a header is rather dangerous and might lead to unexpected results
Well in this case the problem is that zoltan.h defines HAVE_MPI in its own special way which would interfere with OPM/DUNE. Maybe the source code comment is not clear enough.
Anyway, this could be moved to the *.cpp file in this case which is definitely safer.
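A minimal sketch of what that could look like, assuming the zoltan.h include can be moved from ZoltanGraphFunctions.hpp into the corresponding .cpp file (the file placement is an assumption here):

// ZoltanGraphFunctions.cpp -- keep the zoltan.h include out of the header so the
// HAVE_MPI juggling stays local to this translation unit.
#pragma push_macro("HAVE_MPI")
#undef HAVE_MPI                 // avoid a clash: zoltan.h sets HAVE_MPI its own way
#include <zoltan.h>
#pragma pop_macro("HAVE_MPI")   // restore the HAVE_MPI definition used by OPM/DUNE

That way the rest of the code base never sees zoltan.h's notion of HAVE_MPI.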
Thanks for testing. This is a great help.
Now the only thing missing is a backtrace. Maybe you could exchange the error handler with a custom throwing one? Like defined here and used here
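Roughly, such a throwing handler could look like this (only a sketch; the names are made up, and the "here" links above point to the actual implementation):

#include <mpi.h>
#include <stdexcept>
#include <string>

// Convert MPI errors into C++ exceptions so the failure can be inspected in a
// debugger with an intact stack instead of the process being killed.
void throwingMpiErrorHandler(MPI_Comm* /*comm*/, int* errCode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(*errCode, msg, &len);
    throw std::runtime_error("MPI error: " + std::string(msg, len));
}

// call this right after MPI_Init():
void installThrowingErrorHandler()
{
    MPI_Errhandler handler;
    MPI_Comm_create_errhandler(throwingMpiErrorHandler, &handler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
}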
okay, I tried to get a backtrace, but with MPICH this is not trivial because it kills all child processes once it encounters an error (openMPI sends SIGABRT). this means my debugger gets killed as soon as the error is encountered. I'll step into this manually, but this may take some time...
Maybe start valgrind in parallel to see if there is any memory violation.
we did this already (only with openmpi): nothing. is there a way to tell MPICH not to send SIGKILL to all its child processes on encountering an error?
okay, the error with mpich happens here: https://github.com/OPM/opm-grid/blob/master/opm/grid/common/ZoltanPartition.cpp#L48 . inside zoltan, the error occurs during a call to MPI_Comm_dup() at zz_struct.c:128. As far as I can see, the only obvious thing that can go wrong from the OPM side here is that the cc object is somehow screwed up.
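A stripped-down, stand-alone reproducer along these lines (hypothetical, not part of OPM) exercises the same call path and takes flow out of the equation:

#include <mpi.h>
#include <zoltan.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    float version = 0.0f;
    Zoltan_Initialize(argc, argv, &version);

    // Hand MPI_COMM_WORLD to Zoltan just like ZoltanPartition.cpp does; internally
    // this ends up in MPI_Comm_dup() in zz_struct.c. If Zoltan and this file
    // disagree about what an MPI_Comm is, this is where it should blow up.
    struct Zoltan_Struct* zz = Zoltan_Create(MPI_COMM_WORLD);
    std::printf("Zoltan_Create() returned %p (Zoltan version %.2f)\n",
                static_cast<void*>(zz), version);

    Zoltan_Destroy(&zz);
    MPI_Finalize();
    return 0;
}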
Well, that is a place that should just work. When calling this method, cc should be converted to an MPI_Comm, in particular to MPI_COMM_WORLD. Errors at this place are really beyond my imagination, segmentation faults in particular. I have no idea.
note that I don't get a segmentation fault with mpich, but rather the error above. I'm equally puzzled what the relation between the error message and MPI_Comm_dup() is, though.
Something really fishy is going on with ZOLTAN and MPI: if I add a line like mc = MPI_COMM_WORLD; in e.g. ZoltanPartition.cpp, I get:
(gdb) print mc
$1 = 1140850688
if I do the same inside zoltan's 'zz_struct.c', the result is
(gdb) print mc
$2 = -100
also, passing MPI_COMM_NULL directly as communicator to Zoltan_Create makes MPI really unhappy even though there are a few ifs within that function that explicitly check for this. maybe we should think about switching to a different graph partitioner.
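One way to pin down whether the two translation units really see different MPI definitions would be a two-file check like this (a hypothetical diagnostic, assuming zoltan.h pulls in an MPI_COMM_WORLD definition of its own when Zoltan was configured without the real MPI):

// comm_via_mpi.cpp
#include <mpi.h>
long long commWorldViaMpiHeader()
{ return (long long)MPI_COMM_WORLD; }   // integer handle for MPICH, pointer for openMPI

// comm_via_zoltan.cpp
#include <zoltan.h>
long long commWorldViaZoltanHeader()
{ return (long long)MPI_COMM_WORLD; }

// main.cpp
#include <cstdio>
long long commWorldViaMpiHeader();
long long commWorldViaZoltanHeader();

int main()
{
    // If the two values differ, Zoltan's headers were generated against a
    // different MPI (or against its bundled serial MPI stub).
    std::printf("MPI_COMM_WORLD via <mpi.h>:    %lld\n", commWorldViaMpiHeader());
    std::printf("MPI_COMM_WORLD via <zoltan.h>: %lld\n", commWorldViaZoltanHeader());
    return 0;
}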
I have three ideas:
Concerning 3: Maybe you could try moving the Zoltan_Initialize call from ZoltanPartition.cpp to FlowMainEbos and use it to initialize MPI.
- Zoltan is compiled without MPI support. There is a configure switch --enable-mpi and I am not sure what the default is
indeed: there is a TPL_ENABLE_MPI cmake flag in trilinos which defaults to OFF and I did not enable it so far. I recompiled trilinos with ON, and flow now seems to work fine with MPI. What I don't get is why Zoltan tries to call into MPI anyway if that flag is set to OFF. Also, it smells a lot like this is the problem with the Ubuntu 18.04 Zoltan packages.
I just checked debian/rules in the source package of Ubuntu and it seems to be configured with MPI as they are passing
-DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_Fortran_COMPILER=mpif90 -DTPL_ENABLE_MPI:BOOL=ON
to cmake.
Anyway, I am glad that this is at least sorted out for your platform now.
some more data points.
i have:
1) repacked the debs - same issue
2) built directly on my machine with the same flags - same issue
3) built the known-working 12.4 (ie xenial version) on my machine using the new deb build flags - same issue
4) built the known-working version with the known-working old deb flags - same issue
5) built 12.12 using vanilla flags (just enabling MPI and zoltan) - same issue
everything points to openmpi being broken in ubuntu 18.04 (i did not try mpich, but @blattms reported earlier that it works)
it's also possible that the trilinos debian package is built against mpich while flow uses openmpi. this would also explain why @blattms reported success with mpich. did you check this?
I did check it using ldd.
I did check it using ldd.
did you check that the zoltan debian package uses mpich or did you check that it does not use it? Also, ldd only lists the libraries which are dynamically loaded at runtime, but does not say anything about which implementation's header files were used!? (IOW, since the MPI library is linked using -lmpi, both implementations could be loaded but the headers could mismatch.)
to check which implementation the trilinos debian package uses you need to look at its specfile.
@akva2 did you try to compile the trilinos master in full-manual mode (without building a package)? the cmake command which I currently use for this is:
cmake -DBUILD_SHARED_LIBS=ON -DTPL_ENABLE_MPI=ON -DTPL_ENABLE_X11=OFF -DTPL_ENABLE_Matio=OFF -DTPL_ENABLE_LAPACK=OFF -DTrilinos_ENABLE_ALL_PACKAGES=ON -DTPL_ENABLE_BLAS=OFF -DCMAKE_INSTALL_PREFIX=$(pwd)/../trilinos-install ../trilinos
i build in pbuilder, i have full control of what's around. there is no mpich on there.
ok, how about the trilinos master?
no, have not tested master, i will but in the middle of stuff right now.
no change with trilinos master.
darn! does it work with a self-compiled openmpi?
openmpi in bionic is indeed what's broken.
DISCLAIMER/NOTE: i had to do some hackery to avoid the mpi-default-dev|bin usage since those would not be happy with my backported openmpi packages. so it's not a 100% pure rebuild as such.
yay, no! I've noticed that there are two openMPI packages for 18.04: libopenmpi and libopenmpi2. Maybe it works with the other one? (Even if it works, this probably won't be a real solution because trilinos depends on the wrong one :()
if I start Norne with flow on more than 8 processes, I get a segmentation fault on some ranks. the valgrind output for one of these is the following:
[...]
this seems to be a bug in the grid because (1) it works for e.g. 8 processes, and (2) the error occurs in CartesianIndexMapper.hpp. Debugging this with gdb is challenging because calling cartesianDimensions() does not work: