OPM / opm-simulators

Simulator programs and utilities for automatic differentiation.
http://www.opm-project.org
GNU General Public License v3.0

Simulation freezes #5494

Open lisajulia opened 1 month ago

lisajulia commented 1 month ago

I wrote a test for something different, and on Jenkins the simulation froze.
Datafile: https://github.com/lisajulia/opm-tests/blob/8b84e28bd63d705ec659976a789cbc5cd7f0a80a/actionx/ACTIONX_COMPDAT_SHORT.DATA
Log file from Jenkins with the frozen simulation, which was then ended by a timeout: https://ci.opm-project.org/job/opm-simulators-PR-builder/6452/testReport/junit/(root)/mpi/compareSeparateECLFiles_flow_actionx_compdat_8_procs/

Flow compiled with the following commits:
opm-common: d075bc889ead20424c695382a077275ddb1c66a3
opm-models: 29582a9f59feec1c9d04286977ab6adef89b12e3
opm-grid: bc501ad7f48676918c594d0c8dd42c405958f758
opm-simulators: ed5f371133976fd58cc98b6354553be59ca32e2b

I ran flow on 8 processes.

In case this error is gone when testing this again, please close!

blattms commented 1 month ago

Sigh, one of my most favorite parallel deadlocks in opm-flow:

7 processes in

(gdb) bt
#0  0x00007ffdcd906f94 in ?? ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so
#1  0x00007fffeac21e1c in opal_progress ()
   from /lib/x86_64-linux-gnu/libopen-pal.so.40
#2  0x00007fffee512bc5 in ompi_request_default_wait ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#3  0x00007fffee56e35b in ompi_coll_base_sendrecv_actual ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#4  0x00007fffee56f9e0 in ompi_coll_base_allreduce_intra_recursivedoubling ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#5  0x00007ffdcd80d8eb in ompi_coll_tuned_allreduce_intra_dec_fixed ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#6  0x00007fffee52a31a in PMPI_Allreduce ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#7  0x0000555555b61a50 in Dune::Communication<ompi_communicator_t*>::allreduce<Dune::Max<Opm::ExceptionType::ExcEnum>, Opm::ExceptionType::ExcEnum> (
    this=0x7fffffffbc80, in=0x7fffffffbc9c, out=0x7fffffffbc6c, len=1)
    at /usr/include/dune/common/parallel/mpicommunication.hh:457
#8  0x0000555555b5df25 in Dune::Communication<ompi_communicator_t*>::max<Opm::ExceptionType::ExcEnum> (this=0x7fffffffbc80, 
    in=@0x7fffffffbc9c: Opm::ExceptionType::NONE)
    at /usr/include/dune/common/parallel/mpicommunication.hh:253
#9  0x0000555555c51b48 in (anonymous namespace)::_throw (
    exc_type=Opm::ExceptionType::NONE, 
    message="BlackoilWellModel::initializeWellState() failed: ", comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:77
#10 0x0000555555c7b473 in checkForExceptionsAndThrow (
    exc_type=Opm::ExceptionType::NONE, 
    message="BlackoilWellModel::initializeWellState() failed: ", comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:108
#11 0x0000555555d61d51 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeWellState (this=0x55555ca17248, timeStepIdx=6)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:835
#12 0x0000555555d32a6d in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeLocalWellStructure (this=0x55555ca17248, reportStepIdx=6, 
    enableWellPIScaling=true)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:327
#13 0x0000555555cff11a in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555ca17248, timeStepIdx=6)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:270
#14 0x0000555555d673e1 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555ca17248)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel.hpp:204
#15 0x0000555555d36c33 in Opm::FlowProblem<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555ca16460)
    at .../opm-simulators/opm/simulators/flow/FlowProblem.hpp:566
#16 0x0000555555cffdf3 in Opm::BlackoilModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555babef60)
    at .../opm-simulators/opm/simulators/flow/BlackoilModel.hpp:1175
#17 0x0000555555ccbf23 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::runStep (this=0x55555d58dd20, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:373
#18 0x0000555555cb7545 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::run (this=0x55555d58dd20, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:268
#19 0x0000555555ca0a6b in Opm::FlowMain<Opm::Properties::TTag::FlowProblemTPFA>::runSimulatorRunCallback_ (this=0x7fffffffcf70)
    at .../opm-simulators/opm/simulators/flow/FlowMain.hpp:484
...

and 1 threw an unexpected exception:

#9  0x0000555555c7b5db in logAndCheckForExceptionsAndThrow (
    deferred_logger=..., exc_type=Opm::ExceptionType::RUNTIME_ERROR, 
    message="Failed to initialize local well structure: [.../opm-simulators/opm/simulators/wells/ParallelWellInfo.cpp:708] Cells with these i,j,k indices were not found in grid (well = PROD3)"..., 
    terminal_output=false, comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:121
121         Opm::DeferredLogger global_deferredLogger = gatherDeferredLogger(deferred_logger, comm);
(gdb) 
#10 0x0000555555d32be0 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeLocalWellStructure (this=0x55555d3283b8, reportStepIdx=6, 
    enableWellPIScaling=true)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:342
342             OPM_END_PARALLEL_TRY_CATCH_LOG(local_deferredLogger,
(gdb) bt
#0  0x00007fffeac798b5 in ?? () from /lib/x86_64-linux-gnu/libopen-pal.so.40
#1  0x00007fffeac21ce7 in ?? () from /lib/x86_64-linux-gnu/libopen-pal.so.40
#2  0x00007fffeac21e74 in opal_progress ()
   from /lib/x86_64-linux-gnu/libopen-pal.so.40
#3  0x00007fffee512bc5 in ompi_request_default_wait ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#4  0x00007fffee56e35b in ompi_coll_base_sendrecv_actual ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#5  0x00007fffee56caf3 in ompi_coll_base_allgather_intra_recursivedoubling ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#6  0x00007ffdcd80eb1a in ompi_coll_tuned_allgather_intra_dec_fixed ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#7  0x00007fffee529437 in PMPI_Allgather ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#8  0x00007ffff6792f2b in Opm::gatherDeferredLogger (local_deferredlogger=..., 
    mpi_communicator=...)
    at .../opm-simulators/opm/simulators/utils/gatherDeferredLogger.cpp:145
#9  0x0000555555c7b5db in logAndCheckForExceptionsAndThrow (
    deferred_logger=..., exc_type=Opm::ExceptionType::RUNTIME_ERROR, 
    message="Failed to initialize local well structure: [.../opm-simulators/opm/simulators/wells/ParallelWellInfo.cpp:708] Cells with these i,j,k indices were not found in grid (well = PROD3)"..., 
    terminal_output=false, comm=...)
    at .../opm-simulators/opm/simulators/utils/DeferredLoggingErrorHelpers.hpp:121
#10 0x0000555555d32be0 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::initializeLocalWellStructure (this=0x55555d3283b8, reportStepIdx=6, 
    enableWellPIScaling=true)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:342
#11 0x0000555555cff11a in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555d3283b8, timeStepIdx=6)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel_impl.hpp:270
#12 0x0000555555d673e1 in Opm::BlackoilWellModel<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555d3283b8)
    at .../opm-simulators/opm/simulators/wells/BlackoilWellModel.hpp:204
#13 0x0000555555d36c33 in Opm::FlowProblem<Opm::Properties::TTag::FlowProblemTPFA>::beginEpisode (this=0x55555d3275d0)
    at .../opm-simulators/opm/simulators/flow/FlowProblem.hpp:566
#14 0x0000555555cffdf3 in Opm::BlackoilModel<Opm::Properties::TTag::FlowProblemTPFA>::beginReportStep (this=0x55555babef60)
    at .../opm-simulators/opm/simulators/flow/BlackoilModel.hpp:1175
#15 0x0000555555ccbf23 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::runStep (this=0x55555d4a8f90, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:373
#16 0x0000555555cb7545 in Opm::SimulatorFullyImplicitBlackoil<Opm::Properties::TTag::FlowProblemTPFA>::run (this=0x55555d4a8f90, timer=...)
    at .../opm-simulators/opm/simulators/flow/SimulatorFullyImplicitBlackoil.hpp:268
#17 0x0000555555ca0a6b in Opm::FlowMain<Opm::Properties::TTag::FlowProblemTPFA>::runSimulatorRunCallback_ (this=0x7fffffffcf70)
    at .../opm-simulators/opm/simulators/flow/FlowMain.hpp:48
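
Read together, the two backtraces show 7 ranks waiting in PMPI_Allreduce and 1 rank waiting in PMPI_Allgather on what appears to be the same communicator. A collective can only finish when every rank enters the same operation, so in practice such a mismatch hangs. The following is a minimal sketch of that condition without MPI; the `Collective` enum and `canComplete` are hypothetical names for illustration, not OPM or MPI code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for "which collective is this rank blocked in".
enum class Collective { Allreduce, Allgather };

// A collective can only complete if every rank of the communicator has
// entered the same operation. If the ranks disagree, none of them makes
// progress, which is what the backtraces above look like.
bool canComplete(const std::vector<Collective>& perRank)
{
    assert(!perRank.empty());
    return std::all_of(perRank.begin(), perRank.end(),
                       [&](Collective c) { return c == perRank.front(); });
}
```

With seven entries set to `Collective::Allreduce` and one set to `Collective::Allgather`, `canComplete` returns false.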

I think this is due to COMPDAT in ACTIONX. Outside of ACTIONX this check is performed on process 0 and the cell is known there. Now it is performed on the parallel load-balanced grid (without our futureCompletions), and the cell is maybe on another process?
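
To illustrate the suspicion: on a load-balanced grid each rank owns only a subset of the cells, so a purely local lookup of an i,j,k index fails on every rank except the owner, even though the cell exists globally. A minimal sketch without MPI, with each rank simulated as a set of owned cells; `Cell`, `ownsCell`, and `ownerRank` are hypothetical names, and this is not how OPM stores its grid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Cell = std::array<int, 3>; // i,j,k index of a grid cell

// Local check: does this rank own the cell?
bool ownsCell(const std::set<Cell>& owned, const Cell& c)
{
    return owned.count(c) > 0;
}

// Global view: index of the rank that owns the cell, or -1 if no rank does.
// Only the -1 case is a genuine "cell not found in grid" error.
int ownerRank(const std::vector<std::set<Cell>>& ranks, const Cell& c)
{
    int owner = -1;
    for (std::size_t r = 0; r < ranks.size(); ++r) {
        if (ownsCell(ranks[r], c)) {
            assert(owner == -1 && "a cell should have a single owner");
            owner = static_cast<int>(r);
        }
    }
    return owner;
}
```

A rank whose `ownsCell` returns false cannot tell on its own whether the cell is missing or merely lives on another rank; that decision needs the global view.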

The real problem here is that our simulator should fail gracefully and not deadlock, even without your upcoming PR #5488, which will close this issue.
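
Failing gracefully in parallel usually means that all ranks agree on the error state at the same point, for example via a max-reduction like the `max` call visible in frame #8 above, and then either all continue or all throw. A minimal sketch of that idea without MPI, with the reduction simulated over a vector holding one status per rank; `ExcType`, `globalStatus`, and `throwOnAllRanksIfAnyFailed` are hypothetical names and not the OPM helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical error codes, ordered so that a larger value is more severe.
enum class ExcType { None = 0, RuntimeError = 1 };

// Stand-in for a max-reduction over all ranks: every rank contributes its
// local status and every rank sees the same global result.
ExcType globalStatus(const std::vector<ExcType>& localStatus)
{
    assert(!localStatus.empty());
    return *std::max_element(localStatus.begin(), localStatus.end());
}

// Meant to be reached by every rank at the same point. Either all ranks
// return or all ranks throw, so no rank is left waiting in a later
// collective that the others never enter.
void throwOnAllRanksIfAnyFailed(const std::vector<ExcType>& localStatus)
{
    if (globalStatus(localStatus) != ExcType::None) {
        throw std::runtime_error("a rank reported an error; stopping all ranks");
    }
}
```

The sketch only works if the failing rank and the healthy ranks reach the same agreement point; in the backtraces above they do not, since the failing rank is in a different collective than the other seven.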

lisajulia commented 1 month ago

Yes, true, in that case the simulator should stop. I'd suggest keeping this issue open so we can have a look later, since this is reproducible and debuggable with the commit IDs.