GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1

OMP-enabled GEOS terminates abnormally with a SIGSEGV #2827

Closed drmichaeltcvx closed 8 months ago

drmichaeltcvx commented 11 months ago

Describe the bug OMP-enabled GEOS terminates abnormally with a SIGSEGV while simulating the model "./SPE10_refined.xml" from [https://github.com/GEOS-DEV/MAELSTROM/tree/master/usecases/francois/SPE10/flow] when more than one OMP thread is active.

To Reproduce Steps to reproduce the behavior:

  1. Run GEOS with

OMP_NUM_THREADS=20 mpirun -np 1 $(which geosx) -i ./SPE10_refined.xml \
    -t runtime-report,max_column_width=200,calc.inclusive,mpi-report -x 1 -y 1 -z 1

  2. After some time it generates the following output:
...
compflow: Max relative pressure change during time step: 4.598 %
compflow: Max absolute phase volume fraction change during time step: 0.052
compflow: Time-step required will be increased based on state change.
Time: 2.58e+04 s, dt: 8163.258750212592 s, Cycle: 4
    Attempt:  0, ConfigurationIter:  0, NewtonIter:  0
        ( Rflow ) = ( 5.96e-02 )        ( R ) = ( 5.96e-02 )
        MGR preconditioner: numComponentsPerField = [3]
        Linear Solver | Success | Iterations: 22 | Final Rel Res: 0.00064163 | Make Restrictor Time: 0 | Compute Auu Time: 0 | SC Filter Time: 0 | Setup Time: 45.5268 s | Solve Time: 217.744 s
        compflow: Max pressure change: 1575956.141 Pa (before scaling)
        compflow: Max component density change: 45.239 kg/m3 (before scaling)
        compflow: Global solution scaling factor = 1
compflow: Max deltaPhaseVolFrac = 0.045105931221652684
    Attempt:  0, ConfigurationIter:  0, NewtonIter:  1
        ( Rflow ) = ( 1.57e-02 )        ( R ) = ( 1.57e-02 )
        Last LinSolve(iter,res) = (  22, 6.42e-04 )
        MGR preconditioner: numComponentsPerField = [3]
        Linear Solver | Success | Iterations: 16 | Final Rel Res: 0.000614216 | Make Restrictor Time: 0 | Compute Auu Time: 0 | SC Filter Time: 0 | Setup Time: 45.7587 s | Solve Time: 152.552 s
        compflow: Max pressure change: 119510.797 Pa (before scaling)
        compflow: Max component density change: 2.287 kg/m3 (before scaling)
        compflow: Global solution scaling factor = 1
compflow: Max deltaPhaseVolFrac = 0.0022841038896992405
    Attempt:  0, ConfigurationIter:  0, NewtonIter:  2
        ( Rflow ) = ( 7.33e-05 )        ( R ) = ( 7.33e-05 )
        Last LinSolve(iter,res) = (  16, 6.14e-04 )
compflow: Max relative pressure change during time step: 4.554 %
compflow: Max absolute phase volume fraction change during time step: 0.047
compflow: Time-step required will be increased based on state change.
Time: 3.39e+04 s, dt: 11077.291615478814 s, Cycle: 5
    Attempt:  0, ConfigurationIter:  0, NewtonIter:  0
        ( Rflow ) = ( 5.63e-02 )        ( R ) = ( 5.63e-02 )
        MGR preconditioner: numComponentsPerField = [3]
        Linear Solver | Success | Iterations: 27 | Final Rel Res: 0.000747275 | Make Restrictor Time: 0 | Compute Auu Time: 0 | SC Filter Time: 0 | Setup Time: 45.8097 s | Solve Time: 275.458 s
        compflow: Max pressure change: 1567308.382 Pa (before scaling)
        compflow: Max component density change: 47.543 kg/m3 (before scaling)
        compflow: Global solution scaling factor = 1
compflow: Max deltaPhaseVolFrac = 0.04744675929757902
    Attempt:  0, ConfigurationIter:  0, NewtonIter:  1
        ( Rflow ) = ( 1.75e-02 )        ( R ) = ( 1.75e-02 )
        Last LinSolve(iter,res) = (  27, 7.47e-04 )
        MGR preconditioner: numComponentsPerField = [3]
        Linear Solver | Success | Iterations: 17 | Final Rel Res: 0.00090827 | Make Restrictor Time: 0 | Compute Auu Time: 0 | SC Filter Time: 0 | Setup Time: 45.9779 s | Solve Time: 162.657 s
        compflow: Max pressure change: 143952.444 Pa (before scaling)
        compflow: Max component density change: 2.073 kg/m3 (before scaling)
        compflow: Global solution scaling factor = 1
compflow: Max deltaPhaseVolFrac = 0.0020719133774779186
    Attempt:  0, ConfigurationIter:  0, NewtonIter:  2
        ( Rflow ) = ( 6.98e-05 )        ( R ) = ( 6.98e-05 )
        Last LinSolve(iter,res) = (  17, 9.08e-04 )
compflow: Max relative pressure change during time step: 4.310 %
compflow: Max absolute phase volume fraction change during time step: 0.047
compflow: Time-step required will be increased based on state change.
Time: 4.50e+04 s, dt: 15052.402670751115 s, Cycle: 6
    Attempt:  0, ConfigurationIter:  0, NewtonIter:  0
        ( Rflow ) = ( 6.06e-02 )        ( R ) = ( 6.06e-02 )
        MGR preconditioner: numComponentsPerField = [3]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aab053e4700 (LWP 74345)]
hypre_SeqVectorSetConstantValuesHost._omp_fn.0 () at vector.c:340
340 vector.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install blas-3.4.2-8.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-326.el7_9.x86_64 lapack-3.4.2-8.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libgfortran-4.8.5-44.el7.x86_64 libibverbs-58mlnx43-1.58203.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-58mlnx43-1.58203.x86_64 libxpmem-2.6.4-1.58203.rhel7u9.x86_64 numactl-libs-2.0.12-5.el7.x86_64 sssd-client-1.16.5-10.el7_9.15.x86_64 systemd-libs-219-78.el7_9.7.x86_64 xz-libs-5.2.2-2.el7_9.x86_64
(gdb)  where
#0  hypre_SeqVectorSetConstantValuesHost._omp_fn.0 () at vector.c:340
#1  0x00002aaab6ec8a86 in gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:123
#2  0x00002aaab617dea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aaab7403b0d in clone () from /lib64/libc.so.6
  3. In a debugger we can see:
    #0  hypre_SeqVectorSetConstantValuesHost._omp_fn.0 () at vector.c:340
    #1  0x00002aaab6ec8a86 in gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:123
    #2  0x00002aaab617dea5 in start_thread () from /lib64/libpthread.so.0
    #3  0x00002aaab7403b0d in clone () from /lib64/libc.so.6
    (gdb) directory /home/mtml/src/GEOS/thirdPartyLibs/
    .git/           .gitattributes  .github/        .gitignore      .gitmodules     CMakeLists.txt  cmake/          docker/         host-configs/   scripts/        tpl.cpp         tplMirror/      
    (gdb) !which geosx
    /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-CPU-OPTO1-Hypre-GCC_10.2.0-ompi_hpcx-OMP-relwithdebinfo/bin/geosx
    (gdb) list
    335 in vector.c
    (gdb) info threads
    Id   Target Id         Frame 
    * 8    Thread 0x2aab053e4700 (LWP 74345) "geosx" hypre_SeqVectorSetConstantValuesHost._omp_fn.0 () at vector.c:340
    7    Thread 0x2aaae19d0700 (LWP 74344) "geosx" hypre_SeqVectorSetConstantValuesHost._omp_fn.0 () at vector.c:340
    6    Thread 0x2aaae095d700 (LWP 74343) "geosx" futex_wait (val=455624, addr=0xfab934) at ../../../libgomp/config/linux/x86/futex.h:44
    5    Thread 0x2aaad7348700 (LWP 74321) "async" 0x00002aaab74040e3 in epoll_wait () from /lib64/libc.so.6
    4    Thread 0x2aaad57eb700 (LWP 74318) "fuse" 0x00002aaab618475d in read () from /lib64/libpthread.so.0
    3    Thread 0x2aaaca54a700 (LWP 74307) "geosx" 0x00002aaab74040e3 in epoll_wait () from /lib64/libc.so.6
    2    Thread 0x2aaac7534700 (LWP 74277) "geosx" 0x00002aaab74040e3 in epoll_wait () from /lib64/libc.so.6
    1    Thread 0x2aaaaab44ec0 (LWP 74014) "geosx" futex_wait (val=455624, addr=0xfab934) at ../../../libgomp/config/linux/x86/futex.h:44
    (gdb) print
    When setting a breakpoint at the function that fails, we get:
    Breakpoint 1, hypre_SeqVectorSetConstantValuesHost (v=0x1099950, value=value@entry=0) at vector.c:328
    328 vector.c: No such file or directory.
    (gdb) where
    #0  hypre_SeqVectorSetConstantValuesHost (v=0x1099950, value=value@entry=0) at vector.c:328
    #1  0x00002aaaae97e4f8 in hypre_SeqVectorSetConstantValues (v=<optimized out>, value=value@entry=0) at vector.c:378
    #2  0x00002aaaae966c7e in hypre_ParVectorSetConstantValues (v=<optimized out>, value=value@entry=0) at par_vector.c:327
    #3  0x00002aaaac8e3613 in geos::HypreVector::create (this=this@entry=0xe61520, localSize=<optimized out>, comm=<optimized out>) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/linearAlgebra/interfaces/hypre/HypreVector.cpp:115
    #4  0x00002aaaacb3a844 in geos::SolverBase::setupSystem (this=0xe612b0, domain=..., dofManager=..., localMatrix=..., rhs=..., solution=..., setSparsity=true) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/physicsSolvers/SolverBase.cpp:1078
    #5  0x00002aaaacb3566f in geos::SolverBase::solverStep (this=0xe612b0, time_n=@0x7fffffff4998: 0, dt=@0x7fffffff4988: 10000, cycleNumber=0, domain=...) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/physicsSolvers/SolverBase.cpp:218
    #6  0x00002aaaacb36bcd in geos::SolverBase::execute (this=0xe612b0, time_n=0, dt=10000, cycleNumber=0, domain=...) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/physicsSolvers/SolverBase.cpp:251
    #7  0x00002aaaae7fd870 in geos::EventBase::execute (this=0xe48200, time_n=0, dt=10000, cycleNumber=0, domain=...) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/events/EventBase.cpp:233
    #8  0x00002aaaae801dc2 in geos::EventManager::run (this=this@entry=0xe36510, domain=...) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/events/EventManager.cpp:193
    #9  0x00002aaaae814692 in geos::ProblemManager::runSimulation (this=<optimized out>) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/mainInterface/ProblemManager.cpp:1081
    #10 0x00002aaaae811223 in geos::GeosxState::run (this=this@entry=0x7fffffff52a0) at /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/mainInterface/GeosxState.cpp:177
    #11 0x000000000040b65f in main (argc=<optimized out>, argv=0x7fffffff55a8) at /dev/shm/mtml/src/GEOS/GEOS/src/main/main.cpp:46
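
The `info threads` listing above can be summarized mechanically to see how many threads are stopped inside the faulting OpenMP region. The following is a minimal sketch and not part of GEOS or hypre: the helper name `threads_in_function`, the regular expression, and the shortened `SAMPLE` text are assumptions tailored to the gdb output format shown in this report.

```python
import re

# Matches gdb "info threads" lines such as:
#   * 8    Thread 0x2aab053e4700 (LWP 74345) "geosx" hypre_... () at vector.c:340
# Group 1 is the gdb thread id, group 2 is the frame description.
THREAD_LINE = re.compile(
    r'^\s*\*?\s*(\d+)\s+Thread\s+0x[0-9a-f]+\s+\(LWP\s+\d+\)\s+"[^"]*"\s+(.*)$'
)


def threads_in_function(info_threads_text, function_name):
    """Return the gdb thread ids whose current frame mentions function_name."""
    ids = []
    for line in info_threads_text.splitlines():
        match = THREAD_LINE.match(line)
        if match and function_name in match.group(2):
            ids.append(int(match.group(1)))
    return ids


# A few lines copied from the gdb session in this report.
SAMPLE = """\
* 8    Thread 0x2aab053e4700 (LWP 74345) "geosx" hypre_SeqVectorSetConstantValuesHost._omp_fn.0 () at vector.c:340
  7    Thread 0x2aaae19d0700 (LWP 74344) "geosx" hypre_SeqVectorSetConstantValuesHost._omp_fn.0 () at vector.c:340
  6    Thread 0x2aaae095d700 (LWP 74343) "geosx" futex_wait (val=455624, addr=0xfab934) at ../../../libgomp/config/linux/x86/futex.h:44
  1    Thread 0x2aaaaab44ec0 (LWP 74014) "geosx" futex_wait (val=455624, addr=0xfab934) at ../../../libgomp/config/linux/x86/futex.h:44
"""

if __name__ == "__main__":
    print(threads_in_function(SAMPLE, "hypre_SeqVectorSetConstantValuesHost"))
```

For the listing in this report, threads 8 and 7 are inside `hypre_SeqVectorSetConstantValuesHost._omp_fn.0`, while threads 6 and 1 are waiting in `futex_wait`.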

A minimal test case is simply running the "SPE10_refined.xml" model from [https://github.com/GEOS-DEV/MAELSTROM/tree/master/usecases/francois/SPE10/flow]

Expected behavior The model should run to completion.

drmichaeltcvx commented 11 months ago

Note that OMP-enabled GEOS runs to completion with this model when only one OMP thread is active (OMP_NUM_THREADS=1).
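
To narrow down the smallest thread count that triggers the crash, the run can be repeated over several `OMP_NUM_THREADS` values. The following is a minimal sketch under stated assumptions: it reuses the `mpirun`/`geosx` command from this report without the `-t` profiling options, the helper names are hypothetical, and it assumes a crash shows up either as a negative return code (process killed by a signal) or through the shell convention of 128 plus the signal number. `mpirun` implementations may report failures differently, so the mapping should be checked on the target system.

```python
import os
import signal
import subprocess


def build_command(input_file="./SPE10_refined.xml"):
    """Command line from the report, minus the -t profiling options."""
    return ["mpirun", "-np", "1", "geosx", "-i", input_file,
            "-x", "1", "-y", "1", "-z", "1"]


def describe_exit(returncode):
    """Map a process return code to 'ok', a signal name, or 'exit N'."""
    if returncode == 0:
        return "ok"
    # Negative: subprocess saw the child die from a signal.
    # Above 128: common shell convention of 128 + signal number (assumption).
    signum = -returncode if returncode < 0 else returncode - 128
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"exit {returncode}"


def sweep(thread_counts, command=None):
    """Run the case once per thread count and collect the outcomes."""
    command = command or build_command()
    results = {}
    for n in thread_counts:
        # Only OMP_NUM_THREADS changes between runs.
        env = dict(os.environ, OMP_NUM_THREADS=str(n))
        completed = subprocess.run(command, env=env)
        results[n] = describe_exit(completed.returncode)
    return results


if __name__ == "__main__":
    for n, outcome in sweep([1, 2, 4, 8, 20]).items():
        print(f"OMP_NUM_THREADS={n}: {outcome}")
```

Each run of this model is long (the log above shows linear solves taking hundreds of seconds), so a short list of thread counts is a more practical starting point than an exhaustive sweep.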

drmichaeltcvx commented 8 months ago

We are no longer running into this issue.