FEniCS / dolfinx

Next generation FEniCS problem solving environment
https://fenicsproject.org
GNU Lesser General Public License v3.0

[BUG]: Segmentation Violation when using multiple instances of NewtonSolver #3162

Closed: dajuno closed this issue 4 months ago

dajuno commented 4 months ago

Summarize the issue

Using multiple instances of NewtonSolver in a script may fail with a PETSc segmentation violation. The MWE below fails in approximately 50% of runs.

Tested both with Docker and with the Spack build at cbb04f311.

How to reproduce the bug

Run the following MWE, adapted from the Cahn-Hilliard demo, multiple times.

Minimal Example (Python)

from mpi4py import MPI
from petsc4py import PETSc

import numpy as np
import ufl
from basix.ufl import element, mixed_element
from dolfinx import default_real_type
from dolfinx.fem import Function, functionspace
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.mesh import CellType, create_unit_square
from dolfinx.nls.petsc import NewtonSolver
from ufl import dx, grad, inner

lmbda = 1.0e-02  # surface parameter
dt = 5.0e-06  # time step
theta = 0.5

msh = create_unit_square(MPI.COMM_WORLD, 96, 96, CellType.triangle)
P1 = element("Lagrange", msh.basix_cell(), 1)
ME = functionspace(msh, mixed_element([P1, P1]))

q, v = ufl.TestFunctions(ME)
u = Function(ME)  # current solution
u0 = Function(ME)  # solution from previous converged step

# Split mixed functions
c, mu = ufl.split(u)
c0, mu0 = ufl.split(u0)

# Zero u
u.x.array[:] = 0.0

# Interpolate initial condition
rng = np.random.default_rng(42)
u.sub(0).interpolate(lambda x: 0.63 + 0.02 * (0.5 - rng.random(x.shape[1])))
u.x.scatter_forward()

# Compute the chemical potential df/dc
c = ufl.variable(c)
f = 100 * c**2 * (1 - c) ** 2
dfdc = ufl.diff(f, c)

# mu_(n+theta)
mu_mid = (1.0 - theta) * mu0 + theta * mu

# Weak statement of the equations
F0 = inner(c, q) * dx - inner(c0, q) * dx + dt * inner(grad(mu_mid), grad(q)) * dx
F1 = inner(mu, v) * dx - inner(dfdc, v) * dx - lmbda * inner(grad(c), grad(v)) * dx
F = F0 + F1
problem = NonlinearProblem(F, u)

def setup(problem):
    """Create nonlinear problem and Newton solver"""
    solver = NewtonSolver(MPI.COMM_WORLD, problem)
    solver.convergence_criterion = "incremental"
    solver.rtol = np.sqrt(np.finfo(default_real_type).eps) * 1e-2

    # We can customize the linear solver used inside the NewtonSolver by
    # modifying the PETSc options
    ksp = solver.krylov_solver
    opts = PETSc.Options()  # type: ignore
    option_prefix = ksp.getOptionsPrefix()
    opts[f"{option_prefix}ksp_type"] = "preonly"
    opts[f"{option_prefix}pc_type"] = "lu"
    sys = PETSc.Sys()  # type: ignore
    # For factorisation prefer MUMPS, then superlu_dist, then default.
    if sys.hasExternalPackage("mumps"):
        opts[f"{option_prefix}pc_factor_mat_solver_type"] = "mumps"
    elif sys.hasExternalPackage("superlu_dist"):
        opts[f"{option_prefix}pc_factor_mat_solver_type"] = "superlu_dist"
    ksp.setFromOptions()

    return solver

u0.x.array[:] = u.x.array
solver = setup(problem)
solver.solve(u)

# setup new solver
solver = setup(problem)
solver.solve(u)  # this produces the error

Output (Python)

[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
Abort(59) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0

Version

main branch

DOLFINx git commit

cbb04f311a15d788f6dd7a0e1fe2e78c693f68b8

Installation

docker nightly & spack main (both at cbb04f3)

Additional information

I did not observe the issue when using PETSc SNES directly instead of dolfinx's NewtonSolver in the code where the problem originally occurred.
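For reference, a minimal sketch of how SNES can be driven from petsc4py for this kind of problem (simplified, not the exact code used here: no Dirichlet BCs, the SNESProblem helper class is illustrative, and details such as u.x.petsc_vec vs. u.vector may need adjusting for your dolfinx version):

from mpi4py import MPI
from petsc4py import PETSc

import ufl
import dolfinx
from dolfinx.fem.petsc import (assemble_matrix, assemble_vector,
                               create_matrix, create_vector)


class SNESProblem:
    """Wrap residual and Jacobian assembly for petsc4py's SNES (no Dirichlet BCs)."""

    def __init__(self, F, u):
        V = u.function_space
        du = ufl.TrialFunction(V)
        self.L = dolfinx.fem.form(F)  # residual form
        self.a = dolfinx.fem.form(ufl.derivative(F, u, du))  # Jacobian form
        self.u = u

    def F(self, snes, x, b):
        # Copy the SNES iterate into u, then assemble the residual into b
        x.ghostUpdate(addv=PETSc.InsertMode.INSERT, mode=PETSc.ScatterMode.FORWARD)
        x.copy(self.u.x.petsc_vec)  # older dolfinx: self.u.vector
        self.u.x.scatter_forward()
        with b.localForm() as b_local:
            b_local.set(0.0)
        assemble_vector(b, self.L)
        b.ghostUpdate(addv=PETSc.InsertMode.ADD, mode=PETSc.ScatterMode.REVERSE)

    def J(self, snes, x, A, P):
        # Assemble the Jacobian into A
        A.zeroEntries()
        assemble_matrix(A, self.a)
        A.assemble()


# Usage with F and u from the MWE above
snes_problem = SNESProblem(F, u)
b = create_vector(snes_problem.L)
A = create_matrix(snes_problem.a)

snes = PETSc.SNES().create(MPI.COMM_WORLD)
snes.setFunction(snes_problem.F, b)
snes.setJacobian(snes_problem.J, A)
snes.getKSP().setType("preonly")
snes.getKSP().getPC().setType("lu")
snes.setTolerances(rtol=1e-9, max_it=50)
snes.solve(None, u.x.petsc_vec)
u.x.scatter_forward()

Note that the linear solver is configured directly on the SNES's KSP object rather than via PETSc.Options().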

RemDelaporteMathurin commented 4 months ago

We started noticing this in our CI too: https://github.com/festim-dev/FESTIM/pull/764

RemDelaporteMathurin commented 4 months ago

I tested this MWE locally with Conda in WSL and can confirm that the bug occurs randomly.

RemDelaporteMathurin commented 4 months ago

@dajuno I don't think the issue is having multiple instances of NewtonSolver, but rather something to do with the calls to PETSc.Options().

If you don't modify the options of the Krylov solver, the random segfault disappears.
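For example, the setup() from the MWE above could set the same solver choices directly on the KSP/PC objects via petsc4py instead of going through PETSc.Options(). This is an untested sketch and I haven't verified that it avoids the crash:

from mpi4py import MPI
from petsc4py import PETSc

import numpy as np
from dolfinx import default_real_type
from dolfinx.nls.petsc import NewtonSolver


def setup(problem):
    """Like setup() in the MWE above, but configuring the KSP/PC directly."""
    solver = NewtonSolver(MPI.COMM_WORLD, problem)
    solver.convergence_criterion = "incremental"
    solver.rtol = np.sqrt(np.finfo(default_real_type).eps) * 1e-2

    ksp = solver.krylov_solver
    ksp.setType("preonly")  # direct solve, no Krylov iterations
    pc = ksp.getPC()
    pc.setType("lu")
    # Prefer MUMPS for the factorisation if it is available
    if PETSc.Sys().hasExternalPackage("mumps"):
        pc.setFactorSolverType("mumps")
    return solver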

RemDelaporteMathurin commented 4 months ago

I've investigated this a bit more and it looks like it's a combination of factors.

I've managed to isolate the issue to:

from mpi4py import MPI
import ufl
from basix.ufl import element
import dolfinx
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.nls.petsc import NewtonSolver

msh = dolfinx.mesh.create_unit_square(
    MPI.COMM_WORLD, 8, 8, dolfinx.mesh.CellType.triangle
)
P1 = element("Lagrange", msh.basix_cell(), 1)
V = dolfinx.fem.functionspace(msh, P1)
u = dolfinx.fem.Function(V)
v = ufl.TestFunction(V)

F = ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx
problem = NonlinearProblem(F, u)

for i in range(10):
    solver = NewtonSolver(MPI.COMM_WORLD, problem)
    ksp = solver.krylov_solver

In this MWE, if you either:

Then the segfault doesn't occur.

garth-wells commented 4 months ago

@dajuno, @RemDelaporteMathurin could you check if #3190 resolves the issue that you're seeing?

dajuno commented 4 months ago

I'm not sure how to build it; I tried to follow the instructions in https://github.com/FEniCS/dolfinx/blob/main/docker/Dockerfile.end-user:

# To build from source, first checkout the DOLFINx, FFCx, Basix and UFL
# repositories into the working directory, e.g.:
#
# $ ls $(pwd)
# dolfinx  ffcx  basix  ufl
#
# Then run the commands:
#
#    docker pull dolfinx/dolfinx-onbuild:nightly
#    echo "FROM dolfinx/dolfinx-onbuild:nightly" | docker build -f- .

(with #3190 checked out in the dolfinx directory), but I get the following error:

Processing ./python
  Installing build dependencies: started
  Installing build dependencies: still running...
  Installing build dependencies: still running...
  Installing build dependencies: finished with status 'error'
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [7 lines of output]
      WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3190>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
      WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3490>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
      WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3730>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
      WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab38e0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
      WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3a90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
      ERROR: Could not find a version that satisfies the requirement scikit-build-core>=0.5.0 (from versions: none)
      ERROR: No matching distribution found for scikit-build-core>=0.5.0
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
The command '/bin/sh -c cd basix && cmake -G Ninja -DCMAKE_BUILD_TYPE=${DOLFINX_CMAKE_BUILD_TYPE} -B build-dir -S ./cpp &&     cmake --build build-dir &&     cmake --install build-dir &&     python3 -m pip install ./python &&     cd ../ufl && pip3 install --no-cache-dir . &&     cd ../ffcx && pip3 install --no-cache-dir . &&     cd ../ && pip3 install --no-cache-dir ipython' returned a non-zero code: 1

RemDelaporteMathurin commented 4 months ago

@garth-wells I can't test this today (I'm also not super familiar with building dolfinx from source). Can we also add a regression test for this to the CI?
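Something like the reduced MWE above wrapped in a pytest function might do (just a sketch; the test name is made up):

from mpi4py import MPI

import ufl
from basix.ufl import element
import dolfinx
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.nls.petsc import NewtonSolver


def test_repeated_newton_solver_creation():
    """Creating several NewtonSolvers for the same problem must not segfault."""
    msh = dolfinx.mesh.create_unit_square(
        MPI.COMM_WORLD, 8, 8, dolfinx.mesh.CellType.triangle
    )
    V = dolfinx.fem.functionspace(msh, element("Lagrange", msh.basix_cell(), 1))
    u = dolfinx.fem.Function(V)
    v = ufl.TestFunction(V)
    F = ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx
    problem = NonlinearProblem(F, u)
    for _ in range(10):
        solver = NewtonSolver(MPI.COMM_WORLD, problem)
        _ = solver.krylov_solver  # accessing the KSP is part of the trigger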

RemDelaporteMathurin commented 4 months ago

@garth-wells is there a way to release a patch with the fix?

RemDelaporteMathurin commented 3 months ago

Hi @garth-wells, do you know when this fix is going to be released? We would like to upgrade to dolfinx>=0.8, but our CI won't run until it is. Or is there a workaround in the meantime?