Closed · dajuno closed this issue 4 months ago
We started noticing this in our CI too https://github.com/festim-dev/FESTIM/pull/764
I tested this MWE locally with Conda in WSL and can confirm the bug randomly occurs
@dajuno I don't think the issue is having multiple instances of NewtonSolver
but rather something to do with the calls to petsc4py.Options()
If you don't modify the options of the krylov solver then the random segfault disappears
I've investigated this a bit more and it looks like it's a combination of multiple NewtonSolver instances and accessing .krylov_solver.
I've managed to isolate the issue to:
from mpi4py import MPI
import ufl
from basix.ufl import element
import dolfinx
from dolfinx.fem.petsc import NonlinearProblem
from dolfinx.nls.petsc import NewtonSolver
msh = dolfinx.mesh.create_unit_square(
    MPI.COMM_WORLD, 8, 8, dolfinx.mesh.CellType.triangle
)
P1 = element("Lagrange", msh.basix_cell(), 1)
V = dolfinx.fem.functionspace(msh, P1)
u = dolfinx.fem.Function(V)
v = ufl.TestFunction(V)
F = ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx
problem = NonlinearProblem(F, u)
for i in range(10):
    solver = NewtonSolver(MPI.COMM_WORLD, problem)
    ksp = solver.krylov_solver
In this MWE, if you remove either ingredient (the repeated NewtonSolver construction or the .krylov_solver access), then the segfault doesn't occur.
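One plausible mechanism, and this is speculation on my part rather than something confirmed in the thread, is destruction order across loop iterations: rebinding `solver` destroys the previous instance (and its PETSc objects) while a handle taken from it may still be around. A loose pure-Python analogy, with stand-in classes that are not the actual DOLFINx internals:

```python
# Pure-Python analogy (hypothetical, not DOLFINx internals): each "solver"
# owns a resource that is invalidated when the solver is garbage-collected,
# while a separately held handle can still point at that resource.

class Handle:
    """Stands in for the KSP object returned by solver.krylov_solver."""
    def __init__(self, resource):
        self.resource = resource  # may outlive the solver that owns it

class Solver:
    """Stands in for NewtonSolver: owns a resource, frees it on collection."""
    def __init__(self, registry):
        self.resource = {"valid": True}
        registry.append(self.resource)

    @property
    def krylov_solver(self):
        return Handle(self.resource)

    def __del__(self):
        # Analogous to PETSc destroying the underlying KSP.
        self.resource["valid"] = False

registry = []
for i in range(3):
    # Rebinding `solver` collects the previous Solver immediately under
    # CPython refcounting, invalidating the resource its handle pointed at.
    solver = Solver(registry)
    ksp = solver.krylov_solver

print([r["valid"] for r in registry])  # → [False, False, True]
```

Only the resource from the final iteration survives; every earlier one was invalidated even though a handle to it could still have been in use.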
@dajuno, @RemDelaporteMathurin could you check if #3190 resolves the issue that you're seeing?
I'm not sure how to build it; I tried to follow the instructions in https://github.com/FEniCS/dolfinx/blob/main/docker/Dockerfile.end-user
# To build from source, first checkout the DOLFINx, FFCx, Basix and UFL
# repositories into the working directory, e.g.:
#
# $ ls $(pwd)
# dolfinx ffcx basix ufl
#
# Then run the commands:
#
# docker pull dolfinx/dolfinx-onbuild:nightly
# echo "FROM dolfinx/dolfinx-onbuild:nightly" | docker build -f- .
(#3190 checked out in dolfinx) but get the error:
Processing ./python
Installing build dependencies: started
Installing build dependencies: still running...
Installing build dependencies: still running...
Installing build dependencies: finished with status 'error'
error: subprocess-exited-with-error
× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> [7 lines of output]
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3190>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3490>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3730>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab38e0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7142d2ab3a90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/scikit-build-core/
ERROR: Could not find a version that satisfies the requirement scikit-build-core>=0.5.0 (from versions: none)
ERROR: No matching distribution found for scikit-build-core>=0.5.0
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
The command '/bin/sh -c cd basix && cmake -G Ninja -DCMAKE_BUILD_TYPE=${DOLFINX_CMAKE_BUILD_TYPE} -B build-dir -S ./cpp && cmake --build build-dir && cmake --install build-dir && python3 -m pip install ./python && cd ../ufl && pip3 install --no-cache-dir . && cd ../ffcx && pip3 install --no-cache-dir . && cd ../ && pip3 install --no-cache-dir ipython' returned a non-zero code: 1
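The repeated "Temporary failure in name resolution" warnings indicate that the pip subprocess inside the build container had no network access, so PyPI was unreachable. One possible workaround, which is my assumption and not something suggested in the thread, is to let the build use the host's network:

```shell
# --network=host is a standard `docker build` flag; the rest mirrors the
# command quoted above from Dockerfile.end-user.
echo "FROM dolfinx/dolfinx-onbuild:nightly" | docker build --network=host -f- .
```

If the host itself resolves names fine, this usually gets pip working inside the build; otherwise the daemon's DNS configuration would need a look.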
@garth-wells I can't test this today (I'm also not super familiar with building dolfinx from source). Can we also add a CI test for this?
@garth-wells is there a way to release a patch with the fix?
Hi @garth-wells, do you know when this fix is going to be released? We would like to upgrade to dolfinx>=0.8, but our CI won't run until this is released. Or is there a workaround in the meantime?
Summarize the issue
Using multiple instances of NewtonSolver in a script may fail with a PETSc segmentation violation. The MWE below fails in approximately 50% of runs. Tested both on Docker and with the Spack build @ cbb04f311.
How to reproduce the bug
Run the following MWE, adapted from the Cahn-Hilliard demo, multiple times.
Minimal Example (Python)
Output (Python)
Version
main branch
DOLFINx git commit
cbb04f311a15d788f6dd7a0e1fe2e78c693f68b8
Installation
docker nightly & spack main (both at cbb04f3)
Additional information
I did not observe the issue when using SNES instead of dolfinx's NewtonSolver in the code where the problem originally occurred.