sblauth opened this issue 1 year ago
lldb gives this stack:
* frame #0: 0x000000010c8f2b79 libpetsc.3.20.0.dylib`PetscOptionsFindPair + 153
frame #1: 0x000000010c8f7595 libpetsc.3.20.0.dylib`PetscOptionsDeprecated_Private + 117
frame #2: 0x000000010c8c3ee0 libpetsc.3.20.0.dylib`PetscInfoProcessClass + 80
frame #3: 0x000000010cec8e5c libpetsc.3.20.0.dylib`MatMFFDInitializePackage + 156
frame #4: 0x000000010ced3a2d libpetsc.3.20.0.dylib`MatInitializePackage + 61
frame #5: 0x000000010cf1b5fd libpetsc.3.20.0.dylib`MatCreate + 45
frame #6: 0x000000010a830ba5 libdolfin.2019.1.0.dylib`dolfin::PETScDMCollection::create_transfer_matrix(dolfin::FunctionSpace const&, dolfin::FunctionSpace const&) + 13621
Tracing it: MatCreate is called here, which ultimately calls MatMFFDInitializePackage -> PetscInfoProcessClass,
and finally the segfault occurs here:
PetscCall(PetscOptionsDeprecated_Private(NULL, "-info_exclude", NULL, "3.13", "Use ~ with -info to indicate classes to exclude"));
which calls:
PetscCall(PetscOptionsFindPair(options, prefix, oldname, &value, &found));
with *prefix=NULL, options=NULL, oldname="-info_exclude"
So the segfault is somewhere in PetscOptionsFindPair. I can reproduce this with the simple C program:
#include "petsc.h"

int main() {
  // Note: PetscInitialize is deliberately NOT called here; as discussed below,
  // that means the default options database does not exist yet.
  PetscBool found;
  const char *value;
  char *prefix = NULL;
  PetscOptions options = NULL;
  const char oldname[] = "-info_exclude";
  PetscCall(PetscOptionsFindPair(options, prefix, oldname, &value, &found));
  return 0;
}
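For contrast, here is a minimal petsc4py sketch (my own illustration, not part of the original report) of the same kind of options lookup once PETSc has been initialized; importing PETSc after petsc4py.init() runs PetscInitialize, which creates the default options database that the C program above is missing:
import sys
import petsc4py
petsc4py.init(sys.argv)     # must come before importing PETSc
from petsc4py import PETSc  # this import triggers PetscInitialize, creating the default options database

opts = PETSc.Options()
# the lookup that segfaulted above now simply reports that the option is not set
print(opts.hasName("info_exclude"))  # prints False, no crash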
I think I tracked it down to a missing PetscOptionsCreateDefault(), so the default options database is undefined. Not sure whose responsibility that is.
Ultimately, I think there's a missing call to PetscInitialize, which would avoid this segfault. I don't know what changed in PETSc 3.20 that caused this, but adding:
SubSystemsManager.init_petsc()
appears to be a workaround to ensure PETSc is initialized. I don't understand enough to say precisely when that should be called in fenics itself, but it seems to be missing somehow.
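A minimal sketch of the ordering this implies (my wording; the full debug script further down does the same thing):
from dolfin import SubSystemsManager
SubSystemsManager.init_petsc()  # make sure PetscInitialize has run
# only afterwards build meshes/function spaces and call
# PETScDMCollection.create_transfer_matrix(...)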
Ultimately, the call that's failing is this one, with an "invalid communicator" error. I don't really know what's going on there, but presumably it's something in the initialization and/or the passing around of the MPI comms.
PetscDolfinErrorHandler: line '208', function 'PetscCommDuplicate', file '/Users/runner/miniforge3/conda-bld/petsc_1696941223631/work/src/sys/objects/tagm.c',
: error code '98' (General MPI error), message follows:
------------------------------------------------------------------------------
MPI error 70865157 Invalid communicator, error stack:
MPII_Comm_get_attr(85): MPI_Comm_get_attr(comm=0x84000003, keyval=0xa4400000, attribute_val=0x305e76450, flag=0x305e76444) failed
MPII_Comm_get_attr(58): Invalid communicator
------------------------------------------------------------------------------
PetscDolfinErrorHandler: line '51', function 'PetscHeaderCreate_Private', file '/Users/runner/miniforge3/conda-bld/petsc_1696941223631/work/src/sys/objects/inherit.c',
: error code '98' (General MPI error), message follows:
------------------------------------------------------------------------------
------------------------------------------------------------------------------
PetscDolfinErrorHandler: line '26', function 'PetscHeaderCreate_Function', file '/Users/runner/miniforge3/conda-bld/petsc_1696941223631/work/src/sys/objects/inherit.c',
: error code '98' (General MPI error), message follows:
------------------------------------------------------------------------------
------------------------------------------------------------------------------
PetscDolfinErrorHandler: line '101', function 'VecCreateWithLayout_Private', file '/Users/runner/miniforge3/conda-bld/petsc_1696941223631/work/src/vec/vec/interface/veccreate.c',
: error code '98' (General MPI error), message follows:
------------------------------------------------------------------------------
------------------------------------------------------------------------------
PetscDolfinErrorHandler: line '9468', function 'MatCreateVecs', file '/Users/runner/miniforge3/conda-bld/petsc_1696941223631/work/src/mat/interface/matrix.c',
: error code '98' (General MPI error), message follows:
------------------------------------------------------------------------------
Debug with:
from mpi4py import MPI  # needed for MPI.COMM_WORLD below
# import petsc4py
# petsc4py.init()  # optionally with args like ["-log_view", "-log_trace", "trace.txt", "-on_error_attach_debugger"]
# from petsc4py import PETSc
import dolfin
from dolfin import UnitSquareMesh, FunctionSpace, as_backend_type, PETScDMCollection, SubSystemsManager

dolfin.cpp.log.set_log_level(dolfin.cpp.log.LogLevel.DEBUG)
SubSystemsManager.init_petsc()

mesh = UnitSquareMesh(comm=MPI.COMM_WORLD, nx=8, ny=8)
V = FunctionSpace(mesh, 'CG', 1)
W = FunctionSpace(mesh, 'CG', 2)
transfer_matrix = as_backend_type(PETScDMCollection.create_transfer_matrix(V, W)).mat()
print(transfer_matrix)

# this is where the "invalid communicator" error appears
_, temp = transfer_matrix.createVecs()
print(temp)
I've tried your code and can only reproduce the exact same behavior - any idea how to fix this? Or is this rather the area of expertise of the PETSc developers?
I'm not sure who is going to know why the MPI communicator is invalid. Clearly something has changed in the initialization somewhere in 3.20, but I can't figure out what, since it's passed around so much.
My hunch is that the communicator handle that fenics grabs by default is not quite right, and maybe the petsc communicator pointer or struct changed, or something like that. Or that an object is reinitialized, but used after being replaced.
The workaround for now is to pin petsc=3.19 in your env to avoid getting the problematic builds.
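For example, in a conda environment file (illustrative snippet):
- petsc=3.19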
So I've been trying to further explore the error, and I'm still rather confused. I have discovered that all of fenics' own transfer matrix tests pass with the new builds, so I'm wondering if there's perhaps some error in how createVecs is being called, or under what circumstances it's safe to call.
In any case, the stack is:
and the problem appears to be in the value of mat->cmap->comm. I'm not quite sure how to debug what that value is or what it should be.
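One way to poke at that value from Python is sketched below (my own suggestion, assuming the transfer_matrix object from the debug script above and standard petsc4py/mpi4py calls):
from mpi4py import MPI

comm = transfer_matrix.getComm()       # communicator attached to the Mat; its row/column layouts normally share it
print(comm.getSize(), comm.getRank())  # an error already here would implicate the communicator itself
mcomm = comm.tompi4py()                # convert to mpi4py for a direct comparison
print(MPI.Comm.Compare(mcomm, MPI.COMM_WORLD))  # IDENT or CONGRUENT would be expected in this serial example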
I have a small update for this issue. It appears to also happen with other versions of PETSc: I just encountered the same problem on my workstation, where I use PETSc 3.17.4. There, the same error (resulting in a segfault) appears for a particular mesh I am working with. Unfortunately, I cannot share that mesh due to confidentiality reasons.
Strangely, the error occurs inconsistently. For example, the segfault happens when I use a VectorFunctionSpace, but not with a scalar FunctionSpace, and it also works for some "DG" elements.
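For reference, a sketch of the kind of variation described (my own stand-in code with a unit square mesh, which may well not reproduce the crash, since the actual mesh cannot be shared):
from dolfin import UnitSquareMesh, VectorFunctionSpace, PETScDMCollection, as_backend_type

mesh = UnitSquareMesh(8, 8)             # stand-in for the confidential mesh
V = VectorFunctionSpace(mesh, "CG", 1)  # vector-valued space: reported to segfault
W = VectorFunctionSpace(mesh, "CG", 2)
M = as_backend_type(PETScDMCollection.create_transfer_matrix(V, W)).mat()
_, tmp = M.createVecs()                 # reported failure point for vector spaces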
I could not fix the issue by either calling
import petsc4py
petsc4py.init()
from petsc4py import PETSc
or calling
fenics.SubSystemsManager.init_petsc()
Do you have any idea why this fails? Is this to be (somewhat) expected? Is there any workaround for this issue? (I have seen your posts on the fenics discord and the PETSc gitlab, so also many thanks for investigating this issue further).
Dear @sblauth,
Not sure if this is useful, but my workaround for this problem is to specify the versions of mpi4py and hdf5 as follows. It is not an ideal solution, but it seems to avoid the problem with create_transfer_matrix in our software.
- hdf5=1.12.2
- mpi4py=3.1.4
Best, Kei
Regarding the error: I have noticed that I don't need to call createVecs at all. Apparently, the matrix itself contains the correct data, so I can do with it what I want (namely multiply it with an existing vec).
So this is, of course, still problematic, but as long as one does not need createVecs, it works.
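A hedged sketch of that workaround, assuming V, W and the PETSc transfer matrix M from the snippets above: instead of asking the matrix for new vectors via createVecs, take the vectors from existing Functions and multiply directly:
from dolfin import Function, as_backend_type

u = Function(V)  # source-space function (provides the input vector)
w = Function(W)  # target-space function (receives the interpolated values)
M.mult(as_backend_type(u.vector()).vec(),  # petsc4py Mat.mult(x, y) computes y = M @ x
       as_backend_type(w.vector()).vec())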
Solution to issue cannot be found in the documentation.
Issue
Hi, I am having an issue using fenics.PETScDMCollection, or rather the result of its method create_transfer_matrix. I have a MWE here:
Using this, I get a segmentation fault or PETSc error 98 (a general MPI error, even though I am not using MPI to run the code).
Is this expected? What workaround is possible? I am using this functionality inside my package to create a reusable interpolation matrix.
Many thanks in advance.
Installed packages
Environment info