BAMresearch / FenicsXConcrete

MIT License

FenicsXConcrete provoking memory segmentation errors #102

Closed · danielandresarcones closed this issue 1 year ago

danielandresarcones commented 1 year ago

I was trying to train a Gaussian Process (GP) surrogate in the same script where I am using FenicsXConcrete, and I started receiving a segmentation fault that crashed my program. I have managed to track the error down to the FenicsXConcrete imports. The GP is trained using scikit-learn, which uses PETSc for the minimization routines and is totally independent of FenicsXConcrete. When I run the program, everything works as intended until the minimization step, where sklearn calls the routine that uses PETSc. If I don't import FenicsXConcrete, nothing crashes. I replicated the error with a minimal working example in a brand-new environment. To run it, just install sklearn in your environment (conda install -c conda-forge scikit-learn) and run the following Python script:

import numpy as np

# This import is never used below, but it alone is enough to trigger the crash.
from fenicsxconcrete.experimental_setup.cantilever_beam import CantileverBeam  # noqa: F401
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

if __name__ == "__main__":
    # Fit a trivial GP on dummy data; the segmentation fault occurs inside fit().
    kernel = ConstantKernel() * RBF()
    gp = GaussianProcessRegressor(kernel=kernel)
    gp.fit(np.array([0, 1, 2, 3]).reshape(-1, 1), np.array([4, 5, 6, 7]).reshape(-1, 1))

Any modifications to the GP result in the same error, a SEGV:

[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

If I import any other module from FenicsXConcrete, the result is the same. I could simply train the GP with a different package that does not use PETSc, but that would not solve the underlying problem.

Any idea about what is causing this issue and how to solve it?

eriktamsen commented 1 year ago

I have no idea; I assume it has something to do with the packages that fenicsxconcrete requires. I would guess that it is somehow connected to dolfinx. Maybe try importing dolfinx/fenicsx without fenicsxconcrete to check. Otherwise, work your way down the list of required packages.
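
To make that bisection concrete, here is a minimal sketch (assuming dolfinx is the suspect; if this run is clean, repeat with other entries from the dependency list):

import numpy as np

# Import the suspected dependency directly, without fenicsxconcrete.
import dolfinx  # noqa: F401

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

if __name__ == "__main__":
    # If fit() still segfaults here, the conflict sits in dolfinx (or one of
    # its own dependencies), not in fenicsxconcrete itself.
    kernel = ConstantKernel() * RBF()
    gp = GaussianProcessRegressor(kernel=kernel)
    gp.fit(np.array([0, 1, 2, 3]).reshape(-1, 1), np.array([4, 5, 6, 7]).reshape(-1, 1))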

joergfunger commented 1 year ago

One option is actually separating your code (e.g. using snakemake) such that each task runs in its own conda environment (if that is possible). Otherwise, could you track down which package in FenicsXConcrete causes the problem, and then potentially contact the developers of that package?
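
For illustration, a minimal Snakefile sketch of that separation (rule names, paths, and environment files are hypothetical; run with snakemake --use-conda so each rule gets its own conda environment):

# Each rule runs in its own conda environment, so the FEM code and the
# GP training never share a Python process or a set of binaries.

rule simulate:
    output:
        "results/fem_output.csv"
    conda:
        "envs/fenicsxconcrete.yaml"  # environment with fenicsxconcrete/dolfinx
    script:
        "scripts/run_simulation.py"

rule train_gp:
    input:
        "results/fem_output.csv"
    output:
        "results/gp_model.pkl"
    conda:
        "envs/sklearn.yaml"  # environment with scikit-learn only
    script:
        "scripts/train_gp.py"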

danielandresarcones commented 1 year ago

Apparently it is due to a known problem in DolfinX that can be triggered by different packages, such as matplotlib. It appears when something is installed through pip while DolfinX was installed through conda. In my case, despite installing everything fresh through conda for the MWE, sklearn was using some package that I must have installed with pip at some point (probably scipy, which I installed locally with probeye through pip). Removing every package installed through pip and force-reinstalling sklearn through conda seems to have solved the issue in the MWE; I am now working on fixing my real environment. There is not much we can do about it, but apparently it is something to be aware of when working with dolfinx.
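
A quick way to spot such strays is to list which distributions in the active environment were installed by pip. This is only a sketch: it relies on pip writing "pip" into each package's INSTALLER metadata file, while conda-managed packages usually report something else or omit the file.

from importlib import metadata

# Print every distribution in the current environment that pip installed.
for dist in metadata.distributions():
    installer = (dist.read_text("INSTALLER") or "").strip()
    if installer == "pip":
        print(dist.metadata["Name"], dist.version)

Alternatively, conda list marks pip-installed packages with pypi in its channel column, which makes them easy to pick out before force-reinstalling through conda.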

joergfunger commented 1 year ago

What I do not understand is why there are packages required for dolfinX that are not installed via conda. Usually, we install pip dependencies only afterwards, so this should not happen.

danielandresarcones commented 1 year ago

DolfinX and its packages were installed via conda, and only afterwards did I install the pip dependencies. The problem came from scikit-learn having dependencies installed via pip even though it was itself installed through conda (because of an earlier install in base that I didn't properly remove, or something similar), and those builds were not the ones present in dolfinX's environment. I could not find any extra information apart from the post I linked, but it seems to be some conflict over which part of memory conda reserves for running dolfinX and which part pip reserves for the other packages.

For me it was only triggered when a C++ part of the package was called, so if I had to guess, I would say that dolfinX reserves "the first block of memory" for C++ routines at some point (through importing/initializing PETSc), but that reservation is not seen by the pip packages due to some environment mismanagement. Then, when the pip package (sklearn in my case) tries to use a C++ module, it looks for the same block of memory and finds it already reserved despite not having registered it in its environment, triggering the segmentation fault. This is just a guess; I am by far not that experienced with these low-level interactions.