FEniCS / dolfinx

Next generation FEniCS problem solving environment
https://fenicsproject.org
GNU Lesser General Public License v3.0

Unresolved PETSc memory leaks in `dolfinx.fem` when running in parallel with MPI #3522

Closed moonlitfjords closed 1 week ago

moonlitfjords commented 1 week ago

I have previously run into some of the memory leaks caused by PETSc objects not being garbage collected, and was able to work around them using the manual-destruction suggestions made in discussions here.
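(For reference, the kind of manual destruction I mean looks roughly like this; the objects below are purely illustrative stand-ins created directly through petsc4py, not my actual code:)

from petsc4py import PETSc

# Purely illustrative: a small PETSc vector and a KSP solver created directly
# through petsc4py.
b = PETSc.Vec().createSeq(10)
ksp = PETSc.KSP().create()

# ... use the objects ...

# Explicitly release the underlying PETSc/MPI resources instead of relying on
# Python's garbage collector to get around to it.
ksp.destroy()
b.destroy()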

However, I've now come across leaks akin to those discussed in #2552 and #2559, which seem to break multiprocessing with the version of petsc4py currently being used. These manifested initially as `yaksa: x leaked handle pool objects` warnings. I haven't been able to completely isolate their source, but I've tracked them down to the creation and manipulation of dolfinx.fem objects, which appear to create (and fail to destroy) a number of different MPI objects.

This occurs exclusively when running with mpirun across multiple cores: if several tasks run in parallel, each on a single core, there is no problem, but as soon as one or more instances are distributed across multiple processes, the memory leaks appear.

After rebuilding MPI with additional debugging flags, I've included below the output from one of the cores upon termination of the script. The objects in question all arise from MPI-related files: COMM objects first appear upon calling fem.functionspace; fem.Function calls then create ATTR and KEYVAL objects as well; GROUP and REQUEST objects are (I think) associated with fem.Form objects; and finally DATATYPE objects are created by vector/matrix manipulations within fem.petsc.

It is the DATATYPE objects in particular that get flagged by yaksa, and adding extra gc.collect() calls does not resolve the problem.
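(To illustrate, the extra collection calls were along the following lines; the PETSc.garbage_cleanup call is included for illustration, since as far as I understand recent petsc4py releases provide it for flushing PETSc's internal garbage list:)

import gc

from mpi4py import MPI
from petsc4py import PETSc

comm = MPI.COMM_WORLD

# Force Python to collect unreachable objects, then ask petsc4py to clean up
# any PETSc objects queued for collective destruction on this communicator
# (garbage_cleanup is assumed to be available in recent petsc4py versions).
gc.collect()
PETSc.garbage_cleanup(comm=comm)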

I have not determined precisely which version this breaks in, but the behaviour is not present in petsc4py==3.19.6 and earlier, so unless anyone has ideas on how to resolve it, I'd suggest it might be worth rolling back the version provided by default in the various premade Docker images?

In the meantime, for anyone else running into similar issues: manually building a clean Docker image on top of Ubuntu using the apt-get instructions in the installation guide for dolfinx 0.9.0, then installing an older version of petsc4py on top, has allowed me to bypass the problem. Otherwise, a slow build-up of memory eventually causes a crash when running large numbers of simulations (I managed about 50 on a relatively small mesh, using 8 cores in total and 2 cores per simulation, before approaching my machine's capacity).

Apologies if any of this isn't clear or I am missing something important/obvious, but hopefully it generally makes sense.

(Output from mpirun with debug-build of MPI:)

leaked context IDs detected: mask=0x7fdac4fb9440 mask[0]=0x7fff7e7f
leaked context IDs detected: mask=0x7fdac4fb9440 mask[1]=0xfffffdfd
leaked context IDs detected: mask=0x7fdac4fb9440 mask[2]=0x7fffffff

In direct memory block for handle type GROUP, 6 handles are still allocated
In direct memory block for handle type ATTR, 12 handles are still allocated
In indirect memory block 0 for handle type ATTR, 6 handles are still allocated
In direct memory block for handle type KEYVAL, 3 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 210 handles are still allocated
In direct memory block for handle type COMM, 2 handles are still allocated
In indirect memory block 0 for handle type COMM, 12 handles are still allocated
In indirect memory block 0 for handle type DATATYPE, 102 handles are still allocated

[1] 512 at [0x5192250], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x60b6370], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x7eb0ac0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7ea5830], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5b269d0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7ee5790], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7f5dcf0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7f4b890], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5180b80], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7f29e90], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x6077710], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5c96220], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5c96840], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7f23320], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x60759c0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b2d390], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x607aee0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7ef6080], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7f58d90], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b642a0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x606c120], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b30a10], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7ec71c0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5a86040], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7ec5900], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b5f3a0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7f55a40], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7ee68e0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5b3e7f0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x62b49e0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5b3d270], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x72fe560], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7e8cbd0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7f59ab0], ./src/include/mpir_datatype.h[420]
[1] 32 at [0x5ca19a0], src/mpi/group/grouputil.c[77]
[1] 360 at [0x6293990], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x6293dd0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x6293ff0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x6293bb0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x6294210], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x6294430], src/mpi/coll/src/csel.c[698]
[1] 8 at [0x5b21fc0], src/util/mpir_localproc.c[160]
[1] 8 at [0x5c1ac50], src/util/mpir_localproc.c[50]
[1] 64 at [0x7eb17b0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b58cb0], ./src/include/mpir_datatype.h[420]
[1] 512 at [0x7f77340], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x7ea09d0], src/mpi/comm/commutil.c[1380]
[1] 32 at [0x4797f20], src/mpi/group/grouputil.c[77]
[1] 360 at [0x51f2180], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x51f1f60], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x7f42d40], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x7f42b20], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x7eae3f0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x7f09b90], src/mpi/coll/src/csel.c[698]
[1] 8 at [0x7f51170], src/util/mpir_localproc.c[160]
[1] 8 at [0x5b3a860], src/util/mpir_localproc.c[50]
[1] 512 at [0x5b5b9c0], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x457b650], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x71953a0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7ebbe70], ./src/include/mpir_datatype.h[420]
[1] 1024 at [0x628d1d0], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x62ac550], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7307930], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x62a5650], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b960a0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x62a4ce0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7edfd80], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x62a2050], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x51f9380], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7e8e1c0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7e8e0c0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5ed0040], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5ecff40], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5b2e9e0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b2e8e0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7eaf0b0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x7eaefb0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5b617e0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5b324e0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7f62260], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x6072fa0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x606d1e0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x47a4190], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x460f750], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x6070cc0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x7f85200], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x6083bb0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5c1b4a0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x4b29190], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5c1b3b0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x4cf5160], ./src/include/mpir_datatype.h[420]
[1] 32 at [0x62a64d0], src/mpi/group/grouputil.c[77]
[1] 360 at [0x5b32700], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x4cf48e0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x4cf4b00], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x4cf46c0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x4cf4d20], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x4cf4f40], src/mpi/coll/src/csel.c[698]
[1] 8 at [0x7f14a50], src/util/mpir_localproc.c[160]
[1] 8 at [0x47a6870], src/util/mpir_localproc.c[50]
[1] 64 at [0x5e3a1d0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x47a6930], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x46e5b30], src/mpi/comm/commutil.c[1380]
[1] 32 at [0x5ca42d0], src/mpi/group/grouputil.c[77]
[1] 360 at [0x5a89aa0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x607f220], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x607f000], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x7f27a40], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x7f27820], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x7f34800], src/mpi/coll/src/csel.c[698]
[1] 8 at [0x52747c0], src/util/mpir_localproc.c[160]
[1] 8 at [0x6081080], src/util/mpir_localproc.c[50]
[1] 512 at [0x5bed900], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x52e36d0], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x5338050], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5336430], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5334510], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x53329b0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x52ecf30], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x52eb4c0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x57a3570], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x57a48c0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x57a1100], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x579f2e0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x579d6a0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x55200d0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x551e910], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x551ce10], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x551b140], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5517180], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x52003f0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x51943b0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x51925f0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5193f90], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5192500], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x52f6f90], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5181c00], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x534ec90], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x52f4cb0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x52ddad0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x534cab0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x51f9930], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x51f86f0], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x52dd8e0], ./src/include/mpir_datatype.h[420]
[1] 64 at [0x5194810], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x52011a0], ./src/include/mpir_datatype.h[420]
[1] 32 at [0x5c1c190], src/mpi/group/grouputil.c[77]
[1] 360 at [0x52004e0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5200920], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5200b40], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5200700], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5200d60], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5200f80], src/mpi/coll/src/csel.c[698]
[1] 8 at [0x52eef00], src/util/mpir_localproc.c[160]
[1] 8 at [0x51f87e0], src/util/mpir_localproc.c[50]
[1] 64 at [0x5193520], ./src/include/mpir_datatype.h[420]
[1] 80 at [0x5194c90], ./src/include/mpir_datatype.h[420]
[1] 512 at [0x606fc70], src/mpi/comm/commutil.c[1380]
[1] 64 at [0x606bce0], src/mpi/comm/commutil.c[1380]
[1] 32 at [0x607e0f0], src/mpi/group/grouputil.c[77]
[1] 360 at [0x6068e20], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5ca4bc0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5ca49a0], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5c1bf70], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x5c1bd50], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x60852f0], src/mpi/coll/src/csel.c[698]
[1] 8 at [0x5ca3c80], src/util/mpir_localproc.c[160]
[1] 8 at [0x606a2c0], src/util/mpir_localproc.c[50]
[1] 360 at [0x46ec520], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x46ec040], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x46f8a90], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x46c9a70], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x46db390], src/mpi/coll/src/csel.c[698]
[1] 360 at [0x472b320], src/mpi/coll/src/csel.c[698]
[1] 8 at [0x3b663b0], src/util/mpir_localproc.c[160]
[1] 8 at [0x46b25a0], src/util/mpir_localproc.c[50]
[1] 16424 at [0x2e41bf0], src/mpid/ch4/src/ch4_proc.c[164]
[WARNING] yaksa: 102 leaked handle pool objects
garth-wells commented 1 week ago

Can you post code that reproduces the warnings?

moonlitfjords commented 1 week ago

Hi @garth-wells, thanks for getting back to me. I initially just described the behaviour because it was a little entangled in my code, but I think I've now managed to isolate a snippet that reproduces the warnings, so here you go:

from mpi4py import MPI
from petsc4py import PETSc
import ufl
from dolfinx import fem, mesh
import dolfinx.fem.petsc  # explicit import so that fem.petsc is available below
from dolfinx.io import XDMFFile
from pathlib import Path

comm = MPI.COMM_WORLD

xdmf_path = Path(<path-to-mesh>)

with XDMFFile(comm, xdmf_path, "r") as xdmf:
    domain = xdmf.read_mesh(name="Grid")

rho = 7990.0
dx = ufl.Measure("dx", domain)

V = fem.functionspace(domain, ("CG", 1, (domain.geometry.dim,)))        ### COMM handles first introduced

u_ = ufl.TestFunction(V)
a_ = ufl.TrialFunction(V)

m_form = rho*ufl.inner(a_, u_)*dx

ones_a = fem.Function(V)
lumped_m_form = fem.form(ufl.action(m_form, ones_a))        ### ATTR and KEYVAL handles first introduced
ones_a.x.array[:] = 1.

M_inv_petsc = fem.petsc.assemble_vector(lumped_m_form)      ### GROUP and DATATYPE objects first introduced ---> DATATYPEs correspond to the yaksa leaked handle pool objects

There is of course more going on in and around all of this, but as a simpler example this seems to demonstrate most of the behaviour, which then accumulates and becomes more evident as more is added. The script is simply run with `mpirun -np <some-number-greater-than-1> python3 <script_name>.py`.

(Note: here the script just runs straightforwardly under mpirun. If comm rank checks are added and the list of files is split into subgroups so that each core processes its own distinct set of meshes, the problem does not manifest; but as soon as a comm.Split() is added so that each subgroup of tasks is shared across a subset of the cores, it reappears. A rough sketch of that splitting is included below.)
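Illustrative only (the group size and the mapping of tasks onto sub-communicators below are stand-ins, not my actual distribution code):

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Split COMM_WORLD into sub-communicators of, say, 2 ranks each so that each
# pair of ranks works through its own subset of meshes together.
ranks_per_group = 2
color = comm.rank // ranks_per_group
subcomm = comm.Split(color=color, key=comm.rank)

# ... pass subcomm (instead of COMM_WORLD) to XDMFFile, fem.functionspace, etc. ...

# Free the sub-communicator explicitly once the subgroup's work is done.
subcomm.Free()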

jorgensd commented 1 week ago

Once M_inv_petsc is created, the user is responsible for destroying that object when they are done with it, i.e. M_inv_petsc.destroy(), as documented for some of the create_vector* functions. I see that we missed updating dolfinx.fem.petsc.create_vector.
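For example, with the snippet above, explicitly releasing the assembled vector would look roughly like this (a sketch reusing the names from your code):

M_inv_petsc = fem.petsc.assemble_vector(lumped_m_form)
# ... use M_inv_petsc ...
M_inv_petsc.destroy()  # explicitly release the underlying PETSc/MPI handles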

Could you try using the following modification of the code above:

from mpi4py import MPI
from petsc4py import PETSc
import ufl
from dolfinx import fem, mesh
import dolfinx.fem.petsc  # explicit import so that fem.petsc is available below
from dolfinx.io import XDMFFile

from pathlib import Path

comm = MPI.COMM_WORLD

xdmf_path = Path(<path-to-mesh>)

with XDMFFile(comm, xdmf_path, "r") as xdmf:
    domain = xdmf.read_mesh(name="Grid")

rho = 7990.0
dx = ufl.Measure("dx", domain)

V = fem.functionspace(domain, ("CG", 1, (domain.geometry.dim,)))        ### COMM handles first introduced

u_ = ufl.TestFunction(V)
a_ = ufl.TrialFunction(V)

m_form = rho*ufl.inner(a_, u_)*dx

ones_a = fem.Function(V)
lumped_m_form = fem.form(ufl.action(m_form, ones_a))        ### ATTR and KEYVAL handles first introduced
ones_a.x.array[:] = 1.

M_inv = fem.Function(V)
M_inv.x.array[:] = 0
fem.petsc.assemble_vector(M_inv.x.petsc_vec, lumped_m_form)

since a dolfinx.fem.Function takes responsibility for destroying the PETSc.Vec it owns.

moonlitfjords commented 1 week ago

@jorgensd ah yes, this seems to have eliminated the DATATYPE objects, and hence the yaksa warning. The other objects are still there, but I don't believe they contribute to the memory leaks, and without the additional debugging config I wouldn't even have known about them. I had tried destroying various PETSc objects wherever possible, but in the wider implementation some of them need to stick around for a while, so that wasn't an ideal solution and some were clearly slipping through the cracks. Your suggestion of having the separate M_inv object take responsibility for the destruction, while avoiding that issue, makes sense and appears to fix things.

Thanks a lot, both for the help and for the (unbelievably) speedy response!