Can you post code that reproduces the warnings?
Hi @garth-wells, thanks for getting back to me. I initially tried to just explain it, as it was a little entangled in my code, but I think I've now managed to get a more isolated snippet that reproduces the warnings, so here you go:
```python
from pathlib import Path

from mpi4py import MPI
from petsc4py import PETSc
import ufl

from dolfinx import fem
import dolfinx.fem.petsc  # explicit import so that fem.petsc is available
from dolfinx.io import XDMFFile

comm = MPI.COMM_WORLD

xdmf_path = Path("<path-to-mesh>")
with XDMFFile(comm, xdmf_path, "r") as xdmf:
    domain = xdmf.read_mesh(name="Grid")

rho = 7990.0
dx = ufl.Measure("dx", domain)

V = fem.functionspace(domain, ("CG", 1, (domain.geometry.dim,)))  ### COMM handles first introduced
u_ = ufl.TestFunction(V)
a_ = ufl.TrialFunction(V)
m_form = rho * ufl.inner(a_, u_) * dx

ones_a = fem.Function(V)
lumped_m_form = fem.form(ufl.action(m_form, ones_a))  ### ATTR and KEYVAL handles first introduced
ones_a.x.array[:] = 1.0

M_inv_petsc = fem.petsc.assemble_vector(lumped_m_form)  ### GROUP and DATATYPE objects first introduced ---> DATATYPEs correspond to the yaksa leaked handle pool objects
```
There is of course more going on in and around all of this, but as a simpler example this seems to demonstrate the behaviour, which then accumulates and becomes more evident as more is added. The script is simply run using `mpirun -np <some-number-greater-than-1> python3 <script_name>.py`.

(Note: here it is simply run straightforwardly with `mpirun`, but if comm-rank checks are added and a list of files is split into subgroups so that each core is assigned its own set of distinct meshes to process, the problem does not manifest; as soon as a `comm.Split()` is added to start sharing the subgroups of tasks across a subset of the cores, it reappears. A sketch of that splitting pattern is below.)
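For reference, a minimal sketch of the kind of communicator splitting described above (the group size of 2 and the colour scheme are illustrative assumptions, not the exact code from my setup):

```python
from mpi4py import MPI

world = MPI.COMM_WORLD

# Illustrative assumption: split the world communicator into subgroups of
# 2 ranks each; every subgroup then works through its own list of meshes.
ranks_per_group = 2
color = world.rank // ranks_per_group
sub_comm = world.Split(color=color, key=world.rank)

# Each subgroup would then pass sub_comm to XDMFFile, functionspace, etc.
print(f"world rank {world.rank} -> group {color}, local rank {sub_comm.rank}")

sub_comm.Free()
```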
Once `M_inv_petsc` is created, the user is responsible for destroying that object once they are done with it, i.e. `M_inv_petsc.destroy()` (see the sketch below), as documented for some of the `create_vector*` functions.
I see that we missed updating `dolfinx.fem.petsc.create_vector`.
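To make that concrete, a minimal sketch of the cleanup pattern being described, assuming the assembled vector is only needed temporarily (names reused from the reproduction snippet above):

```python
# Continuing from the reproduction snippet above (ghost updates omitted for brevity).
M_inv_petsc = fem.petsc.assemble_vector(lumped_m_form)

# ... use M_inv_petsc ...

# Explicitly release the underlying PETSc/MPI resources once it is no longer needed.
M_inv_petsc.destroy()
```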
Could you try using the following modification of the code above:
```python
from pathlib import Path

from mpi4py import MPI
from petsc4py import PETSc
import ufl

from dolfinx import fem
import dolfinx.fem.petsc  # explicit import so that fem.petsc is available
from dolfinx.io import XDMFFile

comm = MPI.COMM_WORLD

xdmf_path = Path("<path-to-mesh>")
with XDMFFile(comm, xdmf_path, "r") as xdmf:
    domain = xdmf.read_mesh(name="Grid")

rho = 7990.0
dx = ufl.Measure("dx", domain)

V = fem.functionspace(domain, ("CG", 1, (domain.geometry.dim,)))  ### COMM handles first introduced
u_ = ufl.TestFunction(V)
a_ = ufl.TrialFunction(V)
m_form = rho * ufl.inner(a_, u_) * dx

ones_a = fem.Function(V)
lumped_m_form = fem.form(ufl.action(m_form, ones_a))  ### ATTR and KEYVAL handles first introduced
ones_a.x.array[:] = 1.0

M_inv = fem.Function(V)
M_inv.x.array[:] = 0
fem.petsc.assemble_vector(M_inv.x.petsc_vec, lumped_m_form)
```
as any vector created with `dolfinx.fem.Function` is responsible for destroying the `PETSc.Vec`.
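For completeness, a sketch of the usual parallel follow-up to that assembly call (this is an assumption about the standard dolfinx pattern, not part of the suggested modification itself):

```python
# After assembling into the Function's PETSc vector, accumulate the
# contributions that landed in ghost entries onto the owning processes...
M_inv.x.petsc_vec.ghostUpdate(addv=PETSc.InsertMode.ADD, mode=PETSc.ScatterMode.REVERSE)

# ...and push the owned values back out to the ghost entries.
M_inv.x.scatter_forward()
```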
@jorgensd ah yes, this seems to have eliminated the DATATYPE objects, and hence the yaksa warning. The others are still there, but I don't believe they are contributing to the memory leaks, and without the additional debugging config I wouldn't even have known about them. I had tried destroying various PETSc objects wherever possible, but in the wider implementation some of them need to stick around for a while, so it wasn't an ideal solution and some were clearly slipping through the cracks. Your suggestion of having the separate `M_inv` object take responsibility for the destruction, while avoiding that issue, makes sense and appears to fix things.
Thanks a lot, both for the help and for the (unbelievably) speedy response!
I have previously come across some of the memory leaks relating to PETSc objects not being garbage collected, and was able to get around them using the manual destruction suggestions made in discussions here.

However, I've now come across some akin to those discussed in #2552 and #2559 which seem to break multiprocessing on the current version of `petsc4py` being used. These manifested initially as `yaksa: x leaked handle pool objects` warnings, and while I haven't been able to completely isolate their source, I've managed to track them down to the creation and manipulation of `dolfinx.fem` objects, which appear to be creating (and failing to destroy) a number of different MPI objects.

This occurs exclusively when running using `mpirun` across multiple cores: if several tasks are running in parallel, each using one core, there is no problem, but as soon as one or more instances are distributed across multiple processes, the memory leaks emerge.

Having rebuilt MPI with additional debugging flags, the output from one of the cores upon termination of the script is included below. The objects in question all arise from MPI-related files: COMM objects first appear upon calling `fem.functionspace`; `fem.Function` calls then create ATTR and KEYVAL objects as well; GROUP and REQUEST objects are (I think) associated with `fem.Form` objects; and finally DATATYPE objects are created by vector/matrix manipulations within `fem.petsc`. It is the DATATYPE objects in particular that get flagged by `yaksa`, and adding in additional `gc.collect()`s does not resolve the problem.

I have not determined precisely which version this becomes broken in, but the behaviour is not present at least within `petsc4py==3.19.6` and earlier, so unless anyone has any ideas on how to resolve this, I'd suggest it might be worth regressing the version provided by default in the various premade Docker images?

In the meantime, for anyone else coming across similar issues, manually building a clean Docker image on top of Ubuntu using the `apt-get` instructions in the installation guide for `dolfinx 0.9.0` and then adding an older version of `petsc4py` has allowed me to bypass the issue (a quick version-check snippet is included below). Otherwise a slow buildup of memory will eventually cause a crash when running large numbers of simulations (I was able to manage about 50 on a relatively small mesh using 8 total cores and 2 cores per simulation before approaching my machine's capacity).

Apologies if any of this isn't clear or I am missing something important/obvious, but hopefully it generally makes sense.
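For anyone pinning an older `petsc4py` as described above, a quick sketch for confirming which petsc4py/PETSc build is actually being picked up at runtime (just a sanity check, not part of the workaround itself):

```python
import petsc4py
from petsc4py import PETSc

# Report the petsc4py package version and the PETSc library it was built against.
print("petsc4py:", petsc4py.__version__)
print("PETSc:   ", ".".join(str(v) for v in PETSc.Sys.getVersion()))
```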
(Output from `mpirun` with debug build of MPI:)