FEniCS / dolfinx

Next generation FEniCS problem solving environment
https://fenicsproject.org
GNU Lesser General Public License v3.0

Reopening the discussion regarding `MPI_Comm_dup` #3061

Open uvilla opened 8 months ago

uvilla commented 8 months ago

Hi there,

I know this issue has already been discussed in several threads, but I would like to better understand the rationale behind using `MPI_Comm_dup` when wrapping the MPI communicator in dolfinx.

With a student of mine, we are finally porting hIPPYlib (https://hippylib.github.io/) to dolfinx (it was overdue), and we quickly ran into the "Too many communicators" issue.

Our code does not store any long list of `dlx.fem.Function` objects, but we do create quite a lot of temporary `dlx.fem.Function` and `dlx.fem.form` instances. All of these should be destroyed when the function returns, but that does not seem to be the case for us.

While this may be a problem with Python's garbage collection, it is still an issue that affects the usability of the library.

So my questions are:

  1. Is duplication of the MPI communicator truly needed?
  2. Could an option not to duplicate the communicator be exposed to the end user, without having to recompile the library?

Originally posted by @uvilla in https://github.com/FEniCS/dolfinx/issues/2308#issuecomment-1954583103

uvilla commented 8 months ago

I tried to write a simple script to reproduce the issue. Unfortunately, I did not observe the MPI communicator duplication issue, but I can see some memory leakage.

from mpi4py import MPI
import dolfinx as dlx
import dolfinx.fem.petsc
import ufl

def run(Vh):
    u = dlx.fem.Function(Vh)
    v = ufl.TestFunction(Vh)
    form = dlx.fem.form( ufl.inner(u,v)*ufl.dx )
    vec = dlx.fem.petsc.assemble_vector(form)    

if __name__=='__main__':
    comm = MPI.COMM_WORLD
    rank = comm.rank

    mesh = dlx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8, dlx.mesh.CellType.quadrilateral)
    V = dlx.fem.FunctionSpace(mesh, ("Lagrange", 1 ) )

    for i in range(2**32-1):
        if (i % 10 == 0) and (rank == 0): print(i)

I ran this snippet using 8 processes. At the beginning the memory allocation was about 300MB per process; after about 1 hour of running it is about 3.5GB per process.

I'm running FEniCS through conda on an Intel macOS laptop (Python 3.10, fenics-libdolfinx 0.7.0 h2cc1089_100).

jorgensd commented 8 months ago

I've been running your script through docker (v0.7.3) for 100% of the runtime (using 3 processes), and I see no increase in memory usage :/

Could you try adding explicit garbage collection, along these lines:

import gc
# ... existing setup code here ...

for i in range(N):
    # Define u, mesh, etc.
    del u, mesh
    gc.collect()

uvilla commented 8 months ago

@jorgensd : Thank you for your prompt reply.

I tried using the docker container as you recommended. The increase in memory is much less significant than with conda, but it is still present.

For example (again with 8 processes), I observed a factor-of-8 increase in memory usage (monitored using the top command) between iteration 100 and iteration 60,000 (the example above had been running for about 4 hours at that point). Memory usage at iteration 100 was about 82MB per process, but it climbed to 665MB by iteration 60,000.

jorgensd commented 8 months ago

from mpi4py import MPI
import dolfinx as dlx
import dolfinx.fem.petsc
import ufl

def run(Vh):
    u = dlx.fem.Function(Vh)
    v = ufl.TestFunction(Vh)
    form = dlx.fem.form( ufl.inner(u,v)*ufl.dx )
    vec = dlx.fem.petsc.assemble_vector(form)

if __name__=='__main__':
    comm = MPI.COMM_WORLD
    rank = comm.rank

    mesh = dlx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8, dlx.mesh.CellType.quadrilateral)
    V = dlx.fem.FunctionSpace(mesh, ("Lagrange", 1 ) )

    for i in range(2**32-1):
        if (i % 10 == 0) and (rank == 0): print(i)

@uvilla I guess you forgot to add run(V) to your loop?

jorgensd commented 8 months ago

With the following code, I do see a steady increase in memory usage:

from mpi4py import MPI
import dolfinx as dlx
import dolfinx.fem.petsc
import ufl
import gc

def run(Vh):
    u = dlx.fem.Function(Vh)
    v = ufl.TestFunction(Vh)
    form = dlx.fem.form( ufl.inner(u,v)*ufl.dx )
    vec = dlx.fem.petsc.assemble_vector(form)   
    del vec
    del u

if __name__=='__main__':
    comm = MPI.COMM_WORLD
    rank = comm.rank
    N = 2**32-1
    mesh = dlx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8, dlx.mesh.CellType.quadrilateral)
    V = dlx.fem.FunctionSpace(mesh, ("Lagrange", 1 ) )
    for i in range(N):
        run(V)    
        if (i % 1e2 == 0) and (rank == 0): print(f"{i}/{N}, {i/N*100:.2e}%", end="\r", flush=True)
        gc.collect()

It starts at around 90 MB; after 10,000 iterations it is at 155 MB and still steadily increasing.

jorgensd commented 8 months ago

@uvilla Changing your code to:

def run(Vh):
    u = dlx.fem.Function(Vh)
    v = ufl.TestFunction(Vh)
    form = dlx.fem.form( ufl.inner(u,v)*ufl.dx )
    vec = dlx.fem.petsc.assemble_vector(form)
    vec.destroy()

or

from mpi4py import MPI
import dolfinx as dlx
import dolfinx.fem.petsc
import ufl

def run(Vh):
    u = dlx.fem.Function(Vh)
    v = ufl.TestFunction(Vh)
    form = dlx.fem.form( ufl.inner(u,v)*ufl.dx )
    vec = dlx.fem.Function(Vh)
    dlx.fem.petsc.assemble_vector(vec.vector, form)

if __name__=='__main__':
    comm = MPI.COMM_WORLD
    rank = comm.rank
    N = 100000
    mesh = dlx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8, dlx.mesh.CellType.quadrilateral)
    V = dlx.fem.FunctionSpace(mesh, ("Lagrange", 1 ) )
    for i in range(N):
        run(V)    
        if (i % 1e2 == 0) and (rank == 0): print(f"{i}/{N}, {i/N*100:.2e}%", end="\r", flush=True)

(as dolfinx.fem.Function destroys any associated PETSc vector upon its own destruction).

The issue is that PETSc doesn't do garbage collection in Python automatically any longer: https://gitlab.com/petsc/petsc/-/issues/1309
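
In practice this means that any petsc4py object you create (or that a DOLFINx routine returns to you) has to be destroyed explicitly. A minimal sketch of the pattern, using a plain petsc4py vector:

from petsc4py import PETSc

vec = PETSc.Vec().createMPI(10)  # stands in for any Vec handed back by dolfinx
# ... use vec ...
vec.destroy()  # must be explicit; Python's garbage collector no longer takes care of it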

uvilla commented 8 months ago

@jorgensd

Thank you very much for your help with this.

I was not aware of that issue with PETSc Vec. This explains a lot of strange things happening in my code.

In hIPPYlib I am often dealing with Vectors that are not always associated with a FunctionSpace; think, e.g., of pointwise observations.

I am wondering if it would be useful to implement a property dolfinx.la.Vector.vector that creates the petsc4py wrapper and then destroys it when the dolfinx.la.Vector goes out of scope.

Thank you :)

jorgensd commented 8 months ago

The dolfinx.la.Vector does not have a .vector property that interfaces with PETSc. It is only the dolfinx.fem.Function that has a .vector property that creates a wrapper around its data compatible with petsc4py: https://github.com/FEniCS/dolfinx/blob/main/python/dolfinx/fem/function.py#L475-L491

uvilla commented 8 months ago

This is in fact what I am proposing.

dolfinx.la.Vector should expose a .vector property with the same implementation as the current dolfinx.fem.Function.vector.
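
A minimal sketch of what I have in mind (hypothetical code; the caching/cleanup pattern and the create_petsc_vector_wrap helper are taken from the Function.vector implementation linked above):

# Hypothetical methods to be added to dolfinx.la.Vector; sketch only.
# Assumes self._petsc_x is initialised to None in __init__, and that
# dolfinx.la.create_petsc_vector_wrap is available as in the linked code.

@property
def vector(self):
    # Create and cache the petsc4py wrapper on first access
    if self._petsc_x is None:
        self._petsc_x = create_petsc_vector_wrap(self)
    return self._petsc_x

def __del__(self):
    # Destroy the cached PETSc wrapper when this Vector is collected
    if self._petsc_x is not None:
        self._petsc_x.destroy()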

Then, to preserve backward compatibility, implement dolfinx.fem.Function.vector as

@property
def vector(self):
    return self.x.vector

I can create a simple PR for this, if you like the proposal.

jhale commented 8 months ago

The current design is consistent with the behaviour of petsc4py: you need to manage memory manually whenever a DOLFINx routine returns a petsc4py object. The same applies in the C++ interface (for PETSc objects); see e.g. https://github.com/FEniCS/dolfinx/blob/main/cpp/dolfinx/la/petsc.h#L49.

We should improve our documentation to make it clear that this is the caller's responsibility.

jorgensd commented 8 months ago

Is there a specific motivation for using the PETSc vector rather than the DOLFINx vector when working with DOLFINx? The DOLFINx one has clear memory management and easy access to ghost data.
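
For example, a sketch of the same assembly using only native DOLFINx objects (assuming the v0.7.x Python API), where no explicit destroy is needed:

import dolfinx.fem
import dolfinx.la

def run_native(form):
    b = dolfinx.fem.assemble_vector(form)         # a dolfinx.la.Vector, not a PETSc Vec
    b.scatter_reverse(dolfinx.la.InsertMode.add)  # accumulate ghost contributions
    return b                                      # cleaned up by Python; no destroy() needed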

garth-wells commented 8 months ago

@uvilla We could add an option to not duplicate the communicator, or, perhaps better, we could attach a new 'library' communicator.

It's poor design for a library to use a user's communicator, because the user or the library could change the communicator, or do something on it that affects the other (see Eijkhout, p. 204). The obvious options for dealing with this are:

  1. Duplicate the communicator whenever a class holds onto a communicator, which is what DOLFINx does. This is simple and memory safe, but you can in some cases exhaust the available communicators.
  2. Attach a 'library' communicator to the user's communicator, and use that communicator inside the library. My understanding is that this is what PETSc does; a sketch of the idea with mpi4py is shown below. The advantage is that it avoids exhausting the number of communicators. The disadvantages are some extra complexity and potential memory issues: if the library attaches a communicator to a communicator that is then destroyed, memory issues arise. We have seen problems with PETSc because of this.
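
A sketch of option 2 using mpi4py's attribute-caching API (illustrative only; this is not how DOLFINx or PETSc is actually implemented):

from mpi4py import MPI

_keyval = MPI.Comm.Create_keyval()

def library_comm(user_comm):
    # Return the 'library' communicator attached to user_comm, creating
    # it on first use: one duplicate per user communicator, no matter
    # how many objects share it.
    lib_comm = user_comm.Get_attr(_keyval)
    if lib_comm is None:
        lib_comm = user_comm.Dup()
        user_comm.Set_attr(_keyval, lib_comm)
    return lib_comm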

With petsc4py, the only robust solution is to call destroy explicitly.

uvilla commented 8 months ago

@garth-wells @jhale and @jorgensd :

Thank you very much for the clear explanation and discussion, and for kindly explaining petsc4py memory management to me.

I do agree with the concept of a library communicator to isolate communications in dolfinx from third-party libraries. However, I also see how the one-communicator-per-object approach may get in the way of many uncertainty-quantification workflows.

A happy compromise would probably be to duplicate the communicator at the level of the mesh object (where the mesh partitioning defines the topology of the neighbors) or at the level of the FunctionSpace (which defines the communication pattern for the FE assembly).

I wonder if having the FunctionSpace, rather than the Function object, build the scatterer (which holds the MPI_Comm) could be a good compromise between robustness and avoiding too many duplications of the communicator. Since FEniCSx uses smart pointers, there should not be any issues with inadvertently destroying the shared communicator/scatterer.

I'll be happy to contribute if anything I am proposing is compatible with your vision of how dolfinx should behave.

garth-wells commented 8 months ago

See #3065.

garth-wells commented 8 months ago

I made a small change in #3085 that avoids communicator duplication for Vectors on a one-rank communicator: if the size of the communicator is one, the Scatterer doesn't create any communicators.

zsmb commented 2 months ago

import os
os.environ['OMP_NUM_THREADS'] = '1'
import dolfinx.mesh
import dolfinx.fem
import dolfinx.io 
from mpi4py import MPI
import gc

comm = MPI.COMM_WORLD
for i in range(3000):
    print(f"{comm.rank}, {i}")
    comm_cp = MPI.COMM_WORLD.Dup()
    msh = dolfinx.mesh.create_rectangle(comm_cp,
                        points=((0.0, 0.0), (1, 1)),
                        n=(int(4), int(4)),
                        cell_type=dolfinx.mesh.CellType.triangle,  # or quadrilateral
                        ghost_mode=dolfinx.mesh.GhostMode.shared_facet
                        )
    msh.comm.Free()
    del msh
    gc.collect()

I am doing a simulation that needs remeshing every few time steps, but the above code fails after 2035 steps on 2 processes because of too many communicators. Is there any way to free the communicators?


Shenly0202 commented 2 months ago

Hi, I am facing exactly the same problem as above. I need to create a series of meshes in my loop, and then this error appears. I think something goes wrong with the creation of the communicator when the mesh is built. Is there any way to release the old communicators, since I won't use the data of the old meshes in later time steps?