A DOLFINx Function contains a la::Vector that has its own scatterer (to manage MPI communication). This scatterer has a neighbourhood communicator.
@IgorBaratta, do you have any idea of how we could rewrite this?
In general, it is not advisable to create dolfinx.fem.Function objects or PETSc matrices/vectors in loops and store all of them, as this increases the memory usage of the problem.
Is there a specific reason for wanting to store the function at every time step in your more complex application? What about outputting them using VTXWriter instead?
Some background: it is not good practice to 'share' MPI communicators since (1) there is no guarantee that someone holding the communicator (which could be an external library) won't modify the communicator, and (2) things can go wrong if two parts of a code attempt non-blocking communication on the same communicator. This is why we duplicate communicators. A different approach, which I believe PETSc uses, is to attach new communicators to the communicator that is passed to a constructor.
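To make the limit concrete, here is a minimal sketch using plain mpi4py (no DOLFINx involved); it only illustrates that duplicated communicators are a finite per-process resource, MPICH for instance allowing roughly 2048 communicator context ids per process:

from mpi4py import MPI

# Each Dup() consumes an MPI context id; keeping many duplicates alive
# eventually aborts with a "Too many communicators" error.
comms = [MPI.COMM_WORLD.Dup() for _ in range(1000)]  # fine: well below the limit

for c in comms:
    c.Free()  # freeing the duplicates returns their context ids to MPI

# comms = [MPI.COMM_WORLD.Dup() for _ in range(4096)]  # would abort under MPICH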
A 'solution' to the reported issue would be to support 'multi-vectors', and by extension 'multi-functions'. We have discussed supporting 'multi-vectors', see https://github.com/FEniCS/dolfinx/issues/1857.
In the context of reduced order modelling and/or machine learning, there is often no sensible way to work around this, at least in the most basic approaches. You need to create a dataset of the solution field over time (and/or parameter), and thus you need to have all of them loaded in the memory at the same time.
You can store the pure numpy arrays local to each process instead, and simply use a single function with a single communicator.
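For instance, something along these lines (a minimal sketch; the solve/update step is left as a placeholder):

import dolfinx
from mpi4py import MPI

msh = dolfinx.mesh.create_interval(MPI.COMM_WORLD, 10, points=(0.0, 1.0))
V = dolfinx.fem.FunctionSpace(msh, ("Lagrange", 1))

u = dolfinx.fem.Function(V)  # a single Function, hence a single set of communicators
snapshots = []
for i in range(2048):
    # ... solve/update into u here ...
    snapshots.append(u.x.array.copy())  # store only the process-local dof values

# To reuse a snapshot later, copy it back into the single Function:
u.x.array[:] = snapshots[0]
u.x.scatter_forward()  # refresh ghost values if they are needed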
Sure, we are already doing that in the meantime, but IMHO that is a workaround, not a solution.
Minimal reproducible example:
import dolfinx
from mpi4py import MPI

comm = MPI.COMM_WORLD
msh = dolfinx.mesh.create_interval(comm, 10, points=(0.0, 1.0))
V = dolfinx.fem.FunctionSpace(msh, ("Lagrange", 1))
sols = list()
for i in range(2048):
    print(i)
    u = dolfinx.fem.Function(V)
    sols.append(u)
Output:
Other MPI error, error stack:
internal_Dist_graph_create_adjacent(125).: MPI_Dist_graph_create_adjacent(comm=0x84000006, indegree=0, sources=(nil), sourceweights=0x7f3411b9ea74, outdegree=0, destinations=(nil), destweights=0x7f3411b9ea74, MPI_INFO_NULL, reorder=0, comm_dist_graph=0x7fffd29c2d18) failed
MPIR_Dist_graph_create_adjacent_impl(325):
MPII_Comm_copy(830)......................:
MPIR_Get_contextid_sparse_group(591).....: Too many communicators (0/2048 free on this process; ignore_id=0)
Abort(403268367) on node 0 (rank 0 in comm 416): application called MPI_Abort(comm=0x84000006, 403268367) - process 0
As @jorgensd mentioned, the issue comes from the fact that when we call dolfinx.fem.Function(V), the default constructor of the la::Vector is called, creating a new scatterer, which in turn creates its own neighbourhood communicator.
However, instead of creating a new function with dolfinx.fem.Function(V), one can call function.copy(). In this case the scatterer is shared between functions (and their respective vectors).
import dolfinx
from mpi4py import MPI

comm = MPI.COMM_WORLD
msh = dolfinx.mesh.create_interval(comm, 10, points=(0.0, 1.0))
V = dolfinx.fem.FunctionSpace(msh, ("Lagrange", 1))
sols = list()
u = dolfinx.fem.Function(V)
for i in range(2048):
    print(i)
    un = u.copy()
    sols.append(un)
@francesco-ballarin, do you think this is an acceptable solution? Alternatively, we can expose a more verbose way of creating functions in Python.
Thanks @IgorBaratta. It would be great if there were a way (even a more verbose one) to create functions that could account for this case. I can try to use .copy() as a workaround, but if there were a long-term fix that would be great!
I think a more verbose alternative could be:
V = dolfinx.fem.FunctionSpace(msh, ("Lagrange", 1))
map = V.dofmap.index_map
scatterer = dolfinx.common.Scatterer(map, bs)  # add this wrapper
for i in range(2048):
    x = dolfinx.la.vector(scatterer)  # create a new constructor
    solution = dolfinx.fem.Function(V, x=x, dtype=np.float64)
The scatterer is explicitly created and then passed to the vector ...
But this is basically what the .copy() function is doing under the hood.
I am closing this since we are happy with using dolfinx.fem.Function.copy(), and thus we will not need the more verbose alternative after all. Thanks @IgorBaratta for the suggestion!
Just to report that we had a user in https://fenicsproject.discourse.group/t/error-with-mpi-in-transient-simulation/13182 encountering this.
In their case, they cannot really use the workaround with dolfinx.fem.Function.copy() because the mesh changes at every iteration. The minimal example corresponding to their case would be along the lines of
import mpi4py.MPI
import dolfinx

sols = list()
for i in range(2**17):
    print(i)
    msh = dolfinx.mesh.create_unit_interval(mpi4py.MPI.COMM_WORLD, 10)
    V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
    u = dolfinx.fem.Function(V)
    sols.append(u)
where the definition of msh inside the for loop mimics the fact that the mesh is changing.
I am not reopening this because I think that this use case is very rare, and not worth overcomplicating the interface for. Still, I am reporting it here for reference for future users/developers, or in case any of the developers has a differing opinion on whether this issue should be reopened.
I had a similar problem, where I was loading ~60k saved meshes, interpolating a Function on each mesh, and saving the values. At first, I tried manually finding all the relevant MPI.Comm's, and calling .Free() on each of them. I definitely made progress this way (my script was able to process maybe 4 times as many elements before crashing), but then I stumbled onto this solution (roughly), which lets me load all 60k meshes and function data:
...
V = dolfinx.fem.functionspace(...)
u = dolfinx.fem.Function(V)
temp = np.load(fn)  # load the saved dof values into a plain numpy array first
u.x.array[:] = temp
...
If instead I perform the load on a single line:
u.x.array[:] = np.load(fn)
I get the "too many communicators" error. Hope this is helpful to anyone facing this issue.
Hi there,
I know this issue has already been discussed in several threads, but I would like to better understand the rationale behind using MPI_Comm_dup when wrapping the MPI comm in dolfinx.
With a student of mine, we are finally (it was overdue) porting hIPPYlib (https://hippylib.github.io/) to dolfinx, and we are quickly running into the Too many communicators issue.
Our code does not store any long list of dlx.fem.Function, but we do create quite a lot of temporary dlx.fem.Function and dlx.fem.form instances. All of these should be destroyed when the enclosing function returns, but that does not seem to be the case for us.
While this may be a problem with garbage collection in Python, it is still an issue that affects the usability of the library.
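As a quick check (a minimal sketch, not our actual code; the names below are illustrative), one can test whether Python garbage collection is what keeps the temporaries, and hence their communicators, alive:

import gc
import dolfinx
from mpi4py import MPI

msh = dolfinx.mesh.create_unit_interval(MPI.COMM_WORLD, 10)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))

def do_some_work():
    # Temporary Function: its communicators should be released once the
    # object is garbage collected after the function returns.
    w = dolfinx.fem.Function(V)
    return float(w.x.array.sum())

for i in range(100_000):
    do_some_work()
    if i % 1000 == 0:
        gc.collect()  # break reference cycles that may keep temporaries alive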
So my questions are:
Hi, while working on a time-dependent problem with @AnnaSanfilippo, we encountered some segmentation faults that might be related to the duplication of MPI communicators.
I have slightly changed the Poisson demo to reproduce this.
Notice that the last line is commented out:
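The modified script itself is not included here; schematically, and assuming the commented "last line" is the one that appends each solution Function to a list, the loop has the following form (the variational forms below are only placeholders):

import ufl
import dolfinx
import dolfinx.fem.petsc
from mpi4py import MPI

msh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 8, 8)
V = dolfinx.fem.functionspace(msh, ("Lagrange", 1))
u, v = ufl.TrialFunction(V), ufl.TestFunction(V)
# Placeholder forms: a reaction-diffusion problem, which is well posed
# without boundary conditions, stands in for the actual demo.
a = (ufl.inner(u, v) + ufl.inner(ufl.grad(u), ufl.grad(v))) * ufl.dx
L = v * ufl.dx

sols = []
for i in range(70_000):
    problem = dolfinx.fem.petsc.LinearProblem(a, L, bcs=[])
    uh = problem.solve()
    # sols.append(uh)  # the "last line": storing every solution Function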
With the last line commented out:
- with openmpi 4.1.4, the code runs correctly for 70 000 iterations;
- with the dolfinx docker image (mpich 4.0.2), the code runs correctly for 70 000 iterations.
When uncommenting the last line:
- with openmpi 4.1.4, the code runs for 32 755 (roughly 2^15) iterations, and then gives the error;
- with the dolfinx docker image (mpich 4.0.2), the code runs for 1016 (roughly 2^10) iterations, and then terminates with a segfault.
Why would storing many dolfinx.fem.Function objects cause MPI-related segfaults?