Same issue on main with 3 processes and:
```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 2, 2)
V = dolfinx.fem.functionspace(mesh, ("DG", 1, (2, )))
u_vec = dolfinx.fem.Function(V, name="u")
u_vec.interpolate(lambda x: np.stack((x[0], x[1])))
u_vec.x.scatter_forward()

u1 = u_vec.sub(0).collapse()
u1.x.scatter_forward()
u1.name = "u1"

# Fails in parallel with odd ghosting issues
with dolfinx.io.VTXWriter(
        mesh.comm, "solution.bp", [u_vec, u1], engine="BP4") as vtx:
    vtx.write(0.0)

# Works fine
with dolfinx.io.VTXWriter(
        mesh.comm, "solution_u1.bp", [u1], engine="BP4") as vtx:
    vtx.write(0.0)
```
Seems like a switch in cells (see the attached ParaView screenshot from 2023-12-13).
Also fails with 3 processes and Lagrange:
```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(
    MPI.COMM_WORLD, 2, 2, ghost_mode=dolfinx.mesh.GhostMode.none)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1, (2, )))
u_vec = dolfinx.fem.Function(V, name="u")
u_vec.interpolate(lambda x: np.stack((x[0], x[1])))
u_vec.x.scatter_forward()

u1 = u_vec.sub(0).collapse()
u1.x.scatter_forward()
u1.name = "u1"

# Fails in parallel with odd ghosting issues
with dolfinx.io.VTXWriter(
        mesh.comm, "solution.bp", [u_vec, u1], engine="BP4") as vtx:
    vtx.write(0.0)

# Works fine
with dolfinx.io.VTXWriter(
        mesh.comm, "solution_u1.bp", [u1], engine="BP4") as vtx:
    vtx.write(0.0)
```
To me it looks like it relates to the fact that the ghosts are ordered differently in the collapsed space. Is this on purpose, @garth-wells @chrisrichardson @jpdean any comments?
```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(
    MPI.COMM_WORLD, 2, 2, ghost_mode=dolfinx.mesh.GhostMode.none)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1, (2, )))
V0, _ = V.sub(0).collapse()
print(V.dofmap.index_map.ghosts,
      V0.dofmap.index_map.ghosts)
```
Output (one line per rank):
```
[1] [1]
[5 7 6 0] [0 5 7 6]
[6] [6]
```
From @jorgensd: "To me it looks like it relates to the fact that the ghosts are ordered differently in the collapsed space. Is this on purpose, @garth-wells @chrisrichardson @jpdean, any comments?"

I think we agreed in https://github.com/FEniCS/dolfinx/issues/2881 that that was on purpose.
I agree that in the case of using `create_sub_map`, it doesn't make sense to have that special handling. However, I find it strange that collapsing a function space reorders the dofs, when we explicitly do not want to re-order all indices when creating such a submap.
https://github.com/FEniCS/dolfinx/blob/main/cpp/dolfinx/fem/DofMap.cpp#L26-L83
The function states that it doesn't reorder the dofmap; however, that isn't true, as the ghost indices are reordered.
To me, this suggests that calling `create_sub_map` is way overkill for collapsing a blocked function space (as you then retain all ghost degrees of freedom). I guess that for proper mixed spaces it will be trickier to find the subset of indices and ghosts, and some reordering would be expected. I still wonder whether, even in that case, we could do this without any communication, since all local dofs are retained in the sub-space and, similarly, all ghosts are kept (in the sub-space).
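As a quick sanity check of that claim, here is a sketch based on the snippets above and consistent with the 3-process output printed earlier: the ghosts of the collapsed index map are the same global indices as the parent's block-wise ghosts, just in a different order.

```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(
    MPI.COMM_WORLD, 2, 2, ghost_mode=dolfinx.mesh.GhostMode.none)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1, (2, )))
V0, _ = V.sub(0).collapse()

parent_ghosts = V.dofmap.index_map.ghosts  # block-wise ghosts of the parent space
sub_ghosts = V0.dofmap.index_map.ghosts    # ghosts of the collapsed space

# Same set of global indices, possibly permuted (cf. the printed output above)
assert np.array_equal(np.sort(parent_ghosts), np.sort(sub_ghosts))
print(MPI.COMM_WORLD.rank, parent_ghosts, sub_ghosts)
```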
It's not clear to me that the ghost ordering is the problem. Re-mapping takes place when copying entries from the vector at https://github.com/FEniCS/dolfinx/blob/227f1d3cdd52a7b08a5b01c37f395ecb0d9fdb7d/cpp/dolfinx/fem/Function.h#L126.
As the issue shows, they are copied over in the correct order, i.e. if you only write the collapsed function to file, the ordering is correct. However, if you write the parent space and the collapsed sub-space to the same file, we see that the dofmaps are not the same, even though the finite elements are the same (up to block size) and the dof layout is the same for both spaces.
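To illustrate the first point, here is a sketch under the same assumptions as the MWEs above: the entries of the collapsed function match a direct interpolation into the collapsed space, so the values are re-mapped correctly even though the ghost ordering of the index maps differs.

```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 2, 2)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1, (2, )))
u_vec = dolfinx.fem.Function(V)
u_vec.interpolate(lambda x: np.stack((x[0], x[1])))
u_vec.x.scatter_forward()

# Collapse the first component
u1 = u_vec.sub(0).collapse()
u1.x.scatter_forward()

# Interpolate the same expression directly into the collapsed space
ref = dolfinx.fem.Function(u1.function_space)
ref.interpolate(lambda x: x[0])
ref.x.scatter_forward()

# Expected to print True on every rank, consistent with the "works fine"
# case above where only the collapsed function is written to file
print(MPI.COMM_WORLD.rank, np.allclose(u1.x.array, ref.x.array))
```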
You can also see this with the PR that I have made, as it resolves the issues described here.
The root problem still hasn't been properly identified. Is it that `VTXWriter` requires all functions to have the same dofmap, and in the MWE the maps are not the same?
Yes, `VTXWriter` assumes that the dofmaps are the same (as there is a 1-1 map between the mesh geometry dofs and the function dofs). This was previously the case for blocked spaces, but it changed when `create_submap` was introduced in `dofmap.collapse()`.
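A small check of this claim (a sketch, assuming `dofmap.list` returns the cell-wise dof indices as in the DOLFINx Python API used above): comparing the parent's block-wise dofmap with the collapsed dofmap shows they differ in their ghost entries on some ranks.

```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(
    MPI.COMM_WORLD, 2, 2, ghost_mode=dolfinx.mesh.GhostMode.none)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1, (2, )))
V0, _ = V.sub(0).collapse()

# Cell-wise dofs: block indices for the parent space, plain indices for the
# collapsed space. With identical ghost ordering these arrays would be equal;
# with 3 processes, at least one rank is expected to print False
# (cf. the ghost output earlier in the thread).
same = np.array_equal(V.dofmap.list, V0.dofmap.list)
print(f"rank {MPI.COMM_WORLD.rank}: dofmaps identical: {same}")
```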
There are three ways of "fixing" this:

Option 1: Change `create_submap`, as I've done in #2930.
Option 2: Do not allow sub-spaces of blocked spaces in the same `VTXWriter` (this has the drawback that one would have to save the mesh N times instead of once).
Option 3: Remove `create_submap` from `dofmap.collapse()` and go back to a less general implementation that keeps the order of the ghosts.
The documentation specifies that the elements for the different functions must be the same, whereas the implementation requires that the element and the dofmap are the same. Immediate issues to be fixed are:

- Raise an error in `VTXWriter` if the elements and/or dofmaps are not the same.

A second step could be:

- Support cases with the same element but different dofmaps.

It is best if this class imposes as few requirements as reasonably possible on the arguments.
This means writing the mesh twice, or computing the permutation/mapping from one dofmap to the other, which I am not sure is worthwhile complexity.
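For reference, a sketch of what such a permutation could look like, assuming only what the snippets above already use (both spaces share the cell-wise dof layout, so the two cell dofmaps determine the map between the numberings); this is illustrative, not a proposed implementation:

```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 2, 2)
V = dolfinx.fem.functionspace(mesh, ("Lagrange", 1, (2, )))
V0, _ = V.sub(0).collapse()

imap0 = V0.dofmap.index_map
perm = np.empty(imap0.size_local + imap0.num_ghosts, dtype=np.int32)
# Every dof (owned and ghost) appears in at least one local cell, so the
# cell-wise dofmaps pin down the map from collapsed indices to parent block indices
perm[V0.dofmap.list.flatten()] = V.dofmap.list.flatten()
print(MPI.COMM_WORLD.rank, perm)
```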
I still think this is something that should be fixed in the collapsed function space code. It is silly that we keep the same numbering for the local dofs, but not for the ghosts.
Summarize the issue
Using `VTXWriter` to output a DG function and one of its collapsed sub-functions yields incorrect data in parallel.
How to reproduce the bug
The following MWE and attached ParaView rendering illustrate the issue. The image is produced after running with 5 MPI processes.
Minimal Example (Python)
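A reproducer consistent with the summary above (essentially the DG snippet from the top of the thread); run with 5 MPI processes, e.g. `mpirun -n 5`:

```python
import numpy as np
from mpi4py import MPI
import dolfinx

mesh = dolfinx.mesh.create_unit_square(MPI.COMM_WORLD, 2, 2)
V = dolfinx.fem.functionspace(mesh, ("DG", 1, (2, )))
u_vec = dolfinx.fem.Function(V, name="u")
u_vec.interpolate(lambda x: np.stack((x[0], x[1])))
u_vec.x.scatter_forward()

u1 = u_vec.sub(0).collapse()
u1.x.scatter_forward()
u1.name = "u1"

# Writing the parent function and its collapsed sub-function together
# produces incorrect data in parallel
with dolfinx.io.VTXWriter(
        mesh.comm, "solution.bp", [u_vec, u1], engine="BP4") as vtx:
    vtx.write(0.0)
```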
Output (Python)
No response
Version
0.7.2
DOLFINx git commit
No response
Installation
From source
Additional information