BUG: VTKFile fails in parallel after creating a mesh with mpi.COMM_SELF in a single process

diego-hayashi commented 2 weeks ago

Describe the bug When running in parallel, after creating a mesh in process 0 with mpi.COMM_SELF, if VTKFile is used for another mesh in mpi.COMM_WORLD, the generation of the file results in missing data, according to ParaView.

Steps to Reproduce

Download the .msh file from https://github.com/firedrakeproject/firedrake/blob/master/docs/notebooks/stokes-control.msh.

Create a file (test.py) with the following content:

from firedrake import *
mesh = Mesh('stokes-control.msh', comm = mpi.COMM_WORLD)
if mpi.COMM_WORLD.Get_rank() == 0:
mesh_serial = Mesh('stokes-control.msh', comm = mpi.COMM_SELF)
U = FunctionSpace(mesh, 'CG', 1)
u = Function(U)
VTKFile('u.pvd').write(u)

Then, run it as:
```
$ mpiexec -n 2 python -u test.py
```

Expected behavior The generated u.pvd file should have been generated such that ParaView does not say that there is missing data, and the u variable should have been plotted with all zero values.

Error message The Firedrake code does not show any error message. However, the error message from ParaView after opening the .pvd file is that a piece point data array was not found:

(   6.404s) [paraview        ]  vtkXMLPDataReader.cxx:359    ERR| vtkXMLPUnstructuredGridReader (0x5595cf53fed0): Piece point data array function_3 not found

Also, half of the plot in ParaView (in the case of 2 processes) shows an odd behavior, rather than showing the value 0 consistently.

Environment:

OS: Linux
Python version: 3.11.6
ParaView version: 5.11.1

Additional Info If the code from step 2 is changed to generate the mpi.COMM_SELF mesh in both processes (which is not what we want), rather than just in process 0, everything seems to work fine.

stephankramer commented 2 weeks ago

I think the issue is that you didn't name your function. If you don't specify a name, firedrake gives it some auto-generated function name 'function_XX' - but because your code-paths are not the same the numbering for these are out-of-sync, and so on one process it write a function under the name function_03 and on the other it's written as function_04 and that's how it ends up in the respective .vtus. In the pvtu however it writes whatever process 0 thinks the name is. If you simply change to:

u = Function(U, name='foo')

thing seem to work as intended.

diego-hayashi commented 2 weeks ago

It seems to work fine when you name the function beforehand, as you suggested. Nonetheless, it still feels like the automatic function naming should consider the communicator, since there are cases such as:

VTKFile('mesh.pvd').write(mesh)

in which Firedrake internally creates an auxiliary function, which we have no access to, in order to help plotting the mesh, and the error happens once again.

wence- commented 2 weeks ago

It seems to work fine when you name the function beforehand, as you suggested. Nonetheless, it still feels like the automatic function naming should consider the communicator, since there are cases such as:
VTKFile('mesh.pvd').write(mesh)
in which Firedrake internally creates an auxiliary function, which we have no access to, in order to help plotting the mesh, and the error happens once again.

That is a good suggestion. The way to handle this is to adapt _new_uid https://github.com/firedrakeproject/firedrake/blob/cb77d32ac8559d8d57fe0ef0efe251d9cbaf2309/firedrake/utils.py#L23

It should take a communicator (since it's only ever called in a context where a communicator is available), pull out the firedrake "internal" communicator (cc @JDBetteridge), and use that to key an id generation. This can be done by attaching an attribute to the communicator that is the uid counter. Since the calls to _new_uid are collective over a communicator, pulling the uid out of the communicator attribute and updating it will be safe in this scenario.

connorjward commented 2 weeks ago

As this is a slight tangent to the initial reported bug @diego-hayashi could you create a new issue for this?

wence- commented 2 weeks ago

As this is a slight tangent to the initial reported bug @diego-hayashi could you create a new issue for this?

Done.

firedrakeproject / firedrake

BUG: VTKFile fails in parallel after creating a mesh with mpi.COMM_SELF in a single process #3628