firedrakeproject / firedrake

Firedrake is an automated system for the portable solution of partial differential equations using the finite element method (FEM)
https://firedrakeproject.org

Interface error during extension for substructuring DD preconditioners #2485

Open nabw opened 2 years ago

nabw commented 2 years ago

Hi firedrakers,

I have been working with my colleagues on an extension of Firedrake for the BDDC preconditioner, which requires MATIS matrices. For this, the allocation procedure in PyOP2 has been changed to the following:

if self.mat_type == 'is':
    mat.createIS(((self.nrows, None), (self.ncols, None)),
                 lgmapr=rlgmap, lgmapc=clgmap,
                 comm=self.comm)
    mat.setISPreallocation(self.sparsity.nnz, self.sparsity.onnz)
    mat.setUp()
else:
    mat.createAIJ(size=((self.nrows, None), (self.ncols, None)),
                  nnz=(self.sparsity.nnz, self.sparsity.onnz),
                  bsize=1,
                  comm=self.comm)
    mat.setLGMap(rmap=rlgmap, cmap=clgmap)
self.handle = mat

in both the init_monolithic() and init_block() functions. Although the code seems to yield a correct type and allocation, some things look strange. As a test, I have modified the Helmholtz demo as follows:

from firedrake import *
mesh = UnitSquareMesh(10, 10)
V = FunctionSpace(mesh, "CG", 1)
u = TrialFunction(V)
v = TestFunction(V)
f = Function(V)
x, y = SpatialCoordinate(mesh)
f.interpolate((1+8*pi*pi)*cos(x*pi*2)*cos(y*pi*2))
a = (inner(grad(u), grad(v)) + inner(u, v)) * dx
L = inner(f, v) * dx
sol = Function(V)

solve(a == L, sol, solver_parameters={'ksp_type': 'cg', 'pc_type': 'none', 'ksp_monitor':None, 'log_view':None, 'mat_type':'is'})

which gives the following error (PETSc compiled with debug):

  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/barnafi/firedrake/src/firedrake/firedrake/adjoint/variational_solver.py", line 90, in wrapper
    out = solve(self, **kwargs)
  File "/home/barnafi/firedrake/src/firedrake/firedrake/variational_solver.py", line 278, in solve
    self.snes.solve(None, work)
  File "PETSc/SNES.pyx", line 590, in petsc4py.PETSc.SNES.solve
  File "PETSc/petscsnes.pxi", line 354, in petsc4py.PETSc.SNES_Jacobian
  File "/home/barnafi/firedrake/src/firedrake/firedrake/solving_utils.py", line 454, in form_jacobian
    ctx._assemble_jac()
  File "/home/barnafi/firedrake/src/firedrake/firedrake/assemble.py", line 406, in assemble
    return self.result
  File "/home/barnafi/firedrake/src/firedrake/firedrake/assemble.py", line 583, in result
    self._tensor.M.assemble()
  File "/home/barnafi/firedrake/src/PyOP2/pyop2/types/mat.py", line 897, in assemble
    self.handle.assemble()
  File "PETSc/Mat.pyx", line 1178, in petsc4py.PETSc.Mat.assemble
petsc4py.PETSc.Error: error code 73
[0] SNESSolve() at /home/barnafi/firedrake/src/petsc/src/snes/interface/snes.c:4791
[0] SNESSolve_KSPONLY() at /home/barnafi/firedrake/src/petsc/src/snes/impls/ksponly/ksponly.c:42
[0] SNESComputeJacobian() at /home/barnafi/firedrake/src/petsc/src/snes/interface/snes.c:2867
[0] MatAssemblyBegin() at /home/barnafi/firedrake/src/petsc/src/mat/interface/matrix.c:5525
[0] MatAssemblyBegin_IS() at /home/barnafi/firedrake/src/petsc/src/mat/impls/is/matis.c:2744
[0] MatAssemblyBegin() at /home/barnafi/firedrake/src/petsc/src/mat/interface/matrix.c:5516
[0] Object is in wrong state
[0] Must call MatXXXSetPreallocation(), MatSetUp() or the matrix has not yet been factored on argument 1 "mat" before MatAssemblyBegin()

For a clearer picture, I added a line to the PyOP2 mat.py file to print the matrix type, and it prints twice, so _assemble_jac() is in fact being called twice, with the error raised on the second call. I then modified the script to use only pre-assembled objects, in order to avoid the many hooks and calls from the nonlinear solver, so that the solution procedure now looks like this:

A = assemble(a)
b = assemble(L)
solve(A, sol, b, solver_parameters={'ksp_monitor':None, 'pc_type': 'nn', 'mat_type':'is'})

I use the 'nn' preconditioner because otherwise things seemed to work, but the matrix type wasn't actually being used. After this, the error is the following:

[0] KSPSolve() at /home/barnafi/firedrake/src/petsc/src/ksp/ksp/interface/itfunc.c:1078
[0] KSPSolve_Private() at /home/barnafi/firedrake/src/petsc/src/ksp/ksp/interface/itfunc.c:843
[0] KSPSetUp() at /home/barnafi/firedrake/src/petsc/src/ksp/ksp/interface/itfunc.c:407
[0] PCSetUp() at /home/barnafi/firedrake/src/petsc/src/ksp/pc/interface/precon.c:993
[0] PCSetUp_NN() at /home/barnafi/firedrake/src/petsc/src/ksp/pc/impls/is/nn/nn.c:23
[0] PCISSetUp() at /home/barnafi/firedrake/src/petsc/src/ksp/pc/impls/is/pcis.c:136
[0] Invalid argument
[0] Requires preconditioning matrix of type MATIS

so in fact the mat_type is not being read. At this point I see two issues:

i) the interfaces for solving a linear problem are not consistent, and
ii) the _assemble_jac routine is being called twice, and in between it does something to the matrix that (apparently) breaks its allocation.

I'm not sure how to proceed, as the modification we have made looks quite innocent, yet it is not being propagated correctly. Any ideas are welcome; thanks in advance!

Best, NB

connorjward commented 2 years ago
A = assemble(a)
b = assemble(L)
solve(A, sol, b, solver_parameters={'ksp_monitor':None, 'pc_type': 'nn', 'mat_type':'is'})

The first thing I immediately notice is that assemble(a) will give you back a matrix whose mat_type is already set (to the default), so the 'mat_type' entry in solver_parameters is ignored. You should try

A = assemble(a, mat_type="is")

to force the right one.
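
Putting that together with the earlier script, the linear-solve path would look something like this (just a sketch reusing the forms and solver options from the snippets above):

A = assemble(a, mat_type="is")   # force a MATIS operator at assembly time
b = assemble(L)
solve(A, sol, b, solver_parameters={'ksp_monitor': None, 'pc_type': 'nn'})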

nabw commented 2 years ago

Whoops, you are right. The error is now the same as before (thanks!):

Traceback (most recent call last):
  File "demos/helmholtz/helmholtz2.py", line 13, in <module>
    A = assemble(a, mat_type='is')
  File "PETSc/Log.pyx", line 115, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "PETSc/Log.pyx", line 116, in petsc4py.PETSc.Log.EventDecorator.decorator.wrapped_func
  File "/home/barnafi/firedrake/src/firedrake/firedrake/adjoint/assembly.py", line 19, in wrapper
    output = assemble(*args, **kwargs)
  File "/home/barnafi/firedrake/src/firedrake/firedrake/assemble.py", line 101, in assemble
    return _assemble_form(expr, *args, **kwargs)
  File "/home/barnafi/firedrake/src/firedrake/firedrake/assemble.py", line 290, in _assemble_form
    return assembler.assemble()
  File "/home/barnafi/firedrake/src/firedrake/firedrake/assemble.py", line 406, in assemble
    return self.result
  File "/home/barnafi/firedrake/src/firedrake/firedrake/assemble.py", line 583, in result
    self._tensor.M.assemble()
  File "/home/barnafi/firedrake/src/PyOP2/pyop2/types/mat.py", line 896, in assemble
    self.handle.assemble()
  File "PETSc/Mat.pyx", line 1178, in petsc4py.PETSc.Mat.assemble
petsc4py.PETSc.Error: error code 73
[0] MatAssemblyBegin() at /home/barnafi/firedrake/src/petsc/src/mat/interface/matrix.c:5525
[0] MatAssemblyBegin_IS() at /home/barnafi/firedrake/src/petsc/src/mat/impls/is/matis.c:2744
[0] MatAssemblyBegin() at /home/barnafi/firedrake/src/petsc/src/mat/interface/matrix.c:5516
[0] Object is in wrong state
[0] Must call MatXXXSetPreallocation(), MatSetUp() or the matrix has not yet been factored on argument 1 "mat" before MatAssemblyBegin()
connorjward commented 2 years ago

I've done a fair bit of digging and it looks like the matrix is put into an invalid state whenever PyOP2 executes a parloop. Specifically I've found that applying the following diff makes the error go away:

diff --git a/pyop2/parloop.py b/pyop2/parloop.py
index 8384268c..bd409b5a 100644
--- a/pyop2/parloop.py
+++ b/pyop2/parloop.py
@@ -238,13 +238,6 @@ class Parloop:
                 for m in pl_arg.data:
                     m.change_assembly_state(new_state)
                 pl_arg.data.change_assembly_state(new_state)
-
-                if pl_arg.lgmaps is not None:
-                    olgmaps = []
-                    for m, lgmaps in zip(pl_arg.data, pl_arg.lgmaps):
-                        olgmaps.append(m.handle.getLGMap())
-                        m.handle.setLGMap(*lgmaps)
-                    orig_lgmaps.append(olgmaps)
         return tuple(orig_lgmaps)

     def restore_lgmaps(self, orig_lgmaps):
@@ -252,12 +245,6 @@ class Parloop:
         if not self._has_mats:
             return

-        orig_lgmaps = list(orig_lgmaps)
-        for arg, d in reversed(list(zip(self.global_kernel.arguments, self.arguments))):
-            if isinstance(arg, (MatKernelArg, MixedMatKernelArg)) and d.lgmaps is not None:
-                for m, lgmaps in zip(d.data, orig_lgmaps.pop()):
-                    m.handle.setLGMap(*lgmaps)
-
     @cached_property
     def _has_mats(self):
         return any(isinstance(a, (MatParloopArg, MixedMatParloopArg)) for a in self.arguments)

Interestingly I found that just running the code below was enough to reproduce the error:

mat.handle.setLGMap(*mat.handle.getLGMap())
mat.handle.assemble()
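
For reference, the same two lines expanded into a standalone petsc4py sketch (my reconstruction, not code from Firedrake or PyOP2; the exact createIS/setISPreallocation signatures may vary between petsc4py versions):

from petsc4py import PETSc

n = 4
lgmap = PETSc.LGMap().create(list(range(n)), comm=PETSc.COMM_SELF)

# Preallocate a MATIS the same way the modified PyOP2 allocation code above does.
mat = PETSc.Mat().createIS(((n, None), (n, None)),
                           lgmapr=lgmap, lgmapc=lgmap, comm=PETSc.COMM_SELF)
mat.setISPreallocation([1] * n, [0] * n)
mat.setUp()

# Re-setting the lgmaps (even to the very same maps) makes MATIS rebuild its local
# matrix and drop the preallocation, so the assemble fails with the
# "Must call MatXXXSetPreallocation(), MatSetUp() ..." error shown above.
mat.setLGMap(*mat.getLGMap())
mat.assemble()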

Hope this helps.

nabw commented 2 years ago

Thanks, there is certainly something going on with the construction of these maps. It now runs in serial, but in parallel it gives me this:

[0] MatAssemblyEnd() at /home/barnafi/firedrake/src/petsc/src/mat/interface/matrix.c:5610
[0] MatAssemblyEnd_IS() at /home/barnafi/firedrake/src/petsc/src/mat/impls/is/matis.c:2754
[0] MatAssemblyEnd() at /home/barnafi/firedrake/src/petsc/src/mat/interface/matrix.c:5614
[0] MatAssemblyEnd_SeqAIJ() at /home/barnafi/firedrake/src/petsc/src/mat/impls/aij/seq/aij.c:1157
[0] Petsc has generated inconsistent data
[0] Unused space detected in matrix: 76 X 76, 86 unneeded

To avoid this, I tried commenting out pyop2/types/mat.py:787:

# mat.setOption(mat.Option.UNUSED_NONZERO_LOCATION_ERR, True)

and now I instead get an error about missing diagonal entries:

[0] Object is in wrong state
[0] Matrix is missing diagonal entry 55

I suspect there is something weird going on with the ghost nodes. IS matrices handle a local matrix which is mapped to the global one through the lgmap, and this local matrix includes all ghost dofs. To be more concrete, if the problem has vertex-based dofs on the following mesh:

x ----- x ----- x
|       |       |
x ----- x ----- x

where each of the 2 processes owns one (square) element, then each local matrix will be of size 4x4. I can try looking at this if you point me to where these maps are constructed. Also, thank you for your help!
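
As a concrete illustration of that picture, here is a tiny plain-numpy sketch (hypothetical numbers, not Firedrake or PETSc code): each rank keeps an unassembled 4x4 local matrix over all of its dofs (owned and ghost), and the global operator only ever acts through the local-to-global maps.

import numpy as np

n_global = 6
# Global vertex numbering of the sketch above:
#   0 ----- 1 ----- 2
#   |       |       |
#   3 ----- 4 ----- 5
# Rank 0 holds the left element (dofs 0, 1, 3, 4), rank 1 the right one (1, 2, 4, 5);
# the shared dofs 1 and 4 appear in both local matrices.
lgmaps = [np.array([0, 1, 3, 4]), np.array([1, 2, 4, 5])]
A_local = [np.ones((4, 4)), np.ones((4, 4))]  # placeholder 4x4 local (element) matrices

def matis_mult(x):
    """Apply A = sum_p R_p^T A_p R_p without ever forming A globally."""
    y = np.zeros(n_global)
    for lg, A_p in zip(lgmaps, A_local):
        y[lg] += A_p @ x[lg]   # gather through the lgmap, apply locally, scatter back
    return y

print(matis_mult(np.ones(n_global)))  # shared dofs 1 and 4 receive contributions from both "ranks"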

connorjward commented 2 years ago

The lgmaps are constructed here.

stefanozampini commented 2 years ago

> I've done a fair bit of digging and it looks like the matrix is put into an invalid state whenever PyOP2 executes a parloop. Specifically I've found that applying the following diff makes the error go away: [...]

@wence- What is the reason behind changing the l2gmaps here? is it for a specific matrix type? or what? when you change the l2gmaps, MATIS destroys the old local matrices and recreates new ones, thus throwing away any preallocation done at init time.

connorjward commented 2 years ago

> @wence- What is the reason behind changing the l2gmaps here? is it for a specific matrix type? or what? when you change the l2gmaps, MATIS destroys the old local matrices and recreates new ones, thus throwing away any preallocation done at init time.

We swap out the lgmaps for the case where we apply boundary conditions and want to mask certain nodes from being written to. I think there is a bug in assemble where we always do this swap even if there are no BCs (so the old and new lgmaps are the same) (see #2487).

stefanozampini commented 2 years ago

> @wence- What is the reason behind changing the l2gmaps here? is it for a specific matrix type? or what? when you change the l2gmaps, MATIS destroys the old local matrices and recreates new ones, thus throwing away any preallocation done at init time.
>
> We swap out the lgmaps for the case where we apply boundary conditions and want to mask certain nodes from being written to. I think there is a bug in assemble where we always do this swap even if there are no BCs (so the old and new lgmaps are the same).

Masking as in skipping value insertion? You could use negative entries for that. If bc locations are not dynamic (i.e. indices do not change during subsequent assemblies) then you should be fine.
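
For illustration, here is a minimal petsc4py sketch of the masking-by-negative-indices idea (my own toy example, not PyOP2's implementation): a local dof mapped to a negative global index is silently dropped by MatSetValuesLocal, so Dirichlet rows and columns are never written during element assembly.

from petsc4py import PETSc
import numpy as np

n = 4
A = PETSc.Mat().createAIJ((n, n), nnz=n, comm=PETSc.COMM_SELF)

# Map local dof 0 (say, a Dirichlet dof) to -1: contributions touching it are
# dropped on insertion, because MatSetValues ignores negative indices.
lgmap = PETSc.LGMap().create([-1, 1, 2, 3], comm=PETSc.COMM_SELF)
A.setLGMap(lgmap, lgmap)

A.setValuesLocal(range(n), range(n), np.ones((n, n)),
                 addv=PETSc.InsertMode.ADD_VALUES)
A.assemble()

A.setValue(0, 0, 1.0)  # afterwards, put the identity on the Dirichlet diagonal
A.assemble()
print(A.getValues(range(n), range(n)))  # row/column 0 only carries the diagonal 1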

connorjward commented 2 years ago

If I'm reading the code correctly then that is exactly what we do. Possibly the issue is that if we apply boundary conditions then we actually create a new lgmap and swap it in rather than modifying the one already in use.

wence- commented 2 years ago

Yes, that's right. TBH, I think a neater way to build the MATIS structure would be to intercept the matrix construction, make SeqAIJ Neumann matrices on each process, assemble the global Dirichlet matrix as normal, and then make the MATIS you need by hand.

As @connorjward says, we modify the lgmaps to apply boundary conditions, but this doesn't play well with MATIS (which takes a change of lgmap as a sign that everything is invalidated). When I looked at this about 3 years ago I didn't manage to figure out a good solution.
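
A very rough sketch of that idea (my interpretation, not an existing Firedrake code path; it assumes petsc4py exposes MatISSetLocalMat as Mat.setISLocalMat):

from petsc4py import PETSc

def wrap_local_neumann_matrix(local_mat, lgmap, nowned, nglobal, comm):
    # Hypothetical helper: local_mat is the per-process SeqAIJ "Neumann" matrix
    # over owned + ghost dofs, assembled independently on each rank, and lgmap
    # is its local-to-global numbering.
    A = PETSc.Mat().createIS(((nowned, nglobal), (nowned, nglobal)),
                             lgmapr=lgmap, lgmapc=lgmap, comm=comm)
    A.setISLocalMat(local_mat)   # hand the unassembled local operator to MATIS
    A.assemble()
    return A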

wence- commented 2 years ago

> @wence- What is the reason behind changing the l2gmaps here? is it for a specific matrix type? or what? when you change the l2gmaps, MATIS destroys the old local matrices and recreates new ones, thus throwing away any preallocation done at init time.
>
> We swap out the lgmaps for the case where we apply boundary conditions and want to mask certain nodes from being written to. I think there is a bug in assemble where we always do this swap even if there are no BCs (so the old and new lgmaps are the same).
>
> Masking as in skipping value insertion? You could use negative entries for that. If bc locations are not dynamic (i.e. indices do not change during subsequent assemblies) then you should be fine.

Yes, negative values. But we don't condense out the Dirichlet rows, so afterwards we must insert the identity on the diagonal, which is what I recall used to cause problems with MATIS.

nabw commented 2 years ago

Although it was in a quite different context, one way I have seen this problem solved is to insert 1/#owningProcs at each Dirichlet dof. These contributions sum to an identity upon assembly, which allows Dirichlet bcs to be handled within MATIS. It only requires knowing, for each dof, how many processes own it, which perhaps you already track.
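
A toy numpy check of that trick (hypothetical numbers; think of one Dirichlet dof shared by two processes):

import numpy as np

num_owners = np.array([1, 2, 1])          # the middle Dirichlet dof is shared by two processes
local_diag = 1.0 / num_owners             # each owner inserts 1/#owningProcs on its local diagonal
assembled_diag = local_diag * num_owners  # MATIS assembly sums the owners' contributions
print(assembled_diag)                     # -> [1. 1. 1.]: identity rows for the Dirichlet dofs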