jedbrown opened 7 years ago
Let me see if my understanding is correct: `MatGetSubMatrix` returns a parallel matrix living on the original communicator, while `MatGetSubMatrices` returns an array of sequential matrices on each process (in our case we only need one sequential matrix, i.e., the array will have length one). So `MatGetSubMatrix` requires more memory because a parallel matrix stores additional information even on processes that own no rows?
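
For concreteness, a minimal sketch of the two calls as I understand them (pre-3.8 API names; `A`, `isrow`, and `iscol` are placeholders assumed to be built elsewhere):

```c
#include <petscmat.h>

/* Sketch only: contrasting the two extraction paths.
   A is an existing MPIAIJ matrix; isrow/iscol are index sets built elsewhere. */
PetscErrorCode ExtractBothWays(Mat A, IS isrow, IS iscol)
{
  Mat            Asub_par;  /* parallel matrix on A's communicator */
  Mat           *Asub_seq;  /* array of sequential matrices (length 1 here) */
  PetscErrorCode ierr;

  /* Collective; the result lives on the same communicator as A, so every
     rank carries the parallel bookkeeping even if it owns no rows. */
  ierr = MatGetSubMatrix(A, isrow, iscol, MAT_INITIAL_MATRIX, &Asub_par);CHKERRQ(ierr);

  /* Also collective, but each rank receives sequential (SEQAIJ) submatrices;
     here we request exactly one per rank. */
  ierr = MatGetSubMatrices(A, 1, &isrow, &iscol, MAT_INITIAL_MATRIX, &Asub_seq);CHKERRQ(ierr);
  return 0;
}
```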
So `MPIAIJ` matrices are stored with the diagonal block separate from the off-diagonal block (to overlap communication of the halo needed by the vector). `MatMPIAIJGetLocalMat` takes that partitioned representation and gives you a single matrix with global column space -- that is a copy. But inside the implementation of `MatGetSubMatrix_MPIAIJ`, the usage here currently takes a path that calls a form of `MatGetSubMatrices` and converts the unpartitioned representation into `MPIAIJ`. If you call `MatGetSubMatrices` directly, you eliminate the memory and overhead of converting the submatrices to partitioned format and back to unpartitioned.
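
For reference, a sketch of the copy being described (`A` is assumed to be an existing `MPIAIJ` matrix):

```c
#include <petscmat.h>

/* Sketch: MatMPIAIJGetLocalMat merges the separately stored diagonal and
   off-diagonal blocks into one SEQAIJ matrix with global column space --
   i.e., it allocates a copy of the local rows. */
PetscErrorCode GetLocalCopy(Mat A)
{
  Mat            A_loc;
  PetscErrorCode ierr;

  ierr = MatMPIAIJGetLocalMat(A, MAT_INITIAL_MATRIX, &A_loc);CHKERRQ(ierr);
  /* ... use A_loc ... */
  ierr = MatDestroy(&A_loc);CHKERRQ(ierr);
  return 0;
}
```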
The present use of `MatGetSubMatrix` looks awkward and appears to waste significant memory, particularly when there are several MPI processes per device. `MatGetSubMatrices` does more precisely what you want -- returning a sequential matrix with full rows only on the processes that request it (processes that are not part of `gpuWorld` can pass empty index sets).
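
A sketch of that pattern, assuming a flag `inGpuWorld` and the global size `N` are computed elsewhere:

```c
#include <petscmat.h>

/* Sketch of the empty-index-set pattern: ranks in gpuWorld ask for all
   rows/columns, the remaining ranks ask for none. */
PetscErrorCode ExtractOnGpuRanks(Mat A, PetscBool inGpuWorld, PetscInt N)
{
  IS             is;
  Mat           *Asub;
  PetscErrorCode ierr;

  ierr = ISCreateStride(PETSC_COMM_SELF, inGpuWorld ? N : 0, 0, 1, &is);CHKERRQ(ierr);
  ierr = MatGetSubMatrices(A, 1, &is, &is, MAT_INITIAL_MATRIX, &Asub);CHKERRQ(ierr);
  /* Asub[0] holds the full matrix on gpuWorld ranks, an empty SEQAIJ elsewhere. */
  ierr = ISDestroy(&is);CHKERRQ(ierr);
  return 0;
}
```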
The use of `VecGetArray` for the RHS will be problematic if the RHS has been "locked" (e.g., by a higher-level solve in PETSc). I recommend `VecGetArrayRead`.
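
For example, a minimal read-only access sketch (`b` stands for the RHS vector):

```c
#include <petscvec.h>

/* Sketch: read-only access avoids tripping the write lock a higher-level
   solver may have placed on the vector. */
PetscErrorCode ReadRhs(Vec b)
{
  const PetscScalar *barray;
  PetscErrorCode     ierr;

  ierr = VecGetArrayRead(b, &barray);CHKERRQ(ierr);
  /* ... e.g., copy barray to the device ... */
  ierr = VecRestoreArrayRead(b, &barray);CHKERRQ(ierr);
  return 0;
}
```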