`b = A*x` is not particularly fast when `A isa BlockSkylineMatrix`, at least not for the block sizes that commonly arise in FE-DVR. Maybe representing `A` as a `BandedMatrix` (thus explicitly storing structural zeros) could improve arithmetic performance through better cache-friendliness? When we get to distributed computing, we could similarly use a `BandedBlockBandedMatrix`, where each node essentially holds a banded matrix containing some number of finite elements, and the only communication between nodes would be the bridge function between two finite elements, i.e. a single scalar. I assume an incomplete factorization could then be formed by factorizing each node's banded matrix independently, and this could be used as e.g. a preconditioner.
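To make the last idea concrete, here is a minimal sketch of what "factorize each node's matrix and use that as a preconditioner" could look like, i.e. a block-Jacobi preconditioner. It uses only the `LinearAlgebra` standard library and dense blocks, so it says nothing about performance. The helper names (`block_jacobi`, `apply_preconditioner`, `richardson`), the toy matrix, and the partition into three "nodes" are all assumptions made for illustration; the toy matrix is not an actual FE-DVR operator, and the coupling between neighbouring blocks is simply dropped rather than communicated.

```julia
using LinearAlgebra

# Hypothetical helper: factorize each diagonal block of A on its own.
# `ranges` lists the index range owned by each (imagined) node.
function block_jacobi(A::AbstractMatrix, ranges)
    factors = [lu(Matrix(A[rng, rng])) for rng in ranges]
    return (ranges = ranges, factors = factors)
end

# Apply the preconditioner: solve with every diagonal block separately,
# ignoring all coupling between blocks.
function apply_preconditioner(P, r::AbstractVector)
    z = similar(r, Float64)
    for (rng, F) in zip(P.ranges, P.factors)
        z[rng] = F \ r[rng]
    end
    return z
end

# Preconditioned Richardson iteration, used here only to exercise the
# preconditioner; a Krylov solver would be the more realistic consumer.
function richardson(A, b, P; iterations = 200)
    x = zeros(length(b))
    for _ in 1:iterations
        x += apply_preconditioner(P, b - A * x)
    end
    return x
end

# Toy banded matrix (bandwidth 2), strictly diagonally dominant so that
# every diagonal block is invertible and the iteration converges.
n = 12
A = zeros(n, n)
for i in 1:n, j in 1:n
    d = abs(i - j)
    if d == 0
        A[i, j] = 4.0
    elseif d <= 2
        A[i, j] = -1.0 / d
    end
end

ranges = [1:4, 5:8, 9:12]      # three "nodes", four unknowns each
P = block_jacobi(A, ranges)

b = ones(n)
x = richardson(A, b, P)
println("residual norm: ", norm(A * x - b))
```

In a real implementation each diagonal block would presumably be stored and factorized as a banded matrix rather than a dense one, and the dropped coupling would be limited to the bridge-function entry, but that is untested here.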