Alternate BandedMatrix/BandedBlockBandedMatrix storage for FE-DVR operators

b = A*x is not particularly fast for A isa BlockSkylineMatrix, at least not for the block sizes that commonly arise in FE-DVR. Maybe representing A as a BandedMatrix (thus explicitly storing structural zeros) could improve arithmetic performance due to cache-friendliness? When we get to distributed computing, we could similarly use a BandedBlockBandedMatrix, where each node essentially has a banded matrix with some number of finite elements stored in it, and the only communication between nodes would be the bridge function between two finite elements, i.e. a single scalar. I assume an incomplete factorization could be formed by factorizing each banded matrix, and this could be used as e.g. a preconditioner.

JuliaApproximation / CompactBases.jl

Alternate BandedMatrix/BandedBlockBandedMatrix storage for FE-DVR operators #25