SciML / DifferentialEquations.jl

Multi-language suite for high-performance solvers of differential equations and scientific machine learning (SciML) components. Ordinary differential equations (ODEs), stochastic differential equations (SDEs), delay differential equations (DDEs), differential-algebraic equations (DAEs), and more in Julia.
https://docs.sciml.ai/DiffEqDocs/stable/

Specializing on BandedBlockBandedMatrices #483

Closed matthieugomez closed 5 years ago

matthieugomez commented 5 years ago

BandedBlockBandedMatrix is a common sparsity structure for Jacobians (for instance, it appears when solving PDEs using finite difference methods). SparseDiffTools makes it very fast to color BandedBlockBandedMatrices. However, using BandedBlockBandedMatrices can introduce slowdowns in other parts of the code (https://github.com/JuliaDiffEq/DiffEqDiffTools.jl/issues/67). This issue is to check that BandedBlockBandedMatrices are as performant as possible through the entire system. @ChrisRackauckas @huanglangwen
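For context, here is a sketch of where this structure arises (an illustration assuming a 2D problem discretized with 5-point finite differences on an n×n grid; the constructor call follows the BlockBandedMatrices.jl documentation):

using BlockBandedMatrices, FillArrays

n = 8  # n×n grid, so the Jacobian is n²×n²
# One block per grid row: only adjacent rows couple (blockbandwidths (1,1)),
# and within a row only adjacent points couple (subblockbandwidths (1,1)).
J = BandedBlockBandedMatrix(Zeros(n^2, n^2), (fill(n, n), fill(n, n)), (1, 1), (1, 1))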

huanglangwen commented 5 years ago

  while i-ind.refinds[p]>=0   # linear scan: O(n) steps per lookup
    p+=1
  end

in ArrayInterface.jl:306 consumes a huge amount of time. It's a bad search algorithm; a binary search via searchsortedfirst would be better. In the end, I don't think it will be faster than a sparse matrix as long as we still use getindex.
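A minimal sketch of that suggestion (illustrative helper names, not the actual ArrayInterface.jl internals), using searchsortedlast, a close cousin of the suggested searchsortedfirst, and assuming refinds is sorted in ascending order:

# The linear scan above finds the first index p with refinds[p] > i.
function find_ref_linear(refinds, i)
    p = 1
    while i - refinds[p] >= 0   # steps past every entry ≤ i: O(n)
        p += 1
    end
    return p
end

# searchsortedlast returns the index of the last entry ≤ i (0 if none),
# giving the same answer in O(log n) comparisons.
find_ref_binary(refinds, i) = searchsortedlast(refinds, i) + 1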

ChrisRackauckas commented 5 years ago

Maybe @dlfivefifty has a good idea for how to specialize iterating through BlockBandedMatrix structures?

dlfivefifty commented 5 years ago

Do you mean BandedBlockBandedMatrix or BlockBandedMatrix? The latter just orders the data by column, but I assume for FD you actually want BandedBlockBandedMatrix, which is a block version of the standard BLAS banded format:

julia> A = BandedBlockBandedMatrix{Int}(undef, (fill(3,3),fill(3,3)), (1,1), (1,1)); vec(A.data) .= 1:length(A.data); A
3×3-blocked 9×9 BandedBlockBandedMatrix{Int64,BlockArrays.PseudoBlockArray{Int64,2,Array{Int64,2},BlockArrays.BlockSizes{2,Tuple{Array{Int64,1},Array{Int64,1}}}}}:
 5  13   ⋅  │  29  37   ⋅  │   ⋅   ⋅   ⋅
 6  14  22  │  30  38  46  │   ⋅   ⋅   ⋅
 ⋅  15  23  │   ⋅  39  47  │   ⋅   ⋅   ⋅
 ───────────┼──────────────┼────────────
 8  16   ⋅  │  32  40   ⋅  │  56  64   ⋅
 9  17  25  │  33  41  49  │  57  65  73
 ⋅  18  26  │   ⋅  42  50  │   ⋅  66  74
 ───────────┼──────────────┼────────────
 ⋅   ⋅   ⋅  │  35  43   ⋅  │  59  67   ⋅
 ⋅   ⋅   ⋅  │  36  44  52  │  60  68  76
 ⋅   ⋅   ⋅  │   ⋅  45  53  │   ⋅  69  77

julia> A.data
3×3-blocked 9×9 BlockArrays.PseudoBlockArray{Int64,2}:
 1  10  19  │  28  37  46  │  55  64  73
 2  11  20  │  29  38  47  │  56  65  74
 3  12  21  │  30  39  48  │  57  66  75
 ───────────┼──────────────┼────────────
 4  13  22  │  31  40  49  │  58  67  76
 5  14  23  │  32  41  50  │  59  68  77
 6  15  24  │  33  42  51  │  60  69  78
 ───────────┼──────────────┼────────────
 7  16  25  │  34  43  52  │  61  70  79
 8  17  26  │  35  44  53  │  62  71  80
 9  18  27  │  36  45  54  │  63  72  81

matthieugomez commented 5 years ago

Yes, I meant BandedBlockBandedMatrix.

huanglangwen commented 5 years ago

But how do we iterate over the nonzero values of A? @dlfivefifty

dlfivefifty commented 5 years ago

I normally use the fact that view(A,Block(K,J)) conforms to the banded matrix interface, though you can do it explicitly via:

l,u = blockbandwidths(A)       # block bandwidths
λ,μ = subblockbandwidths(A)    # bandwidths within each block
M,N = nblocks(A)
for J = 1:N, K = blockcolrange(A,J)   # nonzero blocks in block-column J
    V = view(A, Block(K,J))           # banded view of block (K,J)
    m,n = size(V)
    for j = 1:n, k = colrange(V,j)    # nonzero rows in column j of V
        V[k,j] = ...
    end
end

This could probably be cleaned up as an iterator, but I'm not quite sure how that would look. You can also use K=max(1,J-u):min(M,J+l) and k=max(1,j-μ):min(m,j+λ) if you want to be explicit.

ChrisRackauckas commented 5 years ago

From those PRs, it looks like there's a dilemma:

dlfivefifty commented 5 years ago

Note that LU factorization does not exist (yet) for BlockBandedMatrix. It's a bit difficult to implement due to managing the pivoting.

QR factorization does exist. It knocks the socks off UMFPACK's QR, but is in the same ballpark as UMFPACK's LU with pivoting. Memory usage is a potential saving here for ill-conditioned problems, and distributed memory is feasible. (So is multithreading, but it's already calling multithreaded LAPACK, so that's unlikely to help much.)

Cholesky could exist and is potentially very fast, but I haven't got around to implementing it. (It should work just like the QR implementation.) I'm not sure whether you can exploit positive definiteness in Jacobians.

I don't understand your point 2.

ChrisRackauckas commented 5 years ago

Point 2 is just https://github.com/JuliaDiffEq/ArrayInterface.jl/pull/23 and https://github.com/JuliaDiffEq/DiffEqDiffTools.jl/pull/74, unless you have a better idea for how to do the decompression.

dlfivefifty commented 5 years ago

OK, I will have to look.

Generally speaking, broadcast should work great for banded/banded-block-banded matrices, so if it's at all possible to avoid for loops and use broadcasting, that is preferable. Note that, provided f(0) == 0, broadcasting will preserve bandwidth and only iterate over the non-zeros, and it does tricks that are hard to pull off by hand to reduce the operation to the underlying data storage, essentially broadcasting over a regular array.
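For instance (a sketch, constructing a small matrix the same way as the MWE later in the thread, and assuming the broadcast machinery described above):

using BlockBandedMatrices, FillArrays

A = BandedBlockBandedMatrix(Ones(9, 9), (fill(3, 3), fill(3, 3)), (1, 1), (1, 1))
B = 2 .* A     # f(x) = 2x satisfies f(0) == 0, so B keeps A's
               # banded-block-banded structure and bandwidths
C = sin.(A)    # likewise, since sin(0) == 0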

huanglangwen commented 5 years ago

Is it possible to broadcast while keeping track of the (global) row/column indices, like f(aij, i, j)?

ChrisRackauckas commented 5 years ago

See figure 1: https://pdfs.semanticscholar.org/ae4d/65769b6551a51d6fc6be2f021515bffa0798.pdf

The color decompression problem is simplified as follows. We have a color vector [1,2,3,1,1,2,3] of terms we can differentiate at the same time. This works because numerical/forward differentiation calculates one column at a time, so if, due to the sparsity, there are terms which do not connect (term 1 doesn't appear in any equation with term 4 or 5), then there is no perturbation confusion and you can get an accurate derivative of terms 1, 4, and 5 at the same time. Due to the sparsity pattern of the Jacobian, for each nonzero row of the compressed column there is only one column of that color which is non-zero in that row, so we can decompress. In Figure 1, this is us calculating the green column in (c) and then pulling it out to get two columns of the Jacobian in (b).

However, the issue is how to iterate through this column effectively when (b) is a BlockBandedMatrix. It works really well with SparseMatrixCSC, but we haven't found an effective way to do it on this structure.
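For reference, a minimal sketch of the decompression step in the SparseMatrixCSC case (illustrative names, not the actual SparseDiffTools internals): dx holds the derivative computed for one color, and because the columns of that color own disjoint rows, each structural nonzero can be copied straight out of dx.

using SparseArrays

function decompress_color!(J::SparseMatrixCSC, dx, colorvec, color)
    rows = rowvals(J)
    vals = nonzeros(J)
    for col in 1:size(J, 2)
        colorvec[col] == color || continue
        for idx in nzrange(J, col)     # only the structural nonzeros
            vals[idx] = dx[rows[idx]]
        end
    end
    return J
end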

dlfivefifty commented 5 years ago

Is it possible to broadcast while keeping track of the (global) row/column indices, like f(aij, i, j)?

In theory f.(A, 1:m, (1:n)'), but it's unlikely to be fast without work.

dlfivefifty commented 5 years ago

It's also surprising that you'd want the global indices; for FD these don't really tell you anything about the matrix entries.

huanglangwen commented 5 years ago

The global column index is needed to figure out the color of the column, and the global row index is needed to retrieve values from the differentiated vector.

dlfivefifty commented 5 years ago

It sounds like there should be a block version of colouring: instead of column j we have Block(J)[j], which gets the j-th column of the J-th block-column. Let me read the paper and think.

ChrisRackauckas commented 5 years ago

The color vector will definitely coincide with the block structure, i.e. you can have a 1 in each block for sure, since those columns definitely won't have perturbation confusion. The issue is then the decompression: since you have at least one calculated value in each block, iterating over it isn't necessarily nice. There's probably a better way than what we were doing, though.

dlfivefifty commented 5 years ago

I think it's best to start with banded matrices, from which the banded-block-banded version should follow. It seems like creating structurally orthogonal columns is straightforward:

l,u = bandwidths(A)
cols = [j:(l+u+1):size(A,2) for j=1:l+u+1]  # l+u+1 groups of structurally orthogonal columns
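A quick sanity check of that grouping on a small example (a sketch; brand from BandedMatrices.jl builds a random banded matrix):

using BandedMatrices

A = brand(8, 8, 1, 1)   # random 8×8 banded matrix with bandwidths (1, 1)
l, u = bandwidths(A)
cols = [j:(l+u+1):size(A, 2) for j = 1:l+u+1]
# cols == [1:3:7, 2:3:8, 3:3:6]: e.g. columns 1, 4, 7 share no rows,
# so one perturbation differentiates all of them at once.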

To translate this to banded-block-banded, I think we can iterate the idea: stride over block-columns as in the banded case, and then within each block-column stride over its columns according to the subblock bandwidths.

l,u = blockbandwidths(A)
λ,μ = subblockbandwidths(A)
blockcols = [j:(l+u+1):nblocks(A,2) for j=1:l+u+1]  # color groups of block-columns

[
 [begin
      n = blocksize(A,2,J)  # number of columns in the J-th block-column
      [Block(J)[j:(λ+μ+1):n] for j=1:λ+μ+1]  # subcolor groups within each block-column
  end for J in bcs] for bcs in blockcols]

Unfortunately, this is hitting some missing functionality in block indexing, so I'll need to do some bug fixes in BlockArrays.jl.

huanglangwen commented 5 years ago

This iteration method assumes that the colorvec looks like repeat(1:(l+u+1))[1:size(A,2)], but actually it is given by the user.

ChrisRackauckas commented 5 years ago

Well, the colorvec for BandedBlockBanded will at least be "supersetted" by repeat(1:(l+u+1))[1:size(A,2)], with that being the minimal number of repeats.
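As an illustration of that shorthand (a sketch; repeat needs an explicit count, supplied here via cld):

l, u = 1, 1
ncols = 7
colorvec = repeat(1:(l+u+1), cld(ncols, l+u+1))[1:ncols]
# colorvec == [1, 2, 3, 1, 2, 3, 1]: colors cycle with period l+u+1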

dlfivefifty commented 5 years ago

I'm a bit confused. My general point is that if the Jacobian is a BandedBlockBandedMatrix, then the variables themselves are probably best thought of as a BlockVector, in which case indexing by block-columns is very natural.

Perhaps there's an MWE that I can look at optimising?

huanglangwen commented 5 years ago

What's a BlockVector?

Here is an MWE (requires ArrayInterface@1.3.0):

using ArrayInterface, BlockBandedMatrices, FillArrays, BenchmarkTools, SparseArrays
J = BandedBlockBandedMatrix(Ones(10000, 10000), (fill(100, 100), fill(100, 100)), (1, 1), (1, 1))
Jsparse = sparse(J)
vfx1 = ones(10000)
function f(J,vfx1)
    for col_index in 1:size(J,2)
        #if color[col_index] == color_i  # ignored for simplicity
        for row_index in ArrayInterface.findstructralnz(J,col_index)  # J is a BBB matrix
            J[row_index,col_index] = vfx1[row_index]
        end
        #end
    end
end
@btime f(J,vfx1)
@btime f(Jsparse,vfx1)

That's generally the code that I'm using now. It seems J[row_index,col_index]=vfx1[row_index] is doing lots of allocations.

dlfivefifty commented 5 years ago

How do I get ArrayInterface@1.3.0?

huanglangwen commented 5 years ago

Ah, sorry, the PR hasn't been merged yet @ChrisRackauckas. For now, you can use this: https://github.com/JuliaDiffEq/ArrayInterface.jl/pull/23.

dlfivefifty commented 5 years ago

Hmm, I'm having trouble understanding how SparseArrays is so fast. Here's my optimised code:

using BandedMatrices, BlockArrays

cs = BlockArrays.BlockSizes((BlockArrays.cumulsizes(J,2),))  # column block sizes
b = BlockArray(vfx1, cs)                                     # impose block structure on the vector

function g(A,b)
    λ,μ = subblockbandwidths(A)
    @inbounds for J = Block.(1:nblocks(A,2)), K = BlockBandedMatrices.blockcolrange(A,J)
        V = view(A,K,J)
        b_v = b.blocks[K.n[1]]                    # block of b matching block-row K
        data = BlockBandedMatrices.bandeddata(V)  # raw banded storage of the block
        p = pointer(data)
        st = stride(data,2)
        m,n = size(V)
        for j = 1:n, k = max(1,j-μ):min(m,j+λ)    # structural nonzeros of column j
            unsafe_store!(p, b_v[k], (j-1)*st + μ + k - j + 1)
        end
    end
end

It's still 100μs, so roughly 200x slower than sparse!

Half of the time is spent computing max(1,j-μ):min(m,j+λ), the other half doing the unsafe_store!. These shouldn't be expensive operations...

dlfivefifty commented 5 years ago

I got it down to 37μs

function g(A,b)
    # Hard-coded for subblockbandwidths (λ,μ) == (1,1): each column of a block
    # holds at most three entries, so the band loop is fully unrolled.
    @inbounds for J = Block.(1:nblocks(A,2)), K = BlockBandedMatrices.blockcolrange(A,J)
        V = view(A,K,J)
        b_v = b.blocks[K.n[1]]
        data = BlockBandedMatrices.bandeddata(V)
        p = pointer(data)
        st = stride(data,2)
        m,n = size(V)
        unsafe_store!(p, b_v[1], 2)        # first column: entries k = 1, 2
        unsafe_store!(p, b_v[2], 3)
        for j = 2:n-1                      # interior columns: entries k = j-1, j, j+1
            jst = (j-1)*st
            unsafe_store!(p, b_v[j-1], jst + 1)
            unsafe_store!(p, b_v[j], jst + 2)
            unsafe_store!(p, b_v[j+1], jst + 3)
        end
        jst = (n-1)*st                     # last column: entries k = n-1, n
        unsafe_store!(p, b_v[n-1], jst + 1)
        unsafe_store!(p, b_v[n], jst + 2)
    end
end

Now half the cost is in getindex and the other half in unsafe_store!. I guess I'm not accessing memory "optimally": I probably need to go down columns, not go by block.

huanglangwen commented 5 years ago

Why is following columns extremely slow? It's around 1ms!! Then it's strange that the first version using findstructralnz(J,col_index) is the fastest, at only 5μs.

function g(A,b)
    λ,μ = subblockbandwidths(A)
    @inbounds for J = Block.(1:nblocks(A,2))
        blockcolrange = BlockBandedMatrices.blockcolrange(A,J)
        _,n = BlockBandedMatrices.blocksize(A,(blockcolrange[1].n[1],J.n[1]))
        for j = 1:n                # go down the columns of the block-column...
            for K = blockcolrange
                # ...but the view, pointer, and stride are recomputed for
                # every single column j, which is where the time goes
                V = view(A,K,J)
                b_v = b.blocks[K.n[1]]
                data = BlockBandedMatrices.bandeddata(V)
                p = pointer(data)
                st = stride(data,2)
                m,_ = size(V)
                for k = max(1,j-μ):min(m,j+λ)
                    unsafe_store!(p, b_v[k], (j-1)*st + μ + k - j + 1)
                end
            end
        end
    end
end

dlfivefifty commented 5 years ago

It might be that b being a BlockVector is causing part of the problem. I'll have to do some more experimenting.

huanglangwen commented 5 years ago

I'm terribly sorry: the sixth line of the MWE should be for col_index in 1:size(J,2) instead of for col_index in 1:length(size(J,2)). As a consequence, the baseline for the CSC matrix is 1.2 ms, and @dlfivefifty your code is extremely fast!!! Though @YingboMa suggests not using unsafe_store!, since it can destroy alias analysis and make vectorization impossible.

dlfivefifty commented 5 years ago

Ah OK, so I'm not going crazy!

I wasn't serious about unsafe_store!; I just wanted to get rid of all the overhead from views for the timing.
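For reference, a sketch of the same kernel written with ordinary indexing instead of raw pointers, along the lines of @YingboMa's suggestion (an illustration, not code from the thread):

function g_safe(A, b)
    λ,μ = subblockbandwidths(A)
    @inbounds for J = Block.(1:nblocks(A,2)), K = BlockBandedMatrices.blockcolrange(A,J)
        V = view(A,K,J)
        b_v = b.blocks[K.n[1]]
        data = BlockBandedMatrices.bandeddata(V)
        m,n = size(V)
        for j = 1:n, k = max(1,j-μ):min(m,j+λ)
            # plain setindex! into the banded storage: same memory layout as the
            # unsafe_store! version, but bounds and aliasing info stay visible
            data[μ + k - j + 1, j] = b_v[k]
        end
    end
    return A
end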

ChrisRackauckas commented 5 years ago

Cheers!

https://github.com/JuliaDiffEq/DiffEqDiffTools.jl/pull/74 https://github.com/JuliaDiffEq/SparseDiffTools.jl/pull/61

This is quite an amazing feature for systems of PDEs, and I can't find this auto-specialization anywhere else. Awesome work @huanglangwen and @dlfivefifty!