while i-ind.refinds[p]>=0
    p+=1
end
in ArrayInterface.jl:306 consumes a huge amount of time. It's a bad search algorithm; it would probably be better to use binary search via searchsortedfirst. In the end, I don't think it will be faster than a sparse matrix as long as we still use getindex.
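For reference, a minimal sketch of the binary-search replacement (assuming ind.refinds is sorted, which the linear scan above already relies on; starting the search from the beginning of the vector is an assumption for illustration):

# first p with ind.refinds[p] > i, i.e. where the linear scan would stop
p = searchsortedlast(ind.refinds, i) + 1
# equivalently, for integer indices: p = searchsortedfirst(ind.refinds, i + 1)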
Maybe @dlfivefifty has a good idea for how to specialize iterating through BlockBandedMatrix structures?
Do you mean BandedBlockBandedMatrix or BlockBandedMatrix? The latter just orders data by column, but I assume for FD you actually want BandedBlockBandedMatrix. This is a block version of the standard BLAS banded format:
julia> A = BandedBlockBandedMatrix{Int}(undef, (fill(3,3),fill(3,3)), (1,1), (1,1)); vec(A.data) .= 1:length(A.data); A
3×3-blocked 9×9 BandedBlockBandedMatrix{Int64,BlockArrays.PseudoBlockArray{Int64,2,Array{Int64,2},BlockArrays.BlockSizes{2,Tuple{Array{Int64,1},Array{Int64,1}}}}}:
5 13 ⋅ │ 29 37 ⋅ │ ⋅ ⋅ ⋅
6 14 22 │ 30 38 46 │ ⋅ ⋅ ⋅
⋅ 15 23 │ ⋅ 39 47 │ ⋅ ⋅ ⋅
───────────┼──────────────┼────────────
8 16 ⋅ │ 32 40 ⋅ │ 56 64 ⋅
9 17 25 │ 33 41 49 │ 57 65 73
⋅ 18 26 │ ⋅ 42 50 │ ⋅ 66 74
───────────┼──────────────┼────────────
⋅ ⋅ ⋅ │ 35 43 ⋅ │ 59 67 ⋅
⋅ ⋅ ⋅ │ 36 44 52 │ 60 68 76
⋅ ⋅ ⋅ │ ⋅ 45 53 │ ⋅ 69 77
julia> A.data
3×3-blocked 9×9 BlockArrays.PseudoBlockArray{Int64,2}:
1 10 19 │ 28 37 46 │ 55 64 73
2 11 20 │ 29 38 47 │ 56 65 74
3 12 21 │ 30 39 48 │ 57 66 75
───────────┼──────────────┼────────────
4 13 22 │ 31 40 49 │ 58 67 76
5 14 23 │ 32 41 50 │ 59 68 77
6 15 24 │ 33 42 51 │ 60 69 78
───────────┼──────────────┼────────────
7 16 25 │ 34 43 52 │ 61 70 79
8 17 26 │ 35 44 53 │ 62 71 80
9 18 27 │ 36 45 54 │ 63 72 81
Yes, I meant BandedBlockBandedMatrix. But how do I iterate over the nonzero values of A? @dlfivefifty
I normally use the fact that view(A,Block(K,J)) conforms to the banded matrix interface. Though you can do it explicitly via:
l,u = blockbandwidths(A)
λ,μ = subblockbandwidths(A)
M,N = nblocks(A)
for J=1:N, K=blockcolrange(A,J)   # blocks within the block band of block-column J
    V = view(A, Block(K,J))       # conforms to the banded matrix interface
    m,n = size(V)
    for j=1:n, k=colrange(V,j)    # rows within the sub-band of column j
        V[k,j] = ...
    end
end
This could probably be cleaned up as an iterator, but I'm not quite sure how that would look. You can also use K=max(1,J-u):min(M,J+l) and k=max(1,j-μ):min(m,j+λ) if you want to be explicit.
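For a self-contained illustration of that explicit version (the constructor mirrors the earlier example; the value k + j written into each entry is just a placeholder):

using BlockArrays, BlockBandedMatrices
A = BandedBlockBandedMatrix{Float64}(undef, (fill(3,3), fill(3,3)), (1,1), (1,1))
l,u = blockbandwidths(A)
λ,μ = subblockbandwidths(A)
M,N = nblocks(A)
for J = 1:N, K = max(1, J-u):min(M, J+l)        # blocks inside the block band
    V = view(A, Block(K,J))
    m,n = size(V)
    for j = 1:n, k = max(1, j-μ):min(m, j+λ)    # rows inside the sub-band
        V[k,j] = k + j                           # placeholder value
    end
end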
From those PRs, it looks like there's a dilemma:
Note LU factorization does not exist (yet) for BlockBandedMatrix. It's a bit difficult to implement due to managing the pivoting.
QR factorization does exist. It knocks the socks off UMFPACK's QR, but is in the same ballpark as UMFPACK's LU with pivoting. Memory usage is a potential saving here for ill-conditioned problems, and distributed memory is feasible. (So is multithreading, but it's already calling multithreaded LAPACK, so that's unlikely to help much.)
Cholesky could exist and is potentially very fast, but I haven't got around to implementing it. (Should work just like QR implementation.) I'm not sure if you can exploit positive definiteness in Jacobians.
I don't understand your point 2.
Point 2 is just https://github.com/JuliaDiffEq/ArrayInterface.jl/pull/23 and https://github.com/JuliaDiffEq/DiffEqDiffTools.jl/pull/74 , unless you have a better idea for how to do the decompression.
Ok will have to look.
Generally speaking, broadcast should work great for banded/banded-block-banded matrices. So if it's at all possible to avoid for loops and use broadcasting, that is preferable. Note that, provided f(0) == 0, broadcasting will preserve bandwidth and only iterate over non-zeros, and it does tricks that are hard to do by hand to reduce to the underlying data storage, essentially broadcasting over a regular array.
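A minimal sketch of that behaviour on a plain banded matrix (assuming BandedMatrices.jl; brand just builds a random banded example):

using BandedMatrices
A = brand(8, 8, 1, 1)    # random 8×8 banded matrix with bandwidths (1, 1)
B = 2 .* A               # 2*0 == 0, so the result is again banded with bandwidths (1, 1)
bandwidths(B)            # (1, 1): only the stored band was touched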
Is it possible to broadcast while keeping track of the (global) row/column number, like f(aij, i, j)?
See figure 1: https://pdfs.semanticscholar.org/ae4d/65769b6551a51d6fc6be2f021515bffa0798.pdf
The color decompression problem is simplified as follows. We have a color vector [1,2,3,1,1,2,3] of terms we can differentiate at the same time. This works because numerical/forward differentiation calculates one column at a time, so if, due to the sparsity, there are terms which do not connect (term 1 doesn't appear in any equation with term 4 or 5), then there will be no perturbation confusion and you can get accurate derivatives of terms 1, 4, and 5 at the same time. Due to the sparsity pattern of the Jacobian, for any given row there will only be one column of that color which is non-zero, so we can decompress. In Figure 1, this is us calculating the green column in (c) and then pulling it out to get two columns of the Jacobian in (b).
However, the issue is how to iterate through this column effectively when (b) is a BlockBandedMatrix. It works really well with SparseMatrixCSC, but we haven't found an effective way for this structure.
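To make the decompression step concrete, here is a toy sketch with a made-up sparsity pattern and color vector (not SparseDiffTools code): the compressed column for color c is J*(colorvec .== c), and because each row meets at most one column of that color, every structural nonzero can be read straight off the compressed column.

using SparseArrays
J = sparse([1 0 0 2 0;
            0 3 0 0 4;
            0 5 6 0 0])                 # toy Jacobian sparsity pattern
colorvec = [1, 2, 1, 2, 1]              # structurally orthogonal columns share a color
c = 1
compressed = J * (colorvec .== c)       # what one differentiation pass would compute
for col in findall(==(c), colorvec), row in findnz(J[:, col])[1]
    @assert J[row, col] == compressed[row]   # decompression: scatter back by row
end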
Is it possible to broadcast while keeping track of the (global) row/column number, like f(aij, i, j)?
In theory f.(A, 1:m, (1:n)'), but it's unlikely to be fast without work. It's also surprising that you'd want the global indices; for FD these don't really tell you anything about the matrix entries.
The global column index is needed to figure out the color of the column, and the global row index is needed for retrieving values from the differentiated vector.
It sounds like there should be a block version of colouring, so instead of column j we have Block(J)[j], which gets the j-th column of the J-th block column. Let me read the paper and think.
The color vector will definitely coincide with the block structure, i.e. you can have a 1 in each block for sure, since they definitely won't have perturbation confusion. The issue is then the compression, since you have at least one calculated value in each block, so iterating it isn't necessarily nice. There's probably a better way than what we were doing though.
I think it's best to start with banded matrices; the banded-block-banded matrix version should then follow. It seems like creating structurally orthogonal columns is straightforward:
l,u = bandwidths(A)
cols = [j:(l+u+1):size(A,2) for j=1:l+u+1]   # l+u+1 colour groups; columns in a group are ≥ l+u+1 apart
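As a quick sanity check of those groups on a concrete example (brand just builds a random banded matrix; the assertion verifies that columns sharing a group never touch the same row):

using BandedMatrices
A = brand(10, 10, 1, 1)
l,u = bandwidths(A)
cols = [j:(l+u+1):size(A,2) for j = 1:l+u+1]
for g in cols
    rows = [max(1, j-u):min(size(A,1), j+l) for j in g]   # row support of column j
    @assert allunique(reduce(vcat, rows))                  # structurally orthogonal
end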
To translate this to banded-block-banded, I think we can iterate the idea: we split the block columns as in the banded case, and then within each block we split the columns the same way.
l,u = blockbandwidths(A)
λ,μ = subblockbandwidths(A)
blockcols = [j:(l+u+1):nblocks(A,2) for j=1:l+u+1]
[
    [begin
         n = blocksize(A,1,J) # number of cols in the J-th block
         [Block(J)[j:(λ+μ+1):n] for j=1:λ+μ+1]
     end for J in bcs] for bcs in blockcols]
Unfortunately this is hitting some missing functionality in block indexing so I'll need to do some bug fixes in BlockArrays.jl.
This iteration method assumes that the colorvec is like repeat(1:(l+u+1))[1:size(A,2)], but actually it is given by the user.
Well, the colorvec for BandedBlockBanded will at least be "supersetted" by repeat(1:(l+u+1))[1:size(A,2)], with that being the minimal number of repeats.
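For reference, that minimal repeating pattern can be written compactly as (mod1 is just a convenient way to express the truncated repeat):

colorvec = [mod1(j, l + u + 1) for j in 1:size(A, 2)]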
I'm a bit confused. My general point is: if the Jacobian is a BandedBlockBandedMatrix, then probably the variables themselves are best thought of as a BlockVector, in which case indexing by block-columns is very natural. Perhaps there's an MWE that I can look at optimising?
What's a BlockVector?
Here is an MWE (requires ArrayInterface@1.3.0):
using ArrayInterface, BlockBandedMatrices, BenchmarkTools, SparseArrays
J = BandedBlockBandedMatrix(Ones(10000, 10000), (fill(100, 100), fill(100, 100)), (1, 1), (1, 1))
Jsparse = sparse(J)
vfx1 = ones(10000)
function f(J, vfx1)
    for col_index in 1:size(J,2)
        #if color[col_index] == color_i # ignored for simplicity
        for row_index in ArrayInterface.findstructralnz(J,col_index) # J is a BBB matrix
            J[row_index,col_index] = vfx1[row_index]
        end
        #end
    end
end
@btime f(J,vfx1)
@btime f(Jsparse,vfx1)
That's generally the code that I'm using now. It seems J[row_index,col_index]=vfx1[row_index] is doing lots of allocations.
How do I get ArrayInterface@1.3.0?
Ah, sorry. The PR hasn't been merged yet, @ChrisRackauckas. Currently, you could use this: https://github.com/JuliaDiffEq/ArrayInterface.jl/pull/23 .
Hmm, I'm having trouble understanding how SparseArrays is so fast. Here's my optimised code:
using BandedMatrices, BlockArrays   # in addition to the packages loaded for the MWE above
cs = BlockArrays.BlockSizes((BlockArrays.cumulsizes(J,2),)) # column block sizes
b = BlockArray(vfx1,cs) # impose block structure
function g(A,b)
    λ,μ = subblockbandwidths(A)
    @inbounds for J=Block.(1:nblocks(A,2)), K=BlockBandedMatrices.blockcolrange(A,J)
        V = view(A,K,J)
        b_v = b.blocks[K.n[1]]
        data = BlockBandedMatrices.bandeddata(V)
        p = pointer(data)
        st = stride(data,2)
        m,n = size(V)
        for j=1:n, k=max(1,j-μ):min(m,j+λ)
            unsafe_store!(p, b_v[k], (j-1)*st + μ + k - j + 1)
        end
    end
end
It's still 100μs, so roughly 200x slower than sparse! Half of the time is running max(1,j-μ):min(m,j+λ), the other half doing the unsafe_store!. These shouldn't be expensive operations...
I got it down to 37μs
function g(A,b)
    @inbounds for J=Block.(1:nblocks(A,2)), K=BlockBandedMatrices.blockcolrange(A,J)
        V = view(A,K,J)
        b_v = b.blocks[K.n[1]]
        data = BlockBandedMatrices.bandeddata(V)
        p = pointer(data)
        st = stride(data,2)
        m,n = size(V)
        unsafe_store!(p, b_v[1], 2)
        unsafe_store!(p, b_v[2], 3)
        for j=2:n-1
            jst = (j-1)*st
            unsafe_store!(p, b_v[j-1], jst + 1)
            unsafe_store!(p, b_v[j], jst + 2)
            unsafe_store!(p, b_v[j+1], jst + 3)
        end
        jst = (n-1)*st
        unsafe_store!(p, b_v[n-1], jst + 1)
        unsafe_store!(p, b_v[n], jst + 2)
    end
end
Now half the cost is in getindex and the other half in unsafe_store!. I guess I'm not accessing memory "optimally": I probably need to go down columns, not go by block.
Why is following columns extremely slow? It's around 1ms!! Then it is strange that the first version using findstructralnz(J,col_index) is the fastest, at only 5μs.
function g(A,b)
    λ,μ = subblockbandwidths(A)
    @inbounds for J=Block.(1:nblocks(A,2))
        blockcolrange = BlockBandedMatrices.blockcolrange(A,J)
        _,n = BlockBandedMatrices.blocksize(A, (blockcolrange[1].n[1], J.n[1]))
        for j=1:n
            for K=blockcolrange
                V = view(A,K,J)
                b_v = b.blocks[K.n[1]]
                data = BlockBandedMatrices.bandeddata(V)
                p = pointer(data)
                st = stride(data,2)
                m,n = size(V)
                for k=max(1,j-μ):min(m,j+λ)
                    unsafe_store!(p, b_v[k], (j-1)*st + μ + k - j + 1)
                end
            end
        end
    end
end
It might be b as a BlockVector causing part of the problem. I'll have to do some more experimenting.
I'm terribly sorry: the sixth line of the MWE should be for col_index in 1:size(J,2) instead of for col_index in 1:length(size(J,2)). As a consequence, the baseline for the CSC matrix is 1.2 ms, and @dlfivefifty your code is extremely fast!!! Though @YingboMa suggests not using unsafe_store!, or it could destroy alias analysis so that vectorization isn't possible.
Ah ok, so I'm not going crazy!
I wasn't serious about unsafe_store!, I just wanted to get rid of all the overhead due to views to do the timing.
Cheers!
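For reference, a hedged sketch of the earlier g written without unsafe_store!, going through bandeddata with plain indexing so the compiler keeps its alias information (g_noptr is just an illustrative name; data[μ + k - j + 1, j] is the slot the pointer arithmetic above targeted):

function g_noptr(A, b)
    λ,μ = subblockbandwidths(A)
    @inbounds for J = Block.(1:nblocks(A,2)), K = BlockBandedMatrices.blockcolrange(A,J)
        V = view(A, K, J)
        b_v = b.blocks[K.n[1]]
        data = BlockBandedMatrices.bandeddata(V)
        m,n = size(V)
        for j = 1:n, k = max(1, j-μ):min(m, j+λ)
            data[μ + k - j + 1, j] = b_v[k]   # same entry as V[k,j]
        end
    end
end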
https://github.com/JuliaDiffEq/DiffEqDiffTools.jl/pull/74 https://github.com/JuliaDiffEq/SparseDiffTools.jl/pull/61
This is quite an amazing feature for systems of PDEs and I can't find this auto-specialization anywhere else. Awesome work @huanglangwen and @dlfivefifty !
BandedBlockBandedMatrix is a common sparsity structure for Jacobians (for instance, it appears when solving PDEs using finite difference methods). SparseDiffTools makes it very fast to color BandedBlockBandedMatrices. However, using BandedBlockBandedMatrices can introduce slowdown in other parts of the code (https://github.com/JuliaDiffEq/DiffEqDiffTools.jl/issues/67). This issue is to check that BandedBlockBandedMatrices are as performant as possible through the entire system. @ChrisRackauckas @huanglangwen