dlfivefifty opened 8 years ago
I think whether or not generated functions can be used to automate the various `getindex` routines should be thought through before trying to do this, as it might simplify future operators considerably.
Essentially, I think it boils down to combining zero structure with rational functions for the coefficients. This could also be used to build operators on the GPU.
@MikaelSlevinsky @marcusdavidwebb Probably both of you have thought about structured operators a fair bit. Any thoughts on whether this is possible/what is the most general type of structured operator where the entries satisfy a pattern?
I suppose the `Chebyshev()` -> `Legendre()` conversion needs something beyond rational functions because the entries build on the previous ones (think Pochhammer symbols). These operators should probably be handled separately, as I don't think the GPU is feasible with such a dependency.
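To make that dependency concrete, here is a toy sketch (not ApproxFun code; the function name is invented) of entries built from a Pochhammer-style rising-factorial recurrence, where entry `k` cannot be computed without entry `k-1`, unlike a rational formula in `k` alone:

```julia
# Toy illustration (not ApproxFun code): entries built from a
# Pochhammer-style recurrence, so entry k depends on entry k-1 --
# a sequential dependency that is awkward to parallelise on a GPU.
function pochhammer_entries(a, n)
    c = zeros(n)
    c[1] = 1.0                        # (a)_0 = 1
    for k = 2:n
        c[k] = c[k-1] * (a + k - 2)   # (a)_{k-1} = (a)_{k-2} * (a + k - 2)
    end
    c
end
```

For example, `pochhammer_entries(0.5, 3)` gives `[1.0, 0.5, 0.75]`, each entry a multiple of the previous one.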
I'm trying to understand your suggestions. To achieve this kind of entry-wise performance, a `SubOperator` would be a collection of distinct `SubOperator`s for every one of its constituents, plus the algebra that relates them. Then all the `getindex` methods could be remodelled as `addentries!` methods that could call previous entries in the constituent operators. Is this even remotely on track?
One idea I've been contemplating for a while now to speed up the periodic side is to hard-code `Fourier` and `Laurent`, rather than having them as `DirectSumSpace`s. They are archetypical `SumSpace`s, but there are certain inefficiencies in the modifier that one would not expect in such primitive spaces.
Sorry, I was being confusing. Right now the details of `getindex` need to be repeated twice (here I've simplified the code):
```julia
function getindex(::Derivative{Chebyshev}, k::Integer, j::Integer)
    if j == k+1
        k
    else
        0
    end
end
```
And then again in a special implementation of `convert` that avoids the `if` statement:
```julia
function convert(::Type{BandedMatrix}, S::SubOperator{Derivative{Chebyshev}})
    ret = bzeros(S)
    for k = 1:size(ret,1)
        ret[k,k+1] = k
    end
    ret
end
```
Having to repeat the details of `getindex` is annoying at best and bug-prone at worst. So the question is how we can set things up so that only one implementation is needed.
In the case of banded operators, we could for example do the following (`BandedOperator` would not actually be added back; rather, this would be done with a macro `@banded`):
```julia
banded_getindex(::Derivative{Chebyshev}, ::Type{Val{1}}, k) = k

function getindex(B::BandedOperator, k, j)
    if bandinds(B,1) ≤ j-k ≤ bandinds(B,2)
        banded_getindex(B, Val{j-k}, k)
    else
        0
    end
end

@generated function convert(::Type{BandedMatrix}, S::SubOperator{BandedOperator})
    # loop over bandinds, calling banded_getindex with Val{b},
    # where the band b is chosen at compile time
end
```
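A runnable toy version of this scheme (illustrative `toy_`-prefixed names, not ApproxFun API, and a plain matrix standing in for banded storage) might look as follows; a `@generated` version would unroll the band loop at compile time instead of looping at runtime:

```julia
# Toy sketch: one banded_getindex method per band, plus a generic
# band-by-band fill that never branches per entry.
struct ToyDerivative end

toy_bandinds(::ToyDerivative) = (0, 1)                       # bands 0 and 1 only
toy_banded_getindex(::ToyDerivative, ::Type{Val{1}}, k) = k  # D[k,k+1] = k
toy_banded_getindex(::ToyDerivative, ::Type{Val{0}}, k) = 0  # zero diagonal

function toy_fillbands!(ret, op)
    l, u = toy_bandinds(op)
    for b = l:u, k = 1:size(ret, 1)          # loop over bands, then rows
        1 ≤ k + b ≤ size(ret, 2) || continue
        ret[k, k+b] = toy_banded_getindex(op, Val{b}, k)
    end
    ret
end
```

Then `toy_fillbands!(zeros(Int, 3, 4), ToyDerivative())` fills the superdiagonal with `1, 2, 3` using only the single per-band entry definitions.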
But this wouldn't work with Fourier operators. So the question is whether there is a more general pattern that can be captured at compile time to avoid `if` statements.
In the case of banded matrices, would the bandwidths need to be accessible as type parameters of the `SubOperator` for staging to work?
Oh, it appears the bandwidth is typed in the actual code.
Why do we need `getindex` if there is a conversion to a banded/ragged matrix?
We need `getindex` because it's the easiest and most general way to implement operators. Also, when we get to populating arrays on the GPU, first writing to a banded/ragged matrix is no longer efficient.
The idea is to do code separation. Each operator should know nothing about the details of the data storage. Each data storage should know nothing about operators. Instead, the operator should only specify its structure and the entries in a suitable format. Then a separate piece of code would know about both operator structure formats and the details of the data storage so that it can populate the data.
This mimics the Model-View-Controller paradigm, where the GUI (View) knows nothing about the data, the data (Model) knows nothing about the GUI, and a separate piece of code (Controller) controls the information flow between the View and the Model.
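A minimal sketch of that separation (all `toy_` names are invented for illustration, not an API proposal): the operator exposes only its structure and an entry formula, the storage knows nothing about operators, and a "controller" function knows both and populates the data.

```julia
# "Model": a toy operator that only specifies structure and entries.
struct ToyShift end
toy_structure(::ToyShift) = (0, 1)                    # (lower, upper) bandwidths
toy_entry(::ToyShift, k, j) = j == k + 1 ? 1.0 : 0.0  # entry formula

# "Controller": fills any AbstractMatrix storage from any operator
# implementing toy_structure and toy_entry, touching only the band.
function toy_populate!(data::AbstractMatrix, op)
    l, u = toy_structure(op)
    for k = 1:size(data, 1), j = max(1, k-l):min(size(data, 2), k+u)
        data[k, j] = toy_entry(op, k, j)
    end
    data
end
```

The same `toy_populate!` would work unchanged for a dense matrix, a banded matrix, or (in principle) a GPU array, since only the controller knows about the storage.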
The example code above was not an actual proposal, but just to give an idea of how this could possibly be done.
This contrasts with the current situation, where the `convert(::Type{BandedMatrix},::SubOperator)` methods need to know both the details of `BandedMatrix` and the actual entries of the operator.
(Though I think once 0.4 support is dropped, they can at least be rewritten to use `@inbounds` to get rid of the knowledge of how `BandedMatrix` is stored.)
But I think this is not close to being settled, so in the interim probably just do special overrides for Fourier.
If it's to be on the GPU, I wonder if the `getindex` approach is sufficient when blocks of the operator are dispatched to each of the cores. (How slow are conditional statements?)
Can you do conditional statements on the GPU? It still definitely won't be as optimal as being able to do, e.g. for `Fourier`:

```julia
A[band(1)][1:2:2n-1] = (1:n)
A[band(-1)][1:2:2n-1] = -(1:n)
```
I don't have first-hand experience, but here is a CUDA GPU repository, and it has conditional statements in the Givens rotations.
That is a sick use of `band`, by the way.
I think trying to use CUDA directly is too daunting, so I'm thinking more in the ArrayFire.jl vein. In this case we could have

```julia
immutable AFBandedMatrix{T}
    data::AFArray{T,2}
    l::Int
    u::Int
end
```
and so in principle the above code would work for both `BandedMatrix` and `AFBandedMatrix`, as `1:n` requires no data transfer, and I think with the right overrides the data would only ever live on the GPU.
`ArrayFire` also has a `qr` routine, but I believe it's slower than the CPU. Even so, it might still outperform the current adaptive QR approach: right now, generating the operators can be about as expensive as performing Householder, and once the operator is prefactored, applying Q is just dot products and hence could be done very fast on the GPU.
(Why I care about speed: PDEs in 3D)
I now think the right approach to speeding this up is to stop using `BandedMatrix` solvers and instead switch everything to `BandedBlockMatrix`. This avoids the tedious and annoying task of working out `bandinds`, as `blockbandinds` is trivial. I think the speed penalty will be minimal and, if anything, it would be a lot faster for large blocks, as then BLAS Level 3 can be used.
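A toy sketch of why `blockbandinds` is trivial to read off while the blocks stay BLAS-friendly (this is not the `BandedBlockMatrix` API; all `Toy`-prefixed names are invented): store dense blocks keyed by block index, so the block bandwidths are just the extreme block-band offsets, and multiplication reduces to dense block products that dispatch to BLAS for large blocks.

```julia
# Toy block-banded matrix: dense blocks stored by block index (K, J).
struct ToyBlockBanded
    blocks::Dict{Tuple{Int,Int},Matrix{Float64}}  # (K, J) => dense block
    blocksize::Int
    nblocks::Int
end

# The block bandwidths are read straight off the stored block indices.
toy_blockbandinds(A::ToyBlockBanded) =
    extrema(J - K for (K, J) in keys(A.blocks))

function toy_blockmul(A::ToyBlockBanded, x::Vector{Float64})
    b, n = A.blocksize, A.nblocks
    y = zeros(b * n)
    for ((K, J), B) in A.blocks
        rows = (K-1)*b+1 : K*b
        cols = (J-1)*b+1 : J*b
        y[rows] .+= B * x[cols]   # dense block product, BLAS for large blocks
    end
    y
end
```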
(That comment is independent of 'pattern' operator building)
Fourier ODE solving is dreadfully slow, because it resorts to many `getindex` calls.