JuliaApproximation / ApproxFun.jl

Julia package for function approximation
http://juliaapproximation.github.io/ApproxFun.jl/

Speed up Fourier ODE solving #395

Open dlfivefifty opened 8 years ago

dlfivefifty commented 8 years ago

Fourier ODE solving is dreadfully slow. This is because it resorts to many getindex calls.

dlfivefifty commented 8 years ago

I think we should think through whether generated functions can be used to automate the various getindex routines before trying to do this, as it might simplify future operators considerably.

Essentially, I think it boils down to combining zero structure with rational functions for the coefficients. This could also be used to build operators on the GPU.

@MikaelSlevinsky @marcusdavidwebb Probably both of you have thought about structured operators a fair bit. Any thoughts on whether this is possible/what is the most general type of structured operator where the entries satisfy a pattern?

I suppose the Chebyshev() -> Legendre() conversion needs something beyond rational functions because the entries build on the previous ones (think Pochhammer symbols). These operators should probably be handled separately, as I don't think the GPU is feasible with such a dependency.
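As a toy illustration of that kind of dependency (the recurrence below is made up for illustration and is NOT the actual Chebyshev() -> Legendre() entries):

```julia
# Toy example of recurrence-built entries: each value is a Pochhammer-style
# ratio times the previous one, so entry k cannot be computed without entry
# k-1 -- the dependency that makes an embarrassingly parallel GPU fill awkward.
function recurrence_entries(n)
    v = Vector{Float64}(undef, n)
    v[1] = 1.0
    for k in 2:n
        v[k] = v[k-1] * (k - 0.5) / k   # hypothetical rising-factorial ratio
    end
    v
end
```

A rational-function pattern gives entry (k, j) directly from (k, j); here the loop-carried dependence on v[k-1] is exactly what breaks that.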

MikaelSlevinsky commented 8 years ago

I'm trying to understand your suggestions. To achieve this kind of entry-wise performance, a SubOperator would be a collection of distinct SubOperators for every one of its constituents, and the algebra that relates them. Then, all the getindex methods could be remodelled as addentries! methods that could call previous entries in the constituent operators. Is this even remotely on track?

One idea I've been contemplating for a while now to speed up the periodic side is to hard-code Fourier and Laurent, rather than having them as DirectSumSpaces. They are archetypical SumSpaces, but there are certain inefficiencies in the modifier that one would not expect in such primitive spaces.

dlfivefifty commented 8 years ago

Sorry, I was being confusing. Right now the details of getindex need to be specified twice (here I have simplified the code):

function getindex(::Derivative{Chebyshev}, k::Integer, j::Integer)
    if j == k+1
        k
    else
        0
    end
end

And then in a special implementation of convert that avoids the if statement:

function convert(::Type{BandedMatrix}, S::SubOperator{Derivative{Chebyshev}})
    ret = bzeros(S)
    for k = 1:size(ret, 1)
        ret[k, k+1] = k
    end
    ret
end

Having to repeat the details of getindex is annoying at best, and bug prone at worst. So the question is how can we set it up so that only one implementation is needed.

In the case of banded operators, we could for example do the following (BandedOperator would not actually be added back, but rather this would be done with a macro @banded):

banded_getindex(::Derivative{Chebyshev},::Val{1},k) = k

function getindex(B::BandedOperator, k, j)
    if bandinds(B, 1) ≤ j-k ≤ bandinds(B, 2)
        banded_getindex(B, Val{j-k}, k)
    else
        0
    end
end

@generated function convert(::Type{BandedMatrix},S::SubOperator{BandedOperator})
   # do some loop over bandinds that calls banded_getindex with Val{k} where k is chosen at compile time
end

But this wouldn't work with Fourier operators. So the question is whether there is a more general pattern that can be captured at compile time to avoid if statements.

MikaelSlevinsky commented 8 years ago

In the case of banded matrices, would the bandwidths need to be accessible as type parameters of the SubOperator for staging to work?

MikaelSlevinsky commented 8 years ago

Oh it appears the bandwidth is typed in the actual code.

MikaelSlevinsky commented 8 years ago

Why do we need getindex if there is conversion to a banded/ragged matrix?

dlfivefifty commented 8 years ago

We need getindex because that's the easiest and most general way to implement operators. Also, when we get to populating arrays on the GPU, first writing to a banded/ragged matrix is no longer efficient.

The idea is to do code separation. Each operator should know nothing about the details of the data storage. Each data storage should know nothing about operators. Instead, the operator should only specify its structure and the entries in a suitable format. Then a separate piece of code would know about both operator structure formats and the details of the data storage so that it can populate the data.
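A minimal sketch of that separation, with all names hypothetical: the operator declares only its band structure and an entry rule, and a single generic routine (playing the "controller") pours those entries into whatever storage it is handed.

```julia
# Hypothetical sketch of the separation described above: the operator knows
# nothing about storage, the storage knows nothing about operators, and
# populate! mediates between the two.
struct ChebDerivative end                            # stand-in for Derivative{Chebyshev}
bandrange(::ChebDerivative) = 1:1                    # structure: only superdiagonal 1
entry(::ChebDerivative, k, j) = j == k + 1 ? k : 0   # entries in a uniform format

function populate!(A::AbstractMatrix, op)
    for b in bandrange(op), k in 1:size(A, 1)
        j = k + b
        1 ≤ j ≤ size(A, 2) && (A[k, j] = entry(op, k, j))
    end
    A
end
```

Because populate! only goes through setindex!, the same code could in principle target a dense Matrix, a banded type, or GPU-backed storage with a suitable method.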

This mimics the Model-View-Controller paradigm, where the GUI (View) knows nothing about the data, the data (Model) knows nothing about the GUI, and a separate piece of code (Controller) controls the information flow between the View and the Model.

The example code above was not an actual proposal, but just to give an idea of how this could possibly be done.

dlfivefifty commented 8 years ago

This contrasts with the current situation, where the convert(::Type{BandedMatrix},::SubOperator) methods need to know both the details of BandedMatrix and the actual entries of the operator.

(Though I think once 0.4 support is dropped, they can at least be rewritten to use @inbounds to get rid of the knowledge of how BandedMatrix is stored.)
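A hedged sketch of that rewrite (names hypothetical): the fill goes through the generic setindex! interface under @inbounds, so it never touches how BandedMatrix lays out its data.

```julia
# Sketch of filling through generic setindex! with @inbounds, rather than
# writing into the matrix's internal storage directly (names hypothetical).
function fill_chebyshev_derivative!(ret::AbstractMatrix, n)
    @inbounds for k in 1:n
        ret[k, k+1] = k   # only setindex!; no knowledge of the storage layout
    end
    ret
end
```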

dlfivefifty commented 8 years ago

But I think this is not close to being settled, so in the interim probably just do special overrides for Fourier.

MikaelSlevinsky commented 8 years ago

If it's to be on the GPU, I wonder if the getindex approach is sufficient, if blocks of the operator are dispatched to each of the cores. (How slow are conditional statements?)

dlfivefifty commented 8 years ago

Can you do conditional statements on the GPU? It still definitely won't be as optimal as being able to do, e.g. for Fourier:

A[band(1)][1:2:2n-1]=(1:n)
A[band(-1)][1:2:2n-1]=-(1:n)
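For reference, a plain-Array emulation of those two band assignments (the CartesianIndex vectors stand in for the A[band(±1)] indexing; the exact basis ordering and signs follow the snippet, not ApproxFun itself):

```julia
# Two whole-band strided writes, no per-entry branching -- the pattern the
# band-indexing snippet above expresses.
function fourier_derivative_fill(n)
    N = 2n + 1
    A = zeros(Int, N, N)
    k = 1:2:2n-1                               # odd positions along each band
    A[CartesianIndex.(k, k .+ 1)] .= 1:n       # like A[band(1)][1:2:2n-1] = 1:n
    A[CartesianIndex.(k .+ 1, k)] .= -(1:n)    # like A[band(-1)][1:2:2n-1] = -(1:n)
    A
end
```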

MikaelSlevinsky commented 8 years ago

I don't have first-hand experience, but here is a CUDA GPU repository, and it has conditional statements in the Givens rotations.

MikaelSlevinsky commented 8 years ago

That is a sick use of band by the way.

dlfivefifty commented 8 years ago

I think trying to use CUDA directly is too daunting, so I'm thinking more in the ArrayFire.jl vein. In this case we could have

struct AFBandedMatrix{T}
    data::AFArray{T,2}
    l::Int
    u::Int
end

and so in principle the above code would work for both BandedMatrix and AFBandedMatrix, as 1:n requires no data transfer and I think with the right overrides the data will only ever live on the GPU.

dlfivefifty commented 8 years ago

ArrayFire also has a qr routine, but I believe it's slower than the CPU. But I think that might still outperform the current adaptive QR approach, as right now generating the operators can be about as expensive as performing Householder, and once the operator is pre-factored, applying Q is just dot products, hence could be done very fast on the GPU.

dlfivefifty commented 8 years ago

(Why I care about speed: PDEs in 3D)

dlfivefifty commented 7 years ago

I now think the right approach to speeding this up is to stop using BandedMatrix solvers and instead switch everything to BandedBlockMatrix. This avoids the tedious and annoying task of working out bandinds, as blockbandinds is trivial. I think the speed penalty will be minimal, and if anything, would be a lot faster for large blocks as then BLAS Level 3 can be used.
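A toy check of that claim, with a hypothetical blocking of the Fourier basis: the scalar bandwidths of differentiation require working out the sin/cos interleaving, but the block-bandwidths fall out immediately.

```julia
# Toy illustration (hypothetical blocking): group Fourier coefficients as
# [constant], [sin x, cos x], [sin 2x, cos 2x], ...  Differentiation couples
# only each sin/cos pair, so every nonzero sits in a diagonal block and the
# block-bandwidths are (0, 0) -- trivial, unlike the scalar bandinds.
blockof(i) = i == 1 ? 1 : (i + 2) ÷ 2      # coefficient index -> block number

function blockbandwidths(nonzeros)         # nonzeros: Vector of (row, col) pairs
    l = u = 0
    for (k, j) in nonzeros
        d = blockof(j) - blockof(k)
        u = max(u, d)
        l = max(l, -d)
    end
    (l, u)
end

# nonzeros of d/dx on [1, sin x, cos x, sin 2x, cos 2x]
nz = [(2, 3), (3, 2), (4, 5), (5, 4)]
```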

dlfivefifty commented 7 years ago

(That comment is independent of 'pattern' operator building)