JuliaMath / Interpolations.jl

Fast, continuous interpolation of discrete datasets in Julia
http://juliamath.github.io/Interpolations.jl/

GPU support? #357

Closed. haampie closed this 1 year ago.

haampie commented 4 years ago

I was experimenting with Lanczos interpolation (convolution with the kernel sinc(x)sinc(x/a)) of 3D data. In my use case I typically have O(n^3) data points and O(n^2) interpolation points. So far a fairly mediocre CUDA implementation seems to be faster than Interpolations.jl's quadratic B-spline:

https://nextjournal.com/stabbles/cuda-interpolation

I'm sure the CUDA implementation can be improved quite a bit. Would it be possible to run Interpolations.jl's functions on the GPU? I figure the bottleneck is mostly the LU decomposition in the prefiltering step?
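For reference, the Lanczos-a kernel is simple to state (a hypothetical helper, not the API of either package; note that Julia's sinc is normalized, sinc(x) = sin(πx)/(πx)):

```julia
# Lanczos-a kernel: sinc(x) * sinc(x/a) for |x| < a, zero outside the support.
lanczos(x, a) = abs(x) < a ? sinc(x) * sinc(x / a) : zero(float(x))

lanczos(0.0, 3)   # 1.0 (interpolating: the kernel is 1 at the sample point)
lanczos(0.5, 3)   # ≈ 0.608
```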

timholy commented 4 years ago

Similar: #356. Have you profiled? Single-threaded I see slightly more time for the interpolation itself than for the prefiltering; the random access pattern is pretty murderous on cache performance.

Would be great to have GPU versions under a @require.
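For concreteness, a minimal sketch of what such a hook could look like (the UUID is CUDA.jl's registered package UUID; the included file name is hypothetical):

```julia
# In the package's top-level module (sketch only):
using Requires

function __init__()
    # Load the GPU methods only when the user also loads CUDA.jl.
    @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" include("cuda_support.jl")
end
```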

stillyslalom commented 3 years ago

There's been some recent work on (nearest neighbor/linear/cubic) texture interpolation (https://github.com/JuliaGPU/CUDA.jl/pull/460#issuecomment-701441464), although the API is still pretty rough/experimental. It might already be enough to be worth hooking into Interpolations.jl, though.

mkitti commented 3 years ago

CUDA.jl appears to import Interpolations.jl; that is, CUDA.jl hooks into Interpolations.jl rather than vice versa.

Are you proposing going the other way around? Perhaps there should be a distinct CUDAInterpolations.jl package which implements the intersection of the two packages?

stillyslalom commented 3 years ago

No, right now Interpolations.jl is only used for testing: https://github.com/JuliaGPU/CUDA.jl/search?q=using+interpolations

The standard interface design seems to be, as @timholy noted, siloing additional functionality within a module that's loaded by @require.

mkitti commented 3 years ago

I see, Interpolations is imported there just for the tests.

https://github.com/JuliaGPU/CUDA.jl/blob/465535069383ce66b137c6c64f188e4b9164ec15/test/texture.jl

I'm still leaning towards a distinct CUDAInterpolations.jl package that would depend on Interpolations and CUDA but extend the Interpolations interface for CuArray. The main reason is that such a package would require distinct continuous integration infrastructure that this package does not currently need.

stillyslalom commented 3 years ago

It's totally reasonable to try to keep the CI infra simple for this package. Even with the functionality in a separate package, though, it should still be possible to use Requires.jl to automatically load any relevant CUDAInterpolations.jl methods without needing to explicitly load the package.

mkitti commented 3 years ago

> It's totally reasonable to try to keep the CI infra simple for this package. Even with the functionality in a separate package, though, it should still be possible to use Requires.jl to automatically load any relevant CUDAInterpolations.jl methods without needing to explicitly load the package.

I'll have to see what CUDAInterpolations.jl ends up looking like. I usually have to manage host <-> device transfers pretty carefully so that overhead does not overwhelm the processing speed up.

stillyslalom commented 3 years ago

For my use case, the data would be on the GPU in the first place - no need for transfer. It might be enough to provide constructors like interpolate(a::CuArray{T}, options...) - I don't know if you can load textures directly into texture memory from the CPU.

cc @maleadt

maleadt commented 3 years ago

> For my use case, the data would be on the GPU in the first place - no need for transfer. It might be enough to provide constructors like interpolate(a::CuArray{T}, options...) - I don't know if you can load textures directly into texture memory from the CPU.

Yes, you can bind a texture object to a pre-existing CuArray. If you're uploading from the host, it's better to create a special CuTextureArray (which is padded so that fetches perform better). It's also possible to copy a CuArray to a CuTextureArray (this doesn't leave the device, but there's still some overhead, of course). See https://github.com/JuliaGPU/CUDA.jl/blob/0679e2feba777ac54299c1fbd693263186bead53/test/texture.jl#L41-L100 for examples.
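A rough sketch of that flow, based on the linked tests (the constructors and interpolation flags are part of CUDA.jl's still-experimental texture API; check the tests for the current signatures):

```julia
using CUDA

a = CUDA.rand(Float32, 512, 512)     # data already on the device
ta = CuTextureArray(a)               # device-to-device copy into padded texture memory
tex = CuTexture(ta; interpolation = CUDA.LinearInterpolation())
# Inside a kernel, tex is then fetched with floating-point coordinates,
# and the texture hardware performs the (bi)linear interpolation.
```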

mkitti commented 3 years ago

Great. I'm pretty sure a distinct CUDAInterpolations.jl approach still makes the most sense. It would need to run under the JuliaGPU testing infrastructure.

stillyslalom commented 3 years ago

Nearest-neighbor, bilinear, and kernel-based (Lanczos et al.) interpolants should be pretty straightforward, but quadratic/cubic B-spline interpolants will require reworking the prefiltering step for GPU compatibility. The authors of this paper on prefiltering made their code available here under a BSD license; I'm not sure whether there's any advantage to rewriting it.
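For context, the prefiltering in question is a recursive (IIR) filter run along each axis. A rough single-axis sketch in the style of Unser's classic algorithm (with simplified boundary handling; this is not Interpolations.jl's actual implementation) shows the serial data dependence that makes a naive GPU port hard:

```julia
# Convert samples c into cubic B-spline coefficients, in place.
function prefilter_cubic!(c::AbstractVector{Float64})
    z = sqrt(3) - 2                # pole of the cubic B-spline
    c .*= 6                        # overall gain: (1 - z) * (1 - 1/z) == 6
    for k in 2:length(c)           # causal pass: each step needs the previous result
        c[k] += z * c[k-1]
    end
    c[end] *= z / (z - 1)          # simplified anticausal initialization
    for k in length(c)-1:-1:1      # anticausal pass, right to left
        c[k] = z * (c[k+1] - c[k])
    end
    return c
end
```

Each element depends on the one computed before it, so the usual GPU strategy (as in the linked code) is to parallelize across the other axes, one thread per row or column.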

jmulcahy commented 3 years ago

As a first step, it would be nice to even have an interpolation function that runs on the GPU but is constructed on the CPU. I assume that's probably a simpler process to vectorize? I got a bit confused trying to walk through the code myself.

N5N3 commented 3 years ago

@jmulcahy ExBroadcast.jl offers such support (interpolation on the GPU with an interpolant constructed on the CPU) via the broadcast interface. If you use CUDA.jl, example code is:

```julia
using ExBroadcast, Interpolations, CUDA, Adapt

a = randn(ComplexF32, 401, 401)
itp = CubicSplineInterpolation((-100:0.5f0:100, -100:0.5f0:100), a) # only BSpline is tested (Lanczos should also work)
cuitp = adapt(CuArray, itp)
cuitp.(-100:0.5f0:100, (-100:0.5f0:100)') # 401×401 CuArray{ComplexF32, 2}; ranges are isbits, so they can go into the kernel
cuitp.(CUDA.randn(100), CUDA.randn(100))  # 100-element CuArray{ComplexF32, 1}
cuitp.(randn(100), randn(100))            # error: CPU Arrays can't be passed to a GPU kernel
```

If you use another GPU package, you can find the related code in /src/other_support/interp.jl and make your own patch.

koehlerson commented 3 years ago

I would be very interested to have something for AMDGPU. Any pointers are appreciated :)

N5N3 commented 3 years ago

If you just want to do interpolation on the GPU with an interpolant constructed on the CPU, defining adapt_structure for the AbstractInterpolation types is enough, i.e.:

```julia
using Interpolations
using Interpolations: BSplineInterpolation, Extrapolation, itptype # not all exported
import Adapt: adapt_structure, adapt

adapt_structure(to, itp::BSplineInterpolation{T,N}) where {T,N} = begin
    coefs, parentaxes, it = itp.coefs, itp.parentaxes, itp.it
    coefs′ = adapt(to, coefs)                    # move the coefficients to the device
    Para = map(typeof, (coefs′, it, parentaxes))
    BSplineInterpolation{T,N,Para...}(coefs′, parentaxes, it)
end

adapt_structure(to, itp::Extrapolation{T,N}) where {T,N} = begin
    et = itp.et
    itp′ = adapt(to, itp.itp)                    # recursively adapt the wrapped interpolant
    IT = itptype(itp)
    Extrapolation{T,N,typeof(itp′),IT,typeof(et)}(itp′, et)
end
```

Then use the broadcast system to compute the result, as in the example above.
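A hypothetical AMDGPU.jl version, assuming the adapt_structure definitions above (ROCArray is AMDGPU.jl's device array type):

```julia
using AMDGPU, Adapt, Interpolations

itp = interpolate(rand(Float32, 100), BSpline(Linear())) # Linear() needs no prefiltering
ritp = adapt(ROCArray, itp)                              # move the coefficients to the GPU
xs = ROCArray(collect(range(1f0, 100f0; length = 10_000)))
ritp.(xs)                                                # broadcast evaluates on the device
```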

koehlerson commented 3 years ago

ExBroadcast fails to precompile for me, and I guess it's needed in order for this to work?

https://github.com/N5N3/ExBroadcast.jl/blob/main/src/util.jl#L106 Base.@invoke is not defined

N5N3 commented 3 years ago

Well, ExBroadcast.jl is a package for self-use, so I didn't ensure compatibility. Base.@invoke was introduced in a newer Julia release. The code you need is all included in https://github.com/N5N3/ExBroadcast.jl/blob/main/src/other_support/interp.jl. Put it in a .jl file, remove lines 52 and 110, and include the file; that should work.

koehlerson commented 3 years ago

So 1.7? I'm on 1.6.1.

I'll give it a try, thanks for sharing the information!

marius311 commented 2 years ago

I took a look at this, and it seems that at least for BSpline it's just a missing adapt method; with that defined it works, saturating the GPU (an A100 here) at 100% and giving a ~50X speed boost:

```julia
using Interpolations, CUDA, CUDA.Adapt, BenchmarkTools

function Adapt.adapt_structure(to, itp::Interpolations.BSplineInterpolation{T,N,<:Any,IT,Axs}) where {T,N,IT,Axs}
    coefs = Adapt.adapt_structure(to, itp.coefs)
    Tcoefs = typeof(coefs)
    Interpolations.BSplineInterpolation{T,N,Tcoefs,IT,Axs}(coefs, itp.parentaxes, itp.it)
end

y = rand(10)
itp = interpolate(y, BSpline(Quadratic(Reflect(OnCell()))))
cu_itp = cu(itp);
new_x = collect(range(1, 10, length=1000000))
cu_new_x = cu(new_x)

@btime itp.(new_x) # 2.5ms
@btime CUDA.@sync cu_itp.(cu_new_x) # 46μs
```

I unfortunately don't have the time right now to do this for everything or open a PR, but hopefully someone finds this helpful.

longemen3000 commented 1 year ago

Closed by #504?