Similar: #356. Have you profiled? Single-threaded, I see slightly more time spent in the interpolation itself than in the prefiltering; the random access pattern is pretty murderous on cache performance.
Would be great to have GPU versions under a `@require`.
There's been some recent work on (nearest neighbor/linear/cubic) texture interpolation (https://github.com/JuliaGPU/CUDA.jl/pull/460#issuecomment-701441464), although the API is still pretty rough/experimental. It might already be enough to be worth hooking into Interpolations.jl, though.
CUDA.jl appears to import Interpolations.jl; that is, CUDA.jl appears to hook into Interpolations.jl rather than vice versa. Are you proposing going the other way around? Perhaps there should be a distinct CUDAInterpolations.jl package which implements the intersection of the two packages?
No, right now Interpolations.jl is only used for testing: https://github.com/JuliaGPU/CUDA.jl/search?q=using+interpolations
The standard interface design seems to be, as @timholy noted, siloing additional functionality within a module that's loaded by `@require`.
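For reference, a minimal sketch of that Requires.jl pattern (the `cuda_support.jl` file name is hypothetical; the UUID is CUDA.jl's):

```julia
using Requires

function __init__()
    # cuda_support.jl would only be included if the user also loads CUDA.jl:
    @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" include("cuda_support.jl")
end
```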
I see, Interpolations is imported there just for the tests.
https://github.com/JuliaGPU/CUDA.jl/blob/465535069383ce66b137c6c64f188e4b9164ec15/test/texture.jl
I'm still leaning towards a distinct CUDAInterpolations.jl package that would depend on Interpolations and CUDA but extend the Interpolations interface for `CuArray`. The main reason is that such a package would require distinct continuous integration infrastructure that this package does not currently need.
It's totally reasonable to try to keep the CI infra simple for this package. Even with the functionality in a separate package, though, it should still be possible to use Requires.jl to automatically load any relevant CUDAInterpolations.jl methods without needing to explicitly load the package.
I'll have to see what CUDAInterpolations.jl ends up looking like. I usually have to manage host <-> device transfers pretty carefully so that the overhead does not overwhelm the processing speedup.
For my use case, the data would be on the GPU in the first place - no need for transfer. It might be enough to provide constructors like `interpolate(a::CuArray{T}, options...)` - I don't know if you can load textures directly into texture memory from the CPU.
cc @maleadt
> For my use case, the data would be on the GPU in the first place - no need for transfer. It might be enough to provide constructors like `interpolate(a::CuArray{T}, options...)` - I don't know if you can load textures directly into texture memory from the CPU.
Yes, you can bind a texture object to a pre-existing CuArray. If you're uploading from the host, it's better to create a special CuTextureArray (which is padded so that fetches perform better). It's also possible to copy a CuArray to a CuTextureArray (this doesn't leave the device, but there's still some overhead, of course). See https://github.com/JuliaGPU/CUDA.jl/blob/0679e2feba777ac54299c1fbd693263186bead53/test/texture.jl#L41-L100 for examples.
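For orientation, a rough sketch of what that looks like, based on the test file linked above; the texture API was experimental at the time, so names and keywords may have changed since:

```julia
using CUDA

a = CUDA.rand(Float32, 256, 256)  # data already resident on the device

# Copy into the padded texture layout (stays on the device, small overhead):
ta = CuTextureArray(a)

# Bind a texture object with hardware linear filtering:
tex = CuTexture(ta; interpolation = CUDA.LinearInterpolation())

# `tex` is then sampled inside kernels, e.g. `tex[x, y]`, with the
# interpolation performed by the texture units.
```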
Great. I'm pretty sure a distinct CUDAInterpolations.jl approach still makes the most sense. It would need to run under the JuliaGPU testing infrastructure.
Nearest-neighbor, bilinear, and kernel-based (Lanczos et al.) interpolants should be pretty straightforward, but quadratic/cubic B-spline interpolants will require reworking the prefiltering step for GPU compatibility. The authors of this paper on prefiltering made their code available here under a BSD license - not sure whether there's any advantage to rewriting it.
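To make the prefiltering issue concrete: for cubic B-splines on a uniform grid, the interior prefiltering equations reduce to a tridiagonal solve (boundary conditions aside), and it's that inherently sequential solve that needs reworking for the GPU. A minimal illustration:

```julia
using LinearAlgebra

# Interior cubic B-spline collocation: the coefficients c must satisfy
# (1/6)c[i-1] + (4/6)c[i] + (1/6)c[i+1] = y[i]  (boundaries ignored here).
y = rand(100)
A = SymTridiagonal(fill(4/6, 100), fill(1/6, 99))
c = A \ y  # the sequential step that would need a GPU-friendly reformulation
```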
As a first step, it would be nice to even have an interpolation function that runs on the GPU but is constructed on the CPU. I assume that's probably a simpler process to vectorize? I got a bit confused trying to walk through the code myself.
@jmulcahy ExBroadcast.jl offers such support (interpolation on the GPU with an interpolant constructed on the CPU) via the broadcast interface. If you use CUDA.jl, example code is:
```julia
using ExBroadcast, Interpolations, CUDA, Adapt

a = randn(ComplexF32, 401, 401)
itp = CubicSplineInterpolation((-100:0.5f0:100, -100:0.5f0:100), a) # only BSpline is tested (Lanczos should also work)
cuitp = adapt(CuArray, itp)
cuitp.(-100:0.5f0:100, (-100:0.5f0:100)')  # 401×401 CuArray{ComplexF32, 2}
cuitp.(CUDA.randn(100), CUDA.randn(100))   # 100-element CuArray{ComplexF32, 1}
cuitp.(randn(100), randn(100))             # errors: CPU inputs with a GPU interpolant
```
If you use another GPU package, you can find the relevant code in /src/other_support/interp.jl and make your own patch.
I would be very interested to have something for AMDGPU. Any pointers are appreciated :)
If you just want to do interpolation on the GPU with a kernel constructed on the CPU, defining `adapt_structure` for the `AbstractInterpolation` is enough, i.e.:
```julia
using Interpolations: BSplineInterpolation, Extrapolation, itptype
import Adapt: adapt_structure, adapt

# Rebuild the interpolant with its coefficient array moved to the device.
adapt_structure(to, itp::BSplineInterpolation{T,N}) where {T,N} = begin
    coefs, parentaxes, it = itp.coefs, itp.parentaxes, itp.it
    coefs′ = adapt(to, coefs)
    Para = map(typeof, (coefs′, it, parentaxes))
    BSplineInterpolation{T,N,Para...}(coefs′, parentaxes, it)
end

# Same for extrapolations: adapt the wrapped interpolant.
adapt_structure(to, itp::Extrapolation{T,N}) where {T,N} = begin
    et = itp.et
    itp′ = adapt(to, itp.itp)
    IT = itptype(itp)
    Extrapolation{T,N,typeof(itp′),IT,typeof(et)}(itp′, et)
end
```
Then use the broadcast system to compute the result, as above.
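For concreteness, a minimal usage sketch given those definitions (shown with CUDA.jl; for AMDGPU you would presumably adapt to `ROCArray` instead):

```julia
using Adapt, CUDA, Interpolations

itp = interpolate(randn(Float32, 100), BSpline(Cubic(Line(OnGrid()))))
cu_itp = adapt(CuArray, itp)                  # coefficients now live on the GPU

xs = CUDA.rand(Float32, 1000) .* 98f0 .+ 1f0  # sample points inside [1, 99]
cu_itp.(xs)                                   # evaluated on the device via broadcast
```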
ExBroadcast fails for me during precompilation, and I guess it's needed for this to work? At https://github.com/N5N3/ExBroadcast.jl/blob/main/src/util.jl#L106, `Base.@invoke` is not defined.
Well, ExBroadcast.jl is a package for my own use, so I didn't ensure compatibility. `Base.@invoke` was introduced in a newer Julia release. The code you need is all included in https://github.com/N5N3/ExBroadcast.jl/blob/main/src/other_support/interp.jl. Put it in a `.jl` file, remove L52 and L110, and `include` the file; that should work.
So 1.7? I'm on 1.6.1.
I'll give it a try, thanks for sharing the information!
I took a look at this, and it seems that, at least for BSpline, it's just a missing `adapt` method. With that added it works and is able to saturate the GPU (an A100 here) to 100%, giving a 50X speed boost:
```julia
using Interpolations, CUDA, CUDA.Adapt, BenchmarkTools

function Adapt.adapt_structure(to, itp::Interpolations.BSplineInterpolation{T,N,<:Any,IT,Axs}) where {T,N,IT,Axs}
    coefs = Adapt.adapt_structure(to, itp.coefs)
    Tcoefs = typeof(coefs)
    Interpolations.BSplineInterpolation{T,N,Tcoefs,IT,Axs}(coefs, itp.parentaxes, itp.it)
end

y = rand(10)
itp = interpolate(y, BSpline(Quadratic(Reflect(OnCell()))))
cu_itp = cu(itp);

new_x = collect(range(1, 10, length=1000000))
cu_new_x = cu(new_x)

@btime itp.(new_x)                   # 2.5 ms
@btime CUDA.@sync cu_itp.(cu_new_x)  # 46 μs
```
I unfortunately don't have the time right now to do this for everything / PR, but hopefully someone finds this helpful.
Closed by #504?
I was experimenting with Lanczos interpolation (convolution with `sinc(x)sinc(x/a)`) of 3D data. In my use case I typically have O(n^3) data points and O(n^2) interpolation points. So far a very mediocre CUDA implementation seems to be faster than Interpolations.jl's quadratic B-spline: https://nextjournal.com/stabbles/cuda-interpolation

I'm sure the CUDA implementation can be improved quite a bit. Would it be possible to run Interpolations.jl's functions on the GPU? I figure the bottleneck is mostly the LU decomposition?
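For reference, the kernel mentioned above is simple to write down; a minimal definition using Julia's normalized `sinc` (i.e. sin(πx)/(πx)):

```julia
# Lanczos kernel with support radius a; zero outside |x| < a.
lanczos(x, a) = abs(x) < a ? sinc(x) * sinc(x / a) : zero(float(x))
```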