JuliaGaussianProcesses / KernelFunctions.jl

Julia package for kernel functions for machine learning
https://juliagaussianprocesses.github.io/KernelFunctions.jl/stable/
MIT License

State of GPU support #431

Open simsurace opened 2 years ago

simsurace commented 2 years ago

I wanted to ask for an overview of the current state of GPU support in this package. There appear to be several issues about whether the package works nicely on the GPU (i.e. with GPU arrays as inputs) and several proposed solutions, but getting KernelFunctions.jl to work on the GPU seems to be held up by other things breaking, such as AD.

I was wondering whether a clear path forward is already emerging. Since I've done some GPU work before, I'd be happy to help get this package working on the GPU.

By GPU support I mean, at a minimum, that the following should be possible:

using CUDA, KernelFunctions
CUDA.allowscalar(false)  # fail loudly instead of silently falling back to slow scalar indexing
x = CUDA.rand(16)        # 16 scalar inputs stored on the GPU
k = SEKernel()
kernelmatrix(k, x)       # should return a 16x16 CuMatrix
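
Presumably the same would hold for the two-argument and diagonal variants (hedged expectations, assuming a CUDA-capable device; y here is just an illustrative second input set):

using CUDA, KernelFunctions
CUDA.allowscalar(false)
x, y = CUDA.rand(16), CUDA.rand(32)
k = SEKernel()
kernelmatrix(k, x, y)      # expected: a 16x32 CuMatrix, computed without scalar indexing
kernelmatrix_diag(k, x)    # expected: a 16-element CuVector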

devmotion commented 2 years ago

There is already some ongoing work, as well as open issues and corresponding PRs, regarding GPU support. Some issues/tasks are also listed in https://github.com/JuliaGaussianProcesses/ApproximateGPs.jl/issues/15. Relevant in KernelFunctions are, e.g., https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/issues/299, https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/issues/380, and the linked PRs. Unfortunately, I was busy with other things, but I'll try to pick up and focus on https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/pull/397 in the coming weeks.

simsurace commented 2 years ago

Thanks! From (superficially) reading the issues and PRs (also e.g. #386), there seem to be some roadblocks preventing things from progressing. Or is it just a lack of time/resources at this point? For example, is #386 the preferred direction for making kernel evaluation/operations possible on the GPU, and do the AD issues with that approach seem resolvable? (I'm not very familiar with the entirety of the AD landscape and have not yet gone into the details of that conversation, but it seems quite hard.) Or should we explore alternative ways of making that happen (KernelAbstractions, custom CUDA kernels, whatnot)? I'm interested in helping move this forward, but I'm not sure where it is most effective to invest time.
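
To illustrate the kind of alternative I have in mind, here is a very rough KernelAbstractions-based sketch of a 1-D SE kernel matrix (purely illustrative, assuming a recent KernelAbstractions/CUDA; not something any of the linked PRs actually implement):

using CUDA, KernelAbstractions

# naive pairwise SE kernel, one work-item per output entry
@kernel function se_kernelmatrix!(K, x, y)
    i, j = @index(Global, NTuple)
    @inbounds K[i, j] = exp(-abs2(x[i] - y[j]) / 2)
end

x = CUDA.rand(16)
K = CUDA.zeros(16, 16)
backend = get_backend(K)                          # CUDABackend() for CuArrays
se_kernelmatrix!(backend)(K, x, x; ndrange = size(K))
KernelAbstractions.synchronize(backend)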

On a related note, I already did some easy fixes to prevent scalar indexing that was being triggered when running the optimizer in this minimum working example, mainly while evaluating AbstractGPs._compute_intermediates, such as this cholesky on diagonal matrices issue or this thing with symmetric matrices. There are a few more things along those lines to hunt down.
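
For anyone hunting similar fallbacks, a minimal pattern (assuming a CUDA-capable device) is to disable scalar indexing and call the wrapped operations directly; anything still hitting a generic CPU-style fallback then throws instead of silently running slowly:

using CUDA, LinearAlgebra
CUDA.allowscalar(false)
d = CUDA.rand(8) .+ 1f0          # strictly positive entries, so Diagonal(d) is positive definite
cholesky(Diagonal(d))            # throws if this hits a scalar-indexing fallback
A = CUDA.rand(8, 8)
cholesky(Symmetric(A * A'))      # same check for the Symmetric wrapper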

theogf commented 2 years ago

Speaking only for #386, it is both a matter of time and of choices. There is unfortunately no silver bullet for handling AD, GPU support, and co. while keeping optimal performance. I still think #386 is the solution, though (with some decisions to be made on what the generic fallback should be).
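
As a point of reference for that decision, one obvious candidate for a GPU-friendly generic fallback is plain broadcasting, e.g. for the SE kernel on 1-D inputs (just a sketch, not what #386 actually does):

using CUDA
x  = CUDA.rand(16)
D2 = abs2.(x .- permutedims(x))   # pairwise squared distances, stays on the GPU
K  = exp.(-D2 ./ 2)               # SE kernel applied elementwise, no scalar indexing

Whether something along these lines can serve as the generic fallback without giving up the performance of the specialised CPU paths is presumably part of the decision mentioned above.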