getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Support for lazy indexing / lookup? #367

Open jaschau opened 3 months ago

jaschau commented 3 months ago

Hi,

first of all, I want to say thanks for this great library. I have a use case that is essentially a lazy lookup: given a matrix A of shape P x Q and integer vectors x, y of lengths N and M, with values in [0, P) and [0, Q) respectively, I would like to compute the N x M matrix

C[i, j] = A[x[i], y[j]]

lazily, since N and M are large and materializing the full matrix would be costly memory-wise. I would like to sum C[i, j] with other LazyTensors before eventually performing reductions. I was wondering whether this could in principle be supported by the library, and whether you have any pointers on how to implement it. Thanks a lot for your help!
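For concreteness, here is the dense PyTorch version of what I mean (illustration only; this materialization is exactly what I want to avoid for large N and M):

import torch

P, Q, N, M = 3, 4, 5, 6
A = torch.rand(P, Q)
x = torch.randint(P, (N,))  # values in [0, P)
y = torch.randint(Q, (M,))  # values in [0, Q)
C = A[x[:, None], y[None, :]]  # N x M matrix with C[i, j] == A[x[i], y[j]]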

bcharlier commented 3 months ago

Hi @jaschau, is this snippet helpful?

import torch
from pykeops.torch import LazyTensor

device = "cuda" if torch.cuda.is_available() else "cpu"

M, N = 1000, 1001

# A standard LazyTensor
X = LazyTensor(torch.rand(M, 1, 1, device=device))
Y = LazyTensor(torch.rand(1, N, 1, device=device))
DX = (X - Y).square()

# An indices-based LazyTensor
I = LazyTensor(torch.arange(M, device=device).type(torch.float32).reshape(M, 1, 1))
J = LazyTensor(torch.arange(N, device=device).type(torch.float32).reshape(1, N, 1))
C_IJ = (I - J).abs()

# Reduce the LazyTensors
print((DX * C_IJ).sum(1))
jaschau commented 3 months ago

Hi @bcharlier,

thanks for your message! I don't think the snippet you suggested achieves what I am looking for.

Written in components, your snippet computes

$$v_i = \sum_{j=0}^{N} |X_i - Y_j|^2 \, |I_i - J_j|$$

What I am looking for is the possibility to compute something like

$$v_i = \sum_{j=0}^{N} |X_i - Y_j|^2 \, A_{I_i, J_j},$$

i.e., use the LazyTensors $I_i, J_j$ to index into another matrix A. Does that make sense?
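Concretely, a dense (non-lazy) reference for this formula would be, with small sizes (illustration only):

import torch

M, N, P, Q = 4, 5, 3, 3
X, Y = torch.rand(M), torch.rand(N)
A = torch.rand(P, Q)
I = torch.randint(P, (M,))
J = torch.randint(Q, (N,))
# v_i = sum_j (X_i - Y_j)^2 * A[I_i, J_j]
v = ((X[:, None] - Y[None, :]) ** 2 * A[I[:, None], J[None, :]]).sum(dim=1)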

joanglaunes commented 3 months ago

Hello @jaschau, I just added support for what you are asking for, in a git branch named Index: it adds a KeOps operation called Index that indexes a vector by another vector, and the corresponding indexing operator [] for LazyTensor. We cannot directly index matrices with the current implementation of KeOps, because only the last axis is used for the internal shape of the formula. This means you will have to flatten the matrix and convert the indices to linear indices, i.e. write A[I * ncol + J] instead of A[I, J]. You also need to convert the index vectors to float dtype, because integer-valued arrays are not allowed. Here is an example that computes a sum over $j$ of the $A_{I_i, J_j}$:

import torch
from pykeops.torch import LazyTensor

def fun_torch(A, I, J):
    # Dense reference: advanced indexing materializes the full M x N matrix.
    return A[I, J].sum(axis=1)

def fun_keops(A, I, J):
    ncol = A.shape[1]
    A = LazyTensor(A.flatten())  # flatten A into a vector parameter
    I = LazyTensor((I + 0.0)[..., None])  # cast indices to float, add a feature axis
    J = LazyTensor((J + 0.0)[..., None])
    K = A[I * ncol + J]  # linear indexing into the flattened A
    return K.sum(axis=1).flatten()

P, Q = 12, 5
M, N = 3000, 2000
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn((P, Q), requires_grad=True, device=device)
I = torch.randint(P, (M, 1), device=device)
J = torch.randint(Q, (1, N), device=device)

res_torch = fun_torch(A, I, J)
print(res_torch)

res_keops = fun_keops(A, I, J)
print(res_keops)

# testing gradients
loss_torch = (res_torch**2).sum()
print(torch.autograd.grad(loss_torch, [A]))

loss_keops = (res_keops**2).sum()
print(torch.autograd.grad(loss_keops, [A]))

Let us know if you can test this branch and see if it works as expected.

jaschau commented 3 months ago

Hi @joanglaunes, thanks so much! That looks amazing, and the API looks very clean. I could run the example you provided on the Index branch without any issues. Just wondering: how do you handle out-of-bounds access in the implementation? Do you silently clamp to the valid index region? I will also test this on my actual use case and compare it to my current implementation, which uses an expensive online computation instead of a lookup. I will report back if I encounter any issues.
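In case this is not handled internally yet, I guess a user-side guard would be as simple as clamping before building the LazyTensors (a sketch, assuming clamping is the behavior one wants):

import torch

P, Q = 12, 5
I = torch.randint(-2, P + 2, (4, 1))  # deliberately allows out-of-range values
J = torch.randint(-2, Q + 2, (1, 4))

# clamp into the valid index region before the lookup
I_safe = I.clamp(0, P - 1)
J_safe = J.clamp(0, Q - 1)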

BTW, is there some documentation on how to add support for new ops in the new Python-based KeOps code? I briefly looked at it before raising this issue, but I found it hard to parse without a high-level overview of the main ideas.

jaschau commented 3 months ago

I made a strange observation. I played around with the values of P and Q from your example. If I set P and Q to slightly larger values that should still fit into memory, I get weird CUDA errors from the fun_keops calls. For example, with P=100, Q=600, I obtain a CUDA_ERROR_OUT_OF_MEMORY error:

[KeOps] error: cuLaunchKernel( kernel, gridSize_x, gridSize_y, gridSize_z, blockSize_x, blockSize_y, blockSize_z, blockSize_x * dimY * sizeof(TYPE), NULL, kernel_params, 0) failed with error CUDA_ERROR_OUT_OF_MEMORY

With P=1000, Q=1000, I obtain a CUDA_ERROR_INVALID_VALUE error:

[KeOps] error: cuLaunchKernel( kernel, gridSize_x, gridSize_y, gridSize_z, blockSize_x, blockSize_y, blockSize_z, blockSize_x * dimY * sizeof(TYPE), NULL, kernel_params, 0) failed with error CUDA_ERROR_INVALID_VALUE

Do you have any insight into what might be happening here?

joanglaunes commented 3 months ago

Hello @jaschau, OK, so actually the dimensions of A are not meant to be large... In fact the matrix A is a parameter of the formula, so it is fully loaded into the local memory of every thread, which is extremely limited. When the size of the parameter vectors exceeds about 100 (so P*Q > 100 in your case), this memory starts to get saturated, and the CUDA kernel has to transfer data to and from the global GPU memory. Of course you can go beyond 100, but it gets much slower. In your case, with P*Q = 100 x 600 or 1000 x 1000, this goes well beyond anything we have ever tried, and I understand that CUDA throws memory errors. Sorry, I should have warned you in the first place; if your use case is in this range of sizes for the matrix A, the implementation I did cannot help you, I am afraid. Maybe you can tell us the typical sizes of A and of the indexing vectors? That may help us think about other ways to implement this.
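For a rough sense of scale (back-of-the-envelope, assuming float32): P*Q = 100 x 600 is 60,000 values, i.e. about 240 KB, and 1000 x 1000 is about 4 MB, whereas the on-chip storage available to a thread block on current GPUs is at most a few hundred KB. Both settings are therefore far beyond what the kernel can keep close to the threads.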

jaschau commented 3 months ago

Hi @joanglaunes, thanks for your message! That's good to know, and sorry for not being more precise in the first place. In fact, for my use case, A itself would be a fairly large matrix of size ~10k x 10k (which I can still fit into GPU memory). The index vectors would be roughly one order of magnitude larger, ~100k entries, so the corresponding N x M matrix would definitely not fit on the GPU, hence the hope for a pykeops solution.

Edit: for my use case, I can order the indices in such a way that I access A very locally, so it would be possible to implement this in a cache-friendly way on the CPU. I guess things might be different in the parallel CUDA programming paradigm, though, and in any case the local access pattern will not hold for arbitrary sets of indices I, J.

joanglaunes commented 3 months ago

The simplest thing to try would be to add a special mode for large "parameter" variables, in which these parameters are not loaded into local memory but kept in the main global memory. I don't know how much it will decrease performance, but it will work in all cases and is not very complicated to do. The other way, taking advantage of the fact that A is accessed locally, would probably be best for efficiency, but it is a lot more challenging.

jaschau commented 3 months ago

Going for the simplest solution sounds reasonable to me. Let me know if I can support this in any way!

joanglaunes commented 2 months ago

Hello @jaschau, here is some news. I have implemented the solution I was proposing, so the script below now works (you can try it in the Index branch). On a server with an A40 card it runs in 1.5 s; I don't know whether that is good or bad news for your project. The fact that the matrix A is kept in global GPU memory definitely slows the process down a lot...

Note also that I use the float64 data type here, which is necessary for addressing the matrix A once its number of elements exceeds 1e6. It should not be necessary for the data stored in A itself, but unfortunately KeOps currently allows only one data type for all tensors. The main remaining problem is that you still cannot compute the gradient of the operation for such values of P and Q, because we then need to adapt the way we handle the output of the KeOps operation, not only the input tensors. I have made some attempts at it, but it is not ready yet.

import torch
from pykeops.torch import LazyTensor
from time import time

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float64

def fun_keops(A, I, J):
    ncol = A.shape[1]
    A = LazyTensor(A.flatten())  # the flattened A becomes a (large) vector parameter
    I = LazyTensor(I.to(dtype)[..., None])  # float64 indices, with a feature axis
    J = LazyTensor(J.to(dtype)[..., None])
    K = A[I * ncol + J]  # linear indexing into the flattened A
    return K.sum(axis=1).flatten()

P, Q = 10000, 10000
M, N = 100000, 100000

A = torch.randn((P, Q), requires_grad=True, device=device, dtype=dtype)
I = torch.randint(P, (M, 1), device=device)
J = torch.randint(Q, (1, N), device=device)

start = time()
res_keops = fun_keops(A, I, J)
end = time()
print("time for keops:", end - start)