getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Block gradient kernel construction and permutation strategy + limits of math operations with lazytensors #261

Open JGittles opened 2 years ago

JGittles commented 2 years ago

Hi @jeanfeydy,

I've been working on constructing a block gradient RBF kernel, much in the same vein as #205, but trying to avoid involving the scipy.sparse library so as to keep all operations on the GPU. So far I've achieved the block structure through quadrant masks (somewhat in the spirit of #156) within my kernel matrix; the main drawback seems to be the need to tile my input tensors and construct a [4*n, 4] mask as a dense tensor.
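For concreteness, here is a stripped-down sketch of what I mean by quadrant masks (toy sizes and hypothetical names, not my actual code; the real kernel also includes the gradient cross-terms, this only shows how one quadrant gets selected):

import torch
from pykeops.torch import LazyTensor

n, d = 1000, 3
x = torch.randn(n, d)

# Tile the inputs: row blocks 0..3 stand for the function value and the 3 gradient components.
x_tiled = x.repeat(4, 1)                          # (4*n, d)
block = torch.arange(4.0).repeat_interleave(n)    # block label per row

x_i = LazyTensor(x_tiled[:, None, :])             # (4*n, 1, d)
x_j = LazyTensor(x_tiled[None, :, :])             # (1, 4*n, d)
b_i = LazyTensor(block[:, None, None])            # (4*n, 1, 1)
b_j = LazyTensor(block[None, :, None])            # (1, 4*n, 1)

K_rbf = (-((x_i - x_j) ** 2).sum(-1) / 2).exp()   # base RBF term, (4*n, 4*n) symbolic

# Example quadrant mask: 1 on the value-vs-value block, 0 elsewhere.
mask_00 = (1 - b_i.abs().sign()) * (1 - b_j.abs().sign())
K_masked = K_rbf * mask_00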

I am trying to implement my kernel in GPyTorch as a drop-in replacement for the RBF-with-gradients kernel (https://docs.gpytorch.ai/en/latest/_modules/gpytorch/kernels/rbf_kernel_grad.html), and it seems that my kernel needs to undergo a perfect shuffle permutation to match GPyTorch's multitask ordering (I am opening a ticket on their repo to see whether this is truly the only way forward).
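To illustrate the ordering issue, here is a toy version of that permutation (hypothetical sizes and names, K_block is a stand-in): as far as I can tell, GPyTorch's gradient kernel interleaves the value and gradient entries per data point, whereas my block construction groups them per task, and the two layouts differ by a perfect shuffle of the indices:

import torch

n, t = 5, 4    # n data points, t = 1 value + 3 gradient "tasks" (toy sizes)

# Block layout [f(x_1..x_n), d1 f(x_1..x_n), ...]  ->  interleaved layout
# [f(x_1), d1 f(x_1), d2 f(x_1), d3 f(x_1), f(x_2), ...]
perm = torch.arange(n * t).reshape(t, n).t().reshape(-1)

# Dense re-indexing, for illustration only (this is exactly what becomes infeasible at scale):
# K_interleaved = K_block[perm][:, perm]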

As such, I've hit a wall with how to proceed, since (to the best of my knowledge) I can't easily perform indexing operations on LazyTensors. So far, I've identified a few ways forward and would appreciate feedback on which path is best/possible.

  1. Build a permutation matrix and apply it to my kernel with a matrix-matrix tensordot operation (a toy dense version is sketched after this list). This is straightforward in the small-number regime but quickly becomes infeasible, as I would be storing a [4*n, 4*n] tensor in memory. Is it possible to construct a LazyTensor from a sparse torch matrix in COO format? If not, would it be possible to use the Zero() function to create a vector of zeros, insert values at index positions, and vertically concatenate several of these together? I have had no luck so far figuring out how to use the built-in Zero() function, and it seems to be missing from the API reference.
  2. Direct index operations on my kernel matrix. It seems that indexing is only possible on the last dimension of a LazyTensor; it also seems that my reduced output kernel matrix of shape [4*n, 4*n] has a third "dummy" dimension of size 1 or 0, so I cannot take slices of it. Can I recast this as a "true" 2D LazyTensor, or do a reduction to remove that last dimension? If so, I could theoretically split it into smaller vectors, permute, and re-concatenate.
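The toy dense version of option 1, just to make the memory issue explicit (stand-in tensors, not my actual kernel):

import torch

n = 100
perm = torch.randperm(4 * n)                  # stand-in for the perfect-shuffle permutation
P = torch.eye(4 * n)[perm]                    # dense [4*n, 4*n] permutation matrix
K = torch.randn(4 * n, 4 * n)                 # stand-in for the reduced block kernel
K_shuffled = P @ K @ P.T                      # fine at this size, hopeless once n is large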

Thanks in advance for your help! If this gets successfully stood up, I'd love to create a PR and hopefully expand the functionality of this great library with GPyTorch.

antoinediez commented 1 year ago

Dear @JGittles,

I think I recently stumbled on a problem similar to the one you raise in your first point: essentially, I have a sparse graph stored in COO format and I would like to turn its adjacency matrix into a LazyTensor (the true dense tensor is too big to fit in memory). I only have a partial, not completely satisfactory solution, but I share it here in case it helps you.

The idea is to change the COO format into a different one which does not take more memory but allows us to easily build a "Lazy adjacency matrix". Let the dimension of the dense tensor be (M,N) and let's assume that on each row, there are no more than a fixed number Nmax << N of nonzero coefficients. Then a sparse tensor in COO format can be converted into two tensors of size (M,Nmax).

For instance let's take the sparse tensor

import torch

M = 5
N = 7

# COO indices (row, column) and values of the nonzero entries
i = [[0, 1, 1],
     [2, 0, 2]]
v = [3, 4, 5]

s = torch.sparse_coo_tensor(i, v, (M, N))

In dense form, this is the (5,7) Tensor:

tensor([[0, 0, 3, 0, 0, 0, 0],
        [4, 0, 5, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])

The conversion can be done in time that is essentially linear in the number of nonzero elements, for instance as follows (there is probably a simpler way to do it):

ind = s.coalesce().indices().int() 
val = s.coalesce().values()

nse = len(val)    # Number of nonzero elements

max_nb = 2    # Maximal number of nonzero elements (known or to compute)

neigh = torch.zeros((M,max_nb))    # The neighbour tensor
nb_neigh = torch.zeros(M).int()    # The number of nonzero elements per row
values = torch.zeros((M,max_nb))    # The tensor of the values

for k in range(nse):
    neigh[ind[0,k],nb_neigh[ind[0,k]]] = ind[1,k]
    values[ind[0,k],nb_neigh[ind[0,k]]] = val[k]
    nb_neigh[ind[0,k]] += 1

print(neigh)
print(values)

which returns the two tensors equivalent to the COO format:

tensor([[2., 0.],
        [0., 2.],
        [0., 0.],
        [0., 0.],
        [0., 0.]])
tensor([[3., 0.],
        [4., 5.],
        [0., 0.],
        [0., 0.],
        [0., 0.]])
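As an aside, the Python loop above can probably be vectorized; here is a possible variant (an assumption on my part, relying on the fact that coalesce() sorts the indices by row):

row = ind[0].long()
col = ind[1].long()

nb_neigh = torch.bincount(row, minlength=M)      # nonzero elements per row
max_nb = int(nb_neigh.max())                     # Nmax
offsets = torch.cumsum(nb_neigh, 0) - nb_neigh   # index of the first nonzero of each row
slot = torch.arange(nse) - offsets[row]          # position of each nonzero within its row

neigh = torch.zeros((M, max_nb))
values = torch.zeros((M, max_nb))
neigh[row, slot] = col.to(neigh.dtype)
values[row, slot] = val.to(values.dtype)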

Finally, the desired LazyTensor can be constructed as follows.

  1. Reshape and convert the tensors into LazyTensors:

from pykeops.torch import LazyTensor

neigh_lazy = LazyTensor(neigh[:, None, :])      # (M, 1, Nmax)
values_lazy = LazyTensor(values[:, None, :])    # (M, 1, Nmax)

  2. Create the LazyTensor [0, 1, 2, ..., N-1]:

arange = LazyTensor(torch.arange(0, N, 1, device=neigh.device, dtype=neigh.dtype)[None, :, None])

  3. The following (M, N, Nmax) LazyTensor is such that the (i, :, k) "vector" is 0 everywhere except at coordinate neigh[i, k], where it is equal to 1:

one_hot_like = 1 - (neigh_lazy - arange).sign().abs()

This is essentially what the function one_hot should do, but I somehow did not manage to make it work (see https://github.com/getkeops/keops/issues/263).

  4. Finally, it remains to multiply by the values and sum over the last dimension:

lazy_adjacency = (one_hot_like * values_lazy).sum(-1)

On the previous example, we can check that it works for instance by computing the third column of the dense tensor:

print(lazy_adjacency)

ei = torch.zeros((N,1))
ei[2] = 1

print(lazy_adjacency @ ei)

returns

KeOps LazyTensor
    formula: Sum(((IntCst(1) - Abs(Sign((Var(0,2,0) - Var(1,1,1))))) * Var(2,2,0)))
    shape: (5, 7)
tensor([[3.],
        [5.],
        [0.],
        [0.],
        [0.]])

The problem with this solution is that it may become quite slow if the number of nonzero elements per row is too large (say, above 100). This is not exactly surprising, because KeOps tends to be not so efficient when the dimension is large (am I right @jeanfeydy?). So I guess that a better solution would be to implement a function which does not require broadcasting all the tensors into this artificial Nmax dimension, but I did not manage to do it. Any idea @jeanfeydy? Actually, I do not even know if KeOps is the best tool to do reductions with "sparse lazy kernels"... It would probably require a clever block-sparse reduction method (i.e. updating the ranges argument) to get really good performance, as sketched below.
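For reference, the kind of block-sparse machinery I have in mind is the ranges mechanism from the KeOps block-sparse reduction tutorial; a condensed sketch follows (not adapted to the adjacency-matrix setting above, the sizes and bin width are arbitrary, just to show the ingredients):

import torch
from pykeops.torch import LazyTensor
from pykeops.torch.cluster import grid_cluster, cluster_ranges_centroids, sort_clusters, from_matrix

x = torch.randn(10000, 3)
eps = 0.5                                                 # size of the cubic bins

lab = grid_cluster(x, eps)                                # one cluster label per point
ranges, centroids, _ = cluster_ranges_centroids(x, lab)
x, lab = sort_clusters(x, lab)                            # make clusters contiguous in memory

# Keep only the pairs of clusters whose centroids are close enough:
D2 = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
keep = D2 < (4 * eps) ** 2                                # boolean cluster-interaction matrix

x_i = LazyTensor(x[:, None, :])
x_j = LazyTensor(x[None, :, :])
K = (-((x_i - x_j) ** 2).sum(-1)).exp()
K.ranges = from_matrix(ranges, ranges, keep)              # restrict the reduction to the kept blocks
out = K @ torch.ones(len(x), 1)                           # block-sparse kernel-vector product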

Hope it helps you a bit! (I am not sure I perfectly understood the initial problem, so sorry if my answer is completely off-topic; in that case I will just delete it and open an independent issue.)

Antoine