getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

LazyTensor dot product with itself #133

Open awav opened 3 years ago

awav commented 3 years ago

Hi everyone,

I have a slight issue with a matrix product evaluated on lazy tensors. Let's say K is an [N, M] matrix, the result of a distance function on inputs x and y with shapes [N, 1] and [M, 1] respectively, where N >> M, e.g. N >= 1e6 and M ~= 1000.

...
xi = LazyTensor(x[:, None, :])
yj = LazyTensor(y[None, :, :])
K = ((xi - yj) ** 2).sum(dim=2)
KK = K @ K.T

As far as I understand, there is a way to compute this with tensordot, but that requires flat matrices; there is no reshape method for a lazy tensor, and @ gives an error (which more or less says that you cannot reshape this tensor):

AttributeError: 'LazyTensor' object has no attribute 'view'
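
For completeness, here is a self-contained version of the snippet above (with random placeholder data and sizes for x and y) that reproduces the error:

import torch
from pykeops.torch import LazyTensor

N, M = 10**5, 1000                   # placeholder sizes
x = torch.randn(N, 1)
y = torch.randn(M, 1)

xi = LazyTensor(x[:, None, :])       # (N, 1, 1)
yj = LazyTensor(y[None, :, :])       # (1, M, 1)
K = ((xi - yj) ** 2).sum(dim=2)      # (N, M) symbolic squared distances

KK = K @ K.T                         # raises the AttributeError above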

Is there a way to compute the final matrix K @ K.T using solely symbolic representations, without ever materializing the huge K? I would appreciate any help! Thanks!

jeanfeydy commented 3 years ago

Hi @awav,

This is a very good question! You are running into one of the limitations of the inner “KeOps++” engine, which currently only handles two symbolic axes at a time. As of today, there are two main workarounds to this problem:

  1. Use a “LinearOperator-like” abstraction, as in #124. If you only intend to use “KK” for matrix-vector or matrix-matrix products, this is actually a fine solution. You could simply define a linear Python function with:

    def KK(x):
        return K @ (K.T @ x)

    And use it in your favorite linear solver (see the conjugate gradient sketch after this list). If I remember correctly, this is more or less what is done in the KeOps, GPyTorch+KeOps and Falkon conjugate gradient solvers. Note that the parenthesising here is important: we do not create a symbolic LazyTensor (K @ K.T) on which we could apply any type of mathematical operation, as this would require a significant, non-trivial improvement of our CUDA engine. Instead, we simply apply two KeOps operations one after the other. Note that #35 addresses this use case in detail: @joanglaunes implemented a differentiable .sqsolve() operator to answer @adam-coogan's question.

  2. If M is small enough (less than 100 ideally, but 1,000 should still be OK with @joanglaunes's recent improvements to our engine), you could use a two-step implementation. First, compute K as an (N, M) dense array using standard PyTorch computations. Then, use KeOps to create a fully-fledged symbolic LazyTensor KK with (a short usage example follows right after the snippet):

    from pykeops.torch import LazyTensor

    K_i = LazyTensor(K.view(N, 1, M))
    K_j = LazyTensor(K.view(1, N, M))
    KK = (K_i | K_j)  # (N, N, 1) LazyTensor of dot products between the rows of K
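
For reference, here is a rough, untested sketch (with placeholder sizes and a placeholder vector v) of how such a symbolic KK can then be reduced, e.g. to perform the matrix-vector product (K @ K.T) @ v without ever materializing the (N, N) matrix:

import torch
from pykeops.torch import LazyTensor

N, M = 10**5, 100                     # placeholder sizes
K = torch.randn(N, M)                 # dense (N, M) kernel matrix, computed with plain PyTorch

K_i = LazyTensor(K.view(N, 1, M))     # (N, 1, M)
K_j = LazyTensor(K.view(1, N, M))     # (1, N, M)
KK = (K_i | K_j)                      # (N, N, 1) symbolic matrix of row-row dot products

v = torch.randn(N, 1)
v_j = LazyTensor(v.view(1, N, 1))     # (1, N, 1)
KKv = (KK * v_j).sum(dim=1)           # (N, 1) dense result, equal to (K @ K.T) @ v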


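Going back to option 1, here is a rough, untested sketch of how the matvec closure could be plugged into a plain conjugate gradient loop. The sizes, the small ridge term alpha (added so that the operator is positive definite) and the hand-written solver are placeholders, just for illustration:

import torch
from pykeops.torch import LazyTensor

N, M = 10**5, 1000                     # placeholder sizes
x = torch.randn(N, 1)
y = torch.randn(M, 1)

xi = LazyTensor(x[:, None, :])         # (N, 1, 1)
yj = LazyTensor(y[None, :, :])         # (1, M, 1)
K = ((xi - yj) ** 2).sum(dim=2)        # (N, M) symbolic kernel matrix

alpha = 1e-3                           # small ridge term, assumed here

def KK_matvec(v):
    # (K @ K.T + alpha * Id) @ v, computed as two successive KeOps reductions:
    # the (N, N) matrix K @ K.T is never materialized.
    return K @ (K.T @ v) + alpha * v

def conjugate_gradient(matvec, b, n_iter=50, tol=1e-6):
    # Textbook conjugate gradient for the symmetric positive definite operator `matvec`.
    z = torch.zeros_like(b)
    r = b.clone()
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(n_iter):
        Ap = matvec(p)
        step = rs / (p * Ap).sum()
        z = z + step * p
        r = r - step * Ap
        rs_new = (r * r).sum()
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z

b = torch.randn(N, 1)
z = conjugate_gradient(KK_matvec, b)   # approximate solution of (K @ K.T + alpha * Id) z = b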

What do you think?

As a side note, would you and your colleagues be interested in a “Zoom” meeting at some point? The “British” Covid variant is keeping me away from London at the moment, but I’d be very happy to know more about your problems and see if we can help in any way :-)

Best regards,
Jean

awav commented 3 years ago

Hi, @jeanfeydy! Thanks for the quick response. Unfortunately, I need to materialize the result of K @ K.T and perform a Cholesky decomposition afterwards. M is small, but in the range of (100, 1e4). Is there a way to slice lazy tensors? :)

As a side note, would you and your colleagues be interested in a “Zoom” meeting at some point?

Yes, it would be great!

jeanfeydy commented 3 years ago

Hi @awav,

I see! There is no clean API to slice LazyTensors just yet, but workarounds shouldn't be too hard to implement: the operation is equivalent to slicing the data arrays themselves prior to the reduction. We'll probably implement the necessary syntactic sugar in the LazyTensor module once @joanglaunes is done with his refactoring of our compilation engine.
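
To make this concrete, here is a rough sketch of what I mean, re-using the variables from your first snippet: to work on a block of rows of K (and of any reduction of K), you can simply slice x before wrapping it into a LazyTensor. The sizes and the block size B below are arbitrary placeholders:

import torch
from pykeops.torch import LazyTensor

N, M, B = 10**5, 1000, 256              # placeholder sizes, B = block size
x = torch.randn(N, 1)
y = torch.randn(M, 1)

x_block = x[:B]                         # slice the data array first...
xi = LazyTensor(x_block[:, None, :])    # (B, 1, 1)
yj = LazyTensor(y[None, :, :])          # (1, M, 1)
K_block = ((xi - yj) ** 2).sum(dim=2)   # ...to get a (B, M) symbolic block of K
row_sums = K_block.sum(dim=1)           # (B, 1) dense result of a reduction on that block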

In your case though, I don't know if this would be truly useful. If you have to materialize K @ K.T to invert it explicitly, KeOps probably won't be of any use for this step of your computation. Most of our speed-ups come from the fact that we never have to move an "N-by-N" memory buffer from the compute units to the GPU memory, which would not be applicable here.

If you haven't seen it already, I believe that you could be interested in @Giodiro's Falkon repository. It is an impressive implementation of the Nyström method that combines efficient matrix-vector multiplications (either cuBLAS or KeOps) for the rectangular "1G-by-100k" products with a scalable GPU Cholesky solver for the inversion of the "100k-by-100k" square matrices.

What do you think?

P.S.: I'll send you an e-mail then!