getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License
1.05k stars 64 forks source link

Extracting a band diagonal with KeOps? #24

Open ibeltagy opened 5 years ago

ibeltagy commented 5 years ago

I want to do matrix multiplication of 2D tensors where I only care about a few diagonals of the resulting matrix, and I want to run this on GPU on PyTorch. Is this something that can be done with keops ?

For illustration, here's the numpy code, but what I really need is PyTorch/GPU

import numpy as np
M = 16000  # huge, can't do the O(n^2) operations
N = 64
c = 100  # number of diagonals, a lot smaller than M
t1 = np.random.rand(M, N)  
t2 = np.random.rand(M, N) 
r = np.zeros((M, d))

for i in range(M):
    for j in range(-c, c):  
        r[i][j] = np.dot(t1[i], t2[i + j])  # `(i + j)` should be `min(max(i + j, 0), M - 1)`
                                            # to take care of boundry condition, but let's skip that for now

PS: I am running into compilation errors with the tutorial examples, but will ask about that later.

ibeltagy commented 5 years ago

@bcharlier, @jeanfeydy,
Fixed the compilation errors, and I can run the tutorial examples fine.

Reading more about the syntax here https://www.kernel-operations.io/keops/api/math-operations.html, looks like the operation I need doesn't fit the format you expect. Is there a way around that?

Also, does keops support fp16?

Thanks

jeanfeydy commented 5 years ago

Hi @ibeltagy ,

Happy to see that you could install everything! If I understand your problem correctly, you are trying to compute efficiently a diagonal band (of width 2*c+1) of the Gram matrix t1 @ t2.T of your dataset. You could probably write this using a very ugly KeOps formula, with auxiliary integer variables "i" and "j". But as far as I can tell, a vanilla PyTorch code along the lines of:

for j in range(-c, c):
    r[:, j] = (t1 * t2[ j : M+j, :]).sum(dim=1)

would be a much simpler option, and reasonably efficient. Of course, you would need zero-pad t2 to take into account the boundary conditions. What do you think?

As for your second question: today, KeOps only supports fp32 ("float") and fp64 ("double") precisions. Adding support for other float numbers shouldn't be too hard, but we simply never thought about it. What do you think, @bcharlier ?

Best regards,

Jean

ibeltagy commented 5 years ago

This PyTorch solution is easy but it is super slow because of large number of slicing operations (2c+1 of them, and c is around 100), and it doesn't seem that pytoch is parallelizing them well.

ibeltagy commented 5 years ago

Another way of doing the same thing is using as_strided followed by tensor dot product. Something like this:

t1 = torch.randn(15901, 1, 64, device=cuda_device, requires_grad=True) 
t2 = torch.randn(16000, 64, device=cuda_device, requires_grad=True)
strides = t2.stride()
t2_strided = t2.as_strided(size=(15901, 100, 64), stride=(strides[0], strides[0], strides[1]))
t1.expand_as(t2_strided)

diagonals = (t1 * t2_strided).sum(dim=-1)  # memory intensive because PyTorch unnecessarily calls `t2_strided.contiguous()`

As far as I can tell, keops would be a good solution to avoid this excessive memory allocation, but do you know if it would be with a matrix that is the output of as_strided with some weird strides to get these sliding windows?

ibeltagy commented 5 years ago

You could probably write this using a very ugly KeOps formula, with auxiliary integer variables "i" and "j".

But would it be fast ?

Also about this,

for j in range(-c, c):
    r[:, j] = (t1 * t2[ j : M+j, :]).sum(dim=1)

I saw one example in your tutorials that has loops. Can this solution be implemented in keops? I guess it will result in a very long formula (with lots of sub-formulas, one for each value of j), but would it work?

jeanfeydy commented 5 years ago

Hi @ibeltagy ,

I've been thinking about it during the night: thanks for the interesting problem! I'm pretty sure that we'll be able to answer your question efficiently: I just need to implement a new one_hot(...) operator for KeOps Genred + LazyTensors. Then, a code along the lines of:

import torch
from pykeops.torch import LazyTensor

M, D, c = 100000, 64, 100
t1 = torch.randn(M, D).cuda()
t2 = torch.randn(M, D).cuda()
indices = torch.arange(M).float().cuda()

i = LazyTensor( indices[:,None,None] )
j = LazyTensor( indices[None,:,None] )
x = LazyTensor( t1[:,None,:] )
y = LazyTensor( t2[None,:,:] )

K_ij = (j - i + c).one_hot(2*c+1) * (x | y)
K_ij.ranges = blablabla  # block-sparsity pattern around the band diagonal

r = K_ij.sum(dim=1)

will probably be good enough, with streamlined computations and relatively few wasted operations. I'll work on it this morning, and keep you updated!

Best regards,

Jean

ibeltagy commented 5 years ago

@jeanfeydy, that's very cool. Thanks for helping.

Let me add two more requirements to make sure that your solution is going to address my use case.

1- A way to handle the boundary conditions. For example, tvm has a tvm.if_then_else (see code here https://docs.tvm.ai/tutorials/optimize/opt_conv_cuda.html) to compute the value or return a constant.

2- The real use case is that the diagonals I need are not necessarily a contagious band around the main diagonal. Would it be possible to provide a list of diagonal indices that I care about? the list is fixed and known during compilation time. This is probably your K_ij.ranges = blablabla but wanted to double check

Thanks again for doing this.

jeanfeydy commented 5 years ago

Hi @ibeltagy ,

I think that 26b4829727751acbee2378e3a48e247523e72e9a does the trick! I'm pretty busy right now and could not fully test everything, but I think that the code below answers your question:

import torch
from pykeops.torch import LazyTensor
use_cuda = torch.cuda.is_available()
load = lambda x : x.cuda() if use_cuda else x

M, D, c = 100000, 64, 100
t1 = load(torch.randn(M, D))
t2 = load(torch.randn(M, D))
indices = load(torch.arange(M).float())

i = LazyTensor( indices[:,None,None] )
j = LazyTensor( indices[None,:,None] )
x = LazyTensor( t1[:,None,:] )
y = LazyTensor( t2[None,:,:] )

K_ij = (j - i + c).one_hot(2*c+1) * (x | y)  # "Kernel" matrix

#r = K_ij.sum(dim=1)  # Bruteforce, quadratic computation

if True:  # Use a block-sparsity pattern around the band diagonal
    from math import ceil
    Mc = ceil(M / c)  # Number of blocks of size (c,c)
    I = load(torch.arange(Mc).int())
    # The lines below describe a block-sparse pattern of width
    # 3 around the diagonal, with blocks of size (c,c):

    # 1. "i" indices are grouped in clusters of size c:
    ranges_i = torch.stack( (I * c, (I+1) * c) ).clamp(0, M).t().contiguous() # [[0, c], [c, 2*c], ...]
    # 2. each "i" cluster is associated to one range over "j": 
    slices_i = load(torch.arange(1, Mc + 1).int())  # [1, 2, ..., Mc]
    # 3. the reduction ranges are given by [[0, 2*c], [0, 3*c], [c, 4*c], ...]:
    ranges_j = torch.stack( ((I-1) * c, (I+2) * c) ).clamp(0, M).t().contiguous()

    # N.B.: here, because the matrix is square, the pattern is symmetric for the transpose.
    K_ij.ranges = (ranges_i, slices_i, ranges_j, ranges_i, slices_i, ranges_j)

r = K_ij.sum(dim=1)  # Actual computation

# Test:
I, J = 34, 58
print(r[I, c + J - I])
print(torch.dot( t1[I], t2[J] ))

You can try it on Google Colab with the following imports in your first and second cells:

# Install the dependencies for sphinx and KeOps
!pip install numpy GPUtil cmake ninja > install.log
!pip install sphinx-gallery recommonmark sphinxcontrib-httpdomain sphinx_rtd_theme  >> install.log

# Download KeOps... And don't forget to update pybind11.
!git clone --recursive https://github.com/getkeops/keops.git keops/  >> install.log
# Put KeOps in the current environment
import sys
sys.path.append('/content/keops/')

As far as I can tell, this runs pretty fast. What do you think? As for your second question, it depends on the diagonals that you are looking for. Are they packed in collections of contiguous blocks (good), or are they completely random (ouch)? GPUs are heavily biased towards contiguous memory accesses, so even if you craft a nice formula to replace my K_ij = (j - i + c).one_hot(2*c+1) * (x | y): performances will mostly depend on whether or not computations can fit in a nice block-sparse reduction scheme.

Best, Jean

jeanfeydy commented 5 years ago

Oooops, just remarked a small bug in my C++ implementation: (int) x + 0.5 != round(x) if x is negative! I fix this quickly, and will edit the message + script above.

ibeltagy commented 5 years ago

Very cool. I will try this today and get back to you.

As for the diagonals, they are evenly spaced and the spacing is an argument. The example above is with spacing=0

jeanfeydy commented 5 years ago

I see! I have updated the instructions in the post above: everything should be fine now. As for your general problem: it is definitely doable, but performances will take a hit as spacing increases. I may have to implement a few arithmetic operations to make it work: I'll see this over the week-end :-)

ibeltagy commented 5 years ago

I ran the code above and it does work. I will test the gradients and do some profiling for the performance and get back to you. Thanks a lot for doing this. I really appreciate your help.

ibeltagy commented 5 years ago

I can the code vs the naive pytorch solution mentioned earlier, and it seems that keops is 10x slower, when I was expecting it to be 10x faster . The code and time estimates are below. Any suggestions what could be slowing it down? Thanks

from math import ceil
from pykeops.torch import LazyTensor
use_cuda = torch.cuda.is_available()
load = lambda x : x.cuda() if use_cuda else x
import time

# this function is called many times
def compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j):
  i = LazyTensor( indices[:,None,None] )
  j = LazyTensor( indices[None,:,None] )
  x = LazyTensor( t1[:,None,:] )
  y = LazyTensor( t2[None,:,:] )
  K_ij = (j - i + c).one_hot(2 * c + 1) * (x | y)  # "Kernel" matrix

  # N.B.: here, because the matrix is square, the pattern is symmetric for the transpose.
  K_ij.ranges = (ranges_i, slices_i, ranges_j, ranges_i, slices_i, ranges_j)
  r = K_ij.sum(dim=1)  # Actual computation
  return r

def compute_mm_diagonals_pytorch(t1, t2, c):
    seq_len = t1.size()[0]
    diagonals_list = []
    for i, d in enumerate(range(-c, c+1)):
        if d >= 0:
            slice_1 = t1.narrow(dim=0, start=0, length=seq_len - d)
            slice_2 = t2.narrow(dim=0, start=d, length=seq_len - d)
            pad_left = 0
            pad_right = d
        else:
            slice_1 = t1.narrow(dim=0, start=-d, length=seq_len + d)
            slice_2 = t2.narrow(dim=0, start=0, length=seq_len + d)
            pad_left = -d
            pad_right = 0
        diagonals_list.append(torch.nn.functional.pad((slice_1 * slice_2).sum(-1), pad=(pad_left, pad_right), value=0))
    r = torch.cat(diagonals_list, dim=-1).view(c * 2 + 1, seq_len).transpose(dim0=0, dim1=1)
    return r

# building the indices and slices happens only once
M, D, c = 16000, 64, 128
indices = load(torch.arange(M).float())
Mc = ceil(M / c)  # Number of blocks of size (c,c)
I = load(torch.arange(Mc).int())
# The lines below describe a block-sparse pattern of width
# 3 around the diagonal, with blocks of size (c,c):

# 1. "i" indices are grouped in clusters of size c:
ranges_i = torch.stack( (I * c, (I+1) * c) ).clamp(0, M).t().contiguous() # [[0, c], [c, 2*c], ...]
# 2. each "i" cluster is associated to one range over "j": 
slices_i = load(torch.arange(1, Mc + 1).int())  # [1, 2, ..., Mc]
# 3. the reduction ranges are given by [[0, 2*c], [0, 3*c], [c, 4*c], ...]:
ranges_j = torch.stack( ((I-1) * c, (I+2) * c) ).clamp(0, M).t().contiguous()

# call it once to compile 
t1 = load(torch.randn(M, D))
t2 = load(torch.randn(M, D))
r = compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j)

# call many times and measure time
time1 = 0
time2 = 0
for i in range(100):
  t1 = load(torch.randn(M, D))
  t2 = load(torch.randn(M, D))
  torch.cuda.synchronize()
  start = time.time()
  r = compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j)
  torch.cuda.synchronize()
  time1 += time.time() - start

  start = time.time()
  r2 = compute_mm_diagonals_pytorch(t1, t2, c)
  torch.cuda.synchronize()
  time2 += time.time() - start

  if not torch.allclose(r, r2, atol=1e-05):
    print('not equal')
print(f'Time keops: {time1} s')
print(f'Time pytorch: {time2} s')
Time keops: 18.393823623657227 s
Time pytorch: 1.8735544681549072 s
jeanfeydy commented 5 years ago

Hi @ibeltagy ,

Indeed! Running some benchmarks (c = 30, c = 100, K_ij = (x | y), etc.), I'm pretty sure that these low performances are due to my heavy-handed one-hot encoding. That was a useful operation to implement anyway, but in this situation, performing a product + summation on a large vector of size 257 at every step is just way too inefficient. Fortunately, reaching nearly-optimal performances shouldn't be too hard: I'll just have to implement a new scattered_sum reduction. Writing the C++ won't be difficult, but I'd rather think a little bit about a nice LazyTensor user-interface before adding it to master. I'll try to start implementing it tomorrow!

Best regards, Jean

ibeltagy commented 5 years ago

that would be cool. Thanks

jeanfeydy commented 5 years ago

Hi @ibeltagy , I implemented a prototype in 40217c1133cddceac5cd537f20dfc68ba6a639e7 (branch scattered_sum): the new syntax looks like:

def compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j):
    i = LazyTensor( indices[:,None,None] )
    j = LazyTensor( indices[None,:,None] )
    x = LazyTensor( t1[:,None,:] )
    y = LazyTensor( t2[None,:,:] )
    K_ij = (x | y)
    # N.B.: here, because the matrix is square, the pattern is symmetric for the transpose.
    K_ij.ranges = (ranges_i, slices_i, ranges_j, ranges_i, slices_i, ranges_j)

    r = K_ij.scattered_sum(j - i + c, 2*c+1, dim=1)  # Actual computation
    return r

Performances vary with M, D and c, but in your use-case (M, D, c = 16000, 64, 128), I get a x6 speed-up vs PyTorch on Google Colab:

Time keops: 0.7824671268463135 s
Time pytorch: 4.656574964523315 s

Now, I "just" have to document all this, implement the gradient and think a bit about the spacing parameter :-)

Best regards, Jean

jeanfeydy commented 5 years ago

Gradient is implemented now (9c61d633d9de588862f38ab90c22f70af4357a60): just have to document everything properly, add a unit test and implement the spacing!

bcharlier commented 4 years ago

The problem seems to be solved. Feel free to re-open the issue if there is any question regarding band diagonal extraction.

awav commented 3 years ago

Hi, @bcharlier and @jeanfeydy. Are you planning to merge _scatteredsum branch into the master and add it to the next release?

jeanfeydy commented 3 years ago

Hi @awav ,

Indeed! The main cause for this delay has been the CUDA 11 release this summer, which has introduced a significant regression on compilation times for KeOps formulas and made our continuous integration pipeline really impractical to run (from ~1 hour to ~5 hours per commit at some point in September...). In September-December, we have prioritized this issue and the release of our NeurIPS paper over the integration of new operations. Fortunately, the conference is now over and @bcharlier and @joanglaunes have made a lot of progress on compilation times over the last few weeks. Once their two_stage_compile branch is integrated in master (with a x4-5 speed-up on compile times), I'll make sure to write unit tests for scattered_sum and add it to master :-)

Out of curiosity, could you maybe tell us a little bit more about your intended use case? I understand that you are mostly working with TensorFlow: if you believe that writing a KeOps binder for TF is do-able and worth doing, we'd be glad to get your opinion on this!

Best regards, Jean

fschlatt commented 1 year ago

Hey @jeanfeydy ,

I'm trying to get the same diagonal band operation as @ibeltagy . @ibeltagy has in the meantime found a solution using plain pytorch by chunking matrices. This however uses more 2x more memory than necessary and requires padding the matrices to the window size. I've now tried to get the scattered_sum branch merged into the main branch. The code base has changed considerably in the past 2 years though :smile:

Would you recommend using the old branch or have the changes brought considerable improvements? If so, what would the effort be for integrating the scattered sum reduction in the new codebase? Could you give me pointers on where to start?

Best regards

Ferdinand

fschlatt commented 1 year ago

I can the code vs the naive pytorch solution mentioned earlier, and it seems that keops is 10x slower, when I was expecting it to be 10x faster . The code and time estimates are below. Any suggestions what could be slowing it down? Thanks

from math import ceil
from pykeops.torch import LazyTensor
use_cuda = torch.cuda.is_available()
load = lambda x : x.cuda() if use_cuda else x
import time

# this function is called many times
def compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j):
  i = LazyTensor( indices[:,None,None] )
  j = LazyTensor( indices[None,:,None] )
  x = LazyTensor( t1[:,None,:] )
  y = LazyTensor( t2[None,:,:] )
  K_ij = (j - i + c).one_hot(2 * c + 1) * (x | y)  # "Kernel" matrix

  # N.B.: here, because the matrix is square, the pattern is symmetric for the transpose.
  K_ij.ranges = (ranges_i, slices_i, ranges_j, ranges_i, slices_i, ranges_j)
  r = K_ij.sum(dim=1)  # Actual computation
  return r

def compute_mm_diagonals_pytorch(t1, t2, c):
    seq_len = t1.size()[0]
    diagonals_list = []
    for i, d in enumerate(range(-c, c+1)):
        if d >= 0:
            slice_1 = t1.narrow(dim=0, start=0, length=seq_len - d)
            slice_2 = t2.narrow(dim=0, start=d, length=seq_len - d)
            pad_left = 0
            pad_right = d
        else:
            slice_1 = t1.narrow(dim=0, start=-d, length=seq_len + d)
            slice_2 = t2.narrow(dim=0, start=0, length=seq_len + d)
            pad_left = -d
            pad_right = 0
        diagonals_list.append(torch.nn.functional.pad((slice_1 * slice_2).sum(-1), pad=(pad_left, pad_right), value=0))
    r = torch.cat(diagonals_list, dim=-1).view(c * 2 + 1, seq_len).transpose(dim0=0, dim1=1)
    return r

# building the indices and slices happens only once
M, D, c = 16000, 64, 128
indices = load(torch.arange(M).float())
Mc = ceil(M / c)  # Number of blocks of size (c,c)
I = load(torch.arange(Mc).int())
# The lines below describe a block-sparse pattern of width
# 3 around the diagonal, with blocks of size (c,c):

# 1. "i" indices are grouped in clusters of size c:
ranges_i = torch.stack( (I * c, (I+1) * c) ).clamp(0, M).t().contiguous() # [[0, c], [c, 2*c], ...]
# 2. each "i" cluster is associated to one range over "j": 
slices_i = load(torch.arange(1, Mc + 1).int())  # [1, 2, ..., Mc]
# 3. the reduction ranges are given by [[0, 2*c], [0, 3*c], [c, 4*c], ...]:
ranges_j = torch.stack( ((I-1) * c, (I+2) * c) ).clamp(0, M).t().contiguous()

# call it once to compile 
t1 = load(torch.randn(M, D))
t2 = load(torch.randn(M, D))
r = compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j)

# call many times and measure time
time1 = 0
time2 = 0
for i in range(100):
  t1 = load(torch.randn(M, D))
  t2 = load(torch.randn(M, D))
  torch.cuda.synchronize()
  start = time.time()
  r = compute_mm_diagonals_keops(t1, t2, c, M, D, indices, ranges_i, slices_i, ranges_j)
  torch.cuda.synchronize()
  time1 += time.time() - start

  start = time.time()
  r2 = compute_mm_diagonals_pytorch(t1, t2, c)
  torch.cuda.synchronize()
  time2 += time.time() - start

  if not torch.allclose(r, r2, atol=1e-05):
    print('not equal')
print(f'Time keops: {time1} s')
print(f'Time pytorch: {time2} s')
Time keops: 18.393823623657227 s
Time pytorch: 1.8735544681549072 s

Additionally, when testing the benchmarking code above, I get a stack smashing detected error:

[KeOps] Warning : Cuda libraries were not detected on the system ; using cpu only mode
*** stack smashing detected ***: terminated
Aborted (core dumped)