getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License
1.02k stars 65 forks source link

KeOps Block Sparsity: throwing CUDA error #350

Open kvndhrty opened 6 months ago

kvndhrty commented 6 months ago

Hi KeOps team,

I'm moving our PyTorch layer QuadConv to keops to save memory overhead and boost speed. Thus far everything has been fairly smooth, and we see big speed increases on the CPU.

When I move this code to the GPU it throws a inscrutable CUDA error at runtime, but only if I use the ranges field of the LazyTensor before I perform the reduction. Otherwise, without the block sparsity feature this runs smoothly on the GPU.

from pykeops.torch import LazyTensor
from pykeops.torch.cluster import grid_cluster, from_matrix, sort_clusters, cluster_ranges_centroids
import torch

## DEFINE VARS AND FUNCTIONS
def apply_lazy_mlp(x, lazy_list):
    """
    Applies a list of lazy tensors to x
    """
    out = x
    for lazy in lazy_list:
        out = lazy.matvecmult(out).sin()
    return out

B, N, M, I, J = 8, 2500, 2500, 4, 16  # Batch, Sample size, Sample Size, Channel in, Channel out

x = torch.randn(N, 2).cuda()
y = torch.randn(M, 2).cuda()
f = torch.randn(B, N, I).cuda()
rho = torch.randn(1, N, 1).cuda()

L1 = LazyTensor(torch.randn(16).cuda())
L2 = LazyTensor(torch.randn(64).cuda())
L3 = LazyTensor(torch.randn(8*I*J).cuda())

lazy_list = [L1, L2, L3]

## MAKE RANGES

eps = 0.1  # Size of our square bins

alpha = 1/100

x_labels = grid_cluster(x, eps)  # class labels
y_labels = grid_cluster(y, eps)  # class labels

x_ranges, x_centroids, _ = cluster_ranges_centroids(x, x_labels)
y_ranges, y_centroids, _ = cluster_ranges_centroids(y, y_labels)

x, x_labels = sort_clusters(x, x_labels)
y, y_labels = sort_clusters(y, y_labels)

keep = (((y_centroids[None, :, :]-x_centroids[:, None, :])**2).sum(-1) - 1 / alpha) < 0

ranges_ij = from_matrix(x_ranges, y_ranges, keep)

## RUN KEOPS CODE
x_i = LazyTensor(x.view(N, 1, 2))
y_j = LazyTensor(y.view(1, M, 2))

f_i = LazyTensor(f.view(B, N, 1, I))

ex_tensor = LazyTensor(torch.ones(1, 1, 1, J).cuda())

ex_f_i = f_i.keops_kron(ex_tensor, [I], [J])

rho_i = LazyTensor(rho.view(N, 1, 1))

G_ij = apply_lazy_mlp(x_i - y_j, lazy_list) * rho_i * ex_f_i

#G_ij.ranges = ranges_ij

out = G_ij.sum_reduction(axis=1).reshape(B, M, I, J).sum(dim=-2)

Commenting in the second to last line makes this code die on my machine.

You can strip down G_ij to something like:

G_ij = apply_lazy_mlp(x_i - y_j, lazy_list)

and it will run with the ranges argument, so the issue seems to be the shape of tensor G_ij and its interaction with ranges but this error only occurs on CUDA.

Any advice?

kvndhrty commented 6 months ago

@bcharlier suggested I install via github to see if that solved my problem. I was running with the tagged 2.1.2 version of keops, installed via pip, but updated to commit dcec73cb066ecf4c13efe7612a585022123d6224 and my results are the same as before.

bcharlier commented 5 months ago

Hi @kvndhrty , the problems come from the kron product... I will test further how to fix that

bcharlier commented 5 months ago

after some investigations, this is a NotImplemented Error: This is not possible to use ranges and batch dimension at the same time...