getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

argKmin error when using multiple gpus #195

Closed FreyrS closed 2 years ago

FreyrS commented 2 years ago

Hi!

I'm using KeOps 1.5, PyTorch 1.9, CUDA 10.2, gcc 8.4, CMake 3.16.5 and Python 3.7.7. I have a piece of code that I'm having trouble scaling up beyond a single GPU. I'm using DDP to distribute the data over two GPUs. Here's the code that fails:

def knn(x, y, x_batch, y_batch, k=5):
    N, D = x.shape
    print("DEBUG XY",x.shape,y.shape,x.device,y.device)
    print("DEBUG BATCH",x_batch.shape,y_batch.shape,x_batch.device,y_batch.device)
    print("DEBUG BATCH",x_batch,y_batch)
    x_i = LazyTensor(x[:, None, :])
    y_j = LazyTensor(y[None, :, :])

    pairwise_distance_ij = ((x_i - y_j) ** 2).sum(-1)
    pairwise_distance_ij.ranges = diagonal_ranges(x_batch, y_batch)
    idx = pairwise_distance_ij.argKmin(K=k, axis=1)  # (N, K)
    x_ik = y[idx.view(-1)].view(N, k, D)
    dists = ((x[:, None, :] - x_ik) ** 2).sum(-1).sqrt()

    return idx, dists

The output from the print statements is:

DEBUG XY torch.Size([20976, 3]) torch.Size([20976, 3]) cuda:0 cuda:0
DEBUG BATCH torch.Size([20976]) torch.Size([20976]) cuda:0 cuda:0
DEBUG BATCH tensor([0, 0, 0, ..., 7, 7, 7], device='cuda:0') tensor([0, 0, 0, ..., 7, 7, 7], device='cuda:0')

The output is the same on the other GPU, with cuda:0 replaced by cuda:1 (and a different number of points).

The error message I get is the following:

[screenshot of the error message]

It looks to me like KeOps is trying to return the argmin rather than the K=5 nearest neighbours I'm asking for; this is not a problem on a single GPU.

I've tried running pykeops.clean_pykeops() and re-compiling my formulas; during re-compilation the output is:

[pyKeOps] Compiling libKeOpstorcha1115c1005 in /home/sverriss/.cache/pykeops-1.5-cpython-37-gpu0-1:
       formula: ArgKMin_Reduction(Sum(Square((Var(0,3,0) - Var(1,3,1)))),5,0)
       aliases: Var(0,3,0); Var(1,3,1);
       dtype  : float32
...
[pyKeOps] Compiling libKeOpstorchc222c45fd5 in /home/sverriss/.cache/pykeops-1.5-cpython-37-gpu0-1:
       formula: Sum_Reduction(Exp(Minus(Sqrt(Sum(Square((Var(0,3,0) - Var(1,3,1))))))),1)
       aliases: Var(0,3,0); Var(1,3,1);
       dtype  : float32
...

Any ideas what could be going wrong here?

Thanks! Freyr

jeanfeydy commented 2 years ago

Hi @FreyrS ,

I see! Unfortunately, I was not able to reproduce the bug on my server ("rosella") with CUDA 10.2, PyTorch 1.9, gcc 7.5.0, cmake 3.18.4 and Python 3.7.8.

For reference, the code below works perfectly fine for me:

import torch
from pykeops.torch import LazyTensor

def ranges_slices(batch):
    """Helper function for the diagonal ranges function."""
    Ns = batch.bincount()
    indices = Ns.cumsum(0)
    ranges = torch.cat((0 * indices[:1], indices))
    ranges = (
        torch.stack((ranges[:-1], ranges[1:])).t().int().contiguous().to(batch.device)
    )
    slices = (1 + torch.arange(len(Ns))).int().to(batch.device)
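    # For example, batch = [0, 0, 1, 1, 1] gives Ns = [2, 3],
    # ranges = [[0, 2], [2, 5]] (one contiguous [start, end) row per
    # batch element) and slices = [1, 2].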

    return ranges, slices

def diagonal_ranges(batch_x=None, batch_y=None):
    """Encodes the block-diagonal structure associated to a batch vector."""

    if batch_x is None and batch_y is None:
        return None  # No batch processing
    elif batch_y is None:
        batch_y = batch_x  # "symmetric" case

    ranges_x, slices_x = ranges_slices(batch_x)
    ranges_y, slices_y = ranges_slices(batch_y)

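    # KeOps expects the .ranges attribute to be a 6-tuple
    # (ranges_i, slices_i, redranges_j, ranges_j, slices_j, redranges_i),
    # which here encodes the block-diagonal sparsity pattern of the batch.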
    return ranges_x, slices_x, ranges_y, ranges_y, slices_y, ranges_x

def knn(x, y, x_batch, y_batch, k=5):
    N, D = x.shape
    print("DEBUG XY",x.shape,y.shape,x.device,y.device)
    print("DEBUG BATCH",x_batch.shape,y_batch.shape,x_batch.device,y_batch.device)
    print("DEBUG BATCH",x_batch,y_batch)
    x_i = LazyTensor(x[:, None, :])
    y_j = LazyTensor(y[None, :, :])

    pairwise_distance_ij = ((x_i - y_j) ** 2).sum(-1)
    pairwise_distance_ij.ranges = diagonal_ranges(x_batch, y_batch)
    idx = pairwise_distance_ij.argKmin(K=k, axis=1)  # (N, K)
    x_ik = y[idx.view(-1)].view(N, k, D)
    dists = ((x[:, None, :] - x_ik) ** 2).sum(-1).sqrt()

    return idx, dists

for i in range(torch.cuda.device_count()):
    device = f"cuda:{i}"
    N, M, D = 2, 3, 3

    x = torch.randn(N, D).to(device)
    y = torch.randn(M, D).to(device)
    x_batch = torch.zeros(N).int().to(device)
    y_batch = torch.zeros(M).int().to(device)
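    # Both batch vectors are all zeros here: a single block per device,
    # i.e. no real batching in this minimal test.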

    print(knn(x, y, x_batch, y_batch, k=5), "\n")

What about you? Depending on your answer, we may be able to pin the bug down to PyTorch's DDP system or to some version mismatch. In any case, @joanglaunes will release a "beta" version of our new compilation engine very soon - this may give us a fresh start as far as CUDA / CMake-related problems are concerned.

Best regards, Jean

FreyrS commented 2 years ago

In collaboration with @jeanfeydy, we figured out that x_batch and y_batch were actually being allocated on the CPU while x and y were on the GPU, which was causing the segmentation faults. The CPU allocation was due to a bug in a new version of torch_geometric; the corresponding issue is at https://github.com/pyg-team/pytorch_geometric/issues/3213.
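
For anyone hitting the same crash: a small defensive wrapper around the knn function above (just a sketch, not part of the original code) would either fail loudly on the device mismatch or fix it outright by moving the batch vectors onto the same device as the point clouds before KeOps builds its ranges:

def knn_safe(x, y, x_batch, y_batch, k=5):
    # Hypothetical guard around knn(): KeOps needs the batch vectors on the
    # same device as the point clouds, otherwise the block-sparse reduction
    # can crash with a segmentation fault instead of a clear error.
    assert x.device == y.device, "x and y must live on the same device"
    x_batch = x_batch.to(x.device)
    y_batch = y_batch.to(y.device)
    return knn(x, y, x_batch, y_batch, k=k)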