getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

cuCtxSynchronize() failed with error CUDA_ERROR_ILLEGAL_ADDRESS (argKmin) #364

Closed by lindafei01 3 months ago

lindafei01 commented 3 months ago

Hi, I encountered “[KeOps] error: cuCtxSynchronize() failed with error CUDA_ERROR_ILLEGAL_ADDRESS” when calling “idx = pairwise_distance_ij.argKmin(K=k, axis=1)”.

The relevant functions are below.

import torch
from pykeops.torch import LazyTensor


def ranges_slices(batch):
    """Helper function for the diagonal ranges function."""
    # bincount is run on the CPU when the tensor lives on an MPS device.
    if 'mps' in str(batch.device):
        Ns = batch.to('cpu').bincount()
    else:
        Ns = batch.bincount()

    indices = Ns.cumsum(0)
    ranges = torch.cat((0 * indices[:1], indices))
    ranges = (
        torch.stack((ranges[:-1], ranges[1:])).t().int().contiguous().to(batch.device)
    )
    slices = (1 + torch.arange(len(Ns))).int().to(batch.device)

    return ranges, slices

def diagonal_ranges(batch_x=None, batch_y=None):
    """Encodes the block-diagonal structure associated to a batch vector."""

    if batch_x is None and batch_y is None:
        return None  # No batch processing
    elif batch_y is None:
        batch_y = batch_x  # "symmetric" case

    ranges_x, slices_x = ranges_slices(batch_x)
    ranges_y, slices_y = ranges_slices(batch_y)

    return ranges_x, slices_x, ranges_y, ranges_y, slices_y, ranges_x

def knn_atoms(x, y, x_batch, y_batch, k):
    k = k+1
    N, D = x.shape
    x_i = LazyTensor(x[:, None, :])
    y_j = LazyTensor(y[None, :, :])

    pairwise_distance_ij = ((x_i - y_j) ** 2).sum(-1)
    pairwise_distance_ij.ranges = diagonal_ranges(x_batch, y_batch)

    # N.B.: KeOps doesn't yet support backprop through Kmin reductions...
    # dists, idx = pairwise_distance_ij.Kmin_argKmin(K=k,axis=1)
    # So we have to re-compute the values ourselves:
    idx = pairwise_distance_ij.argKmin(K=k, axis=1)  # (N, K)
    x_ik = y[idx.view(-1)].view(N, k, D)
    dists = ((x[:, None, :] - x_ik) ** 2).sum(-1)

    return idx, dists

jeanfeydy commented 3 months ago

Hi @lindafei01,

Thanks for your interest in our work. Since ranges_slices and diagonal_ranges are copy-pasted directly from the code I wrote for dMaSIF, the most likely explanation is a problem with x_batch or y_batch. The ranges syntax is very low-level and manipulates pointers directly (without any checks, for performance reasons), so any departure from our conventions throws CUDA_ERROR_ILLEGAL_ADDRESS errors (i.e. segmentation faults).

Could you maybe provide us with a minimal failing example, including specific values for x, y, x_batch, y_batch and k?

Best regards, Jean
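
Editorial note: to make the ranges convention concrete, here is a dependency-free sketch of what ranges_slices computes. The helper name ranges_from_batch is hypothetical and not part of KeOps or dMaSIF. It mirrors the bincount/cumsum logic of the posted code on plain Python lists (for example x_batch.tolist()), and assumes each row [start, end) is meant to designate a contiguous block of points belonging to one batch.

```python
def ranges_from_batch(batch):
    """Pure-Python mirror of ranges_slices, for inspection only.

    `batch` is a non-empty list of non-negative ints, e.g. x_batch.tolist().
    """
    counts = [0] * (max(batch) + 1)  # same role as batch.bincount()
    for b in batch:
        counts[b] += 1

    ends, total = [], 0  # same role as Ns.cumsum(0)
    for c in counts:
        total += c
        ends.append(total)

    starts = [0] + ends[:-1]
    ranges = [[s, e] for s, e in zip(starts, ends)]
    slices = list(range(1, len(counts) + 1))
    return ranges, slices


print(ranges_from_batch([0, 0, 0, 1, 1]))
# ([[0, 3], [3, 5]], [1, 2])

# An unsorted vector yields the same ranges, although rows 0..2 no longer
# all belong to batch 0:
print(ranges_from_batch([0, 1, 0, 1, 0]))
# ([[0, 3], [3, 5]], [1, 2])
```

The second call shows that the ranges depend only on the counts per batch index, so the helper functions cannot detect a batch vector that is out of order.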

lindafei01 commented 3 months ago

Thanks for your prompt response! I've identified the problem in my implementation of x_batch and y_batch. The problem is solved now.
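
Editorial note: the thread does not say what was wrong with x_batch and y_batch. As a hedged sketch only, the check below lists a few properties that the posted ranges_slices code appears to rely on: one batch index per point, non-negative indices, a sorted vector, and the same number of batches on both sides. The helper check_batch_vectors is hypothetical, is not part of KeOps, and takes plain Python lists so that it can run before any GPU call. These conditions are assumptions drawn from reading the code above, not a statement of the official KeOps requirements.

```python
def check_batch_vectors(n_x, n_y, x_batch, y_batch):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper. n_x and n_y are the numbers of rows of x and y;
    x_batch and y_batch are lists of ints (e.g. x_batch.tolist()).
    """
    problems = []
    for name, n, batch in (("x_batch", n_x, x_batch), ("y_batch", n_y, y_batch)):
        # One batch index per point, otherwise the ranges do not match the data.
        if len(batch) != n:
            problems.append(f"{name} has {len(batch)} entries for {n} points")
        # bincount only accepts non-negative integers.
        if any(b < 0 for b in batch):
            problems.append(f"{name} contains negative indices")
        # Contiguous blocks require a non-decreasing batch vector.
        if any(a > b for a, b in zip(batch, batch[1:])):
            problems.append(f"{name} is not sorted")
    # ranges_x and ranges_y are paired row by row in diagonal_ranges.
    if x_batch and y_batch and max(x_batch) != max(y_batch):
        problems.append("x_batch and y_batch describe different numbers of batches")
    return problems


print(check_batch_vectors(5, 4, [0, 0, 0, 1, 1], [0, 0, 1, 1]))
# []
```

Calling it with x.shape[0], y.shape[0], x_batch.tolist() and y_batch.tolist() just before knn_atoms is one way to surface a bad batch vector as a readable message instead of a CUDA error.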