getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License
1.03k stars 65 forks source link

KMeans - Distances #244

Closed nsriniva03 closed 2 years ago

nsriniva03 commented 2 years ago

How do you get the distances/cosine similarities between the cluster center from the Gram matrix symbol? I would like to return the cluster labels and distance from the kmeans function.

jeanfeydy commented 2 years ago

Hi @nsriniva03 ,

Thanks for your interest in this library. We provide a tutorial for K-Means with the cosine similarity metric here. The code is commented line-by-line: if you do not understand something, could you please tell us which parts block you?

Best regards, Jean

nsriniva03 commented 2 years ago

Hi Jean,

Below is example code from the K-Means tutorial,

def KMeans_cosine(x, K=10, Niter=10, verbose=True):
    """Implements Lloyd's algorithm for the Cosine similarity metric."""

    start = time.time()
    N, D = x.shape  # Number of samples, dimension of the ambient space

    c = x[:K, :].clone()  # Simplistic initialization for the centroids
    # Normalize the centroids for the cosine similarity:
    c = torch.nn.functional.normalize(c, dim=1, p=2)

    x_i = LazyTensor(x.view(N, 1, D))  # (N, 1, D) samples
    c_j = LazyTensor(c.view(1, K, D))  # (1, K, D) centroids

    # K-means loop:
    # - x  is the (N, D) point cloud,
    # - cl is the (N,) vector of class labels
    # - c  is the (K, D) cloud of cluster centroids
    for i in range(Niter):

        # E step: assign points to the closest cluster -------------------------
        S_ij = x_i | c_j  # (N, K) symbolic Gram matrix of dot products
        cl = S_ij.argmax(dim=1).long().view(-1)  # Points -> Nearest cluster

        # M step: update the centroids to the normalized cluster average: ------
        # Compute the sum of points per cluster:
        c.zero_()
        c.scatter_add_(0, cl[:, None].repeat(1, D), x)

        # Normalize the centroids, in place:
        c[:] = torch.nn.functional.normalize(c, dim=1, p=2)

    if verbose:  # Fancy display -----------------------------------------------
        if use_cuda:
            torch.cuda.synchronize()
        end = time.time()
        print(
            f"K-means for the cosine similarity with {N:,} points in dimension {D:,}, K = {K:,}:"
        )
        print(
            "Timing for {} iterations: {:.5f}s = {} x {:.5f}s\n".format(
                Niter, end - start, Niter, (end - start) / Niter
            )
        )

    return cl, c

cl - represents the cluster labels c - represents the cluster centroids

I would like to also return another variable called similarities. This variable should contain the cosine similarity between the x_i sample and the cluster center. Let's assume x_i sample belongs to cluster 1 (number of clusters is 10), then I would like to return the cosine similarity between x_i and c_1 (centroid of cluster 1). I am trying to figure out how to access the value of S_ij. Cause I am assuming that S_ij contains the dot product.

jeanfeydy commented 2 years ago

Hi @nsriniva03,

I see! In this case, what you are looking for is the .max() reduction. If you are also interested in cl, which is likely, you can compute both the value of the similarity with the nearest cluster and the label of the nearest cluster in one pass through the data with .max_argmax().

Finally, please note that you could also compute the cluster for each point using PyTorch advanced indexing and then compute the similarity “by hand” with a code along the lines of:

centroids = torch.gather(c, 0, cl.view(N,1).repeat(1,D))
similarities = (x * centroids).sum(-1)

Does this answers your question?

nsriniva03 commented 2 years ago

Hi Jean,

This answers my question. Thank you