pmisiakos opened this issue 3 years ago
Hello @pmisiakos, sorry for the late reply, and thank you for this question: indeed our K-means tutorial is a bit too simplistic and we can definitely improve it.

Thinking about K-means++, I first tried to implement it with KeOps, but then realized that KeOps is not needed for this algorithm. The K-means++ initialization is inherently sequential, so there is no way to avoid the global loop that picks the new cluster centers one by one. At each iteration, however, one can avoid memory overflows by only computing the squared distances between all points and the single center chosen at the previous iteration, then taking the element-wise minimum with a stored array of minimal distances; all of this is O(N) in memory and computation.

So KeOps is not needed for K-means++, but I ran tests with a PyTorch implementation of the algorithm, because the interesting question now is whether it can be fast enough to be worthwhile compared to random initialization. Below is the code, where I also added a convergence stopping rule for the main K-means loop. Note that I cannot fully guarantee this code; it seems to work, but it would require further testing!

Doing experiments with random data, it turns out that for N=1e6, K=1000, D=100, the K-means++ initialization takes around 9 seconds to complete, while each iteration of the main K-means algorithm takes about 0.1 second. This means that K-means++ initialization is interesting only if it saves more than 90 iterations of the main K-means loop. For random (uniform or normal) data, this does not appear to be the case, but my intuition is that this is expected. Do you know standard datasets that clearly show the improvement of K-means++ initialization over random initialization?
Here is the code (adapted from our Kmeans tutorial):
########################################################################
# Setup
# -----------------
# Standard imports:
import time
import torch
from matplotlib import pyplot as plt
from pykeops.torch import LazyTensor
use_cuda = torch.cuda.is_available()
dtype = torch.float32 if use_cuda else torch.float64
device_id = "cuda:0" if use_cuda else "cpu"
########################################################################
# Simple implementation of the K-means algorithm:
def KMeans(x, K=10, Niter=100, verbose=True, init="default"):
    """Implements Lloyd's algorithm for the Euclidean metric."""
    N, D = x.shape  # Number of samples, dimension of the ambient space
    if verbose:
        print(
            f"K-means for the Euclidean metric with {N:,} points in dimension {D:,}, K = {K:,}:"
        )

    start = time.time()
    if init == "default":
        c = x[:K, :].clone()  # Simplistic initialization for the centroids
    elif init == "kmeans++":
        c = kmeans_plus_plus(x, K)
    if verbose:
        if use_cuda:
            torch.cuda.synchronize()
        end = time.time()
        print("time for {} initialization: {:.5f}s".format(init, end - start))

    start = time.time()
    x_i = LazyTensor(x.view(N, 1, D))  # (N, 1, D) samples
    c_j = LazyTensor(c.view(1, K, D))  # (1, K, D) centroids

    # E step: assign points to the closest cluster -------------------------
    D_ij = ((x_i - c_j) ** 2).sum(-1)  # (N, K) symbolic squared distances
    cl = D_ij.argmin(dim=1).long().view(-1)  # Points -> Nearest cluster

    # K-means loop:
    # - x is the (N, D) point cloud,
    # - cl is the (N,) vector of class labels
    for i in range(Niter):
        # keep record of current cl for detecting convergence
        cl_old = cl.clone()

        # M step: update the centroids to the normalized cluster average: ------
        # Compute the sum of points per cluster:
        c.zero_()
        c.scatter_add_(0, cl[:, None].repeat(1, D), x)

        # Divide by the number of points per cluster:
        Ncl = torch.bincount(cl, minlength=K).type_as(c).view(K, 1)
        c /= Ncl  # in-place division to compute the average

        # E step: assign points to the closest cluster -------------------------
        D_ij = ((x_i - c_j) ** 2).sum(-1)  # (N, K) symbolic squared distances
        cl = D_ij.argmin(dim=1).long().view(-1)  # Points -> Nearest cluster

        # detect convergence
        if torch.all(cl == cl_old):
            Niter = i
            break

    if Niter > 0 and verbose:  # Fancy display -----------------------------
        if use_cuda:
            torch.cuda.synchronize()
        end = time.time()
        print(
            "Timing for {} iterations: {:.5f}s = {} x {:.5f}s\n".format(
                Niter, end - start, Niter, (end - start) / Niter
            )
        )

    return cl, c
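########################################################################
# Side note on the M step above: scatter_add_ sums the coordinates of all
# points assigned to each cluster. For instance, with toy values
#     x = [[1., 0.], [3., 0.], [0., 5.]]  and  cl = [0, 0, 1],
# c.scatter_add_(0, cl[:, None].repeat(1, D), x) gives c = [[4., 0.], [0., 5.]],
# and the division by Ncl = [[2.], [1.]] then turns these sums into cluster means.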
def kmeans_plus_plus(x, K):
    # Kmeans++ initialization
    N, D = x.shape
    c = torch.empty(K, D, dtype=x.dtype, device=x.device)

    # 1. Choose one center uniformly at random among the data points.
    ind = int(torch.floor(torch.rand(1) * N))
    c[0, :] = x[ind, :]

    # 2. For each data point x not chosen yet, compute D(x)^2,
    # the squared distance between x and the nearest center that has already been chosen.
    # N.B. sq_dists is initialized with infinity values and will be updated through iterations
    sq_dists = 1 / torch.zeros(N, device=x.device)
    # N.B. invarangeN below is used later in step 3
    invarangeN = torch.arange(N, 0, -1, device=x.device, dtype=torch.float32)
    for k in range(K - 1):
        sq_dists = torch.minimum(sq_dists, ((x - c[k, :]) ** 2).sum(-1))

        # 3. Choose one new data point at random as a new center,
        # using a weighted probability distribution where a point x
        # is chosen with probability proportional to D(x)^2.
        distrib = torch.cumsum(sq_dists, dim=0)
        ind = torch.argmax(invarangeN * (float(torch.rand(1)) * distrib[-1] < distrib))
        c[k + 1, :] = x[ind, :]
    return c
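########################################################################
# Side note on step 3 above: the cumsum/argmax trick picks the first index
# whose running cumulative sum exceeds a uniform random threshold, i.e. it
# performs inverse-CDF sampling with probability proportional to D(x)^2.
# The same weighted draw could also be written with torch.multinomial, as
# mentioned in the original question. Below is only a rough, untested sketch
# of such a variant (kmeans_plus_plus_multinomial is just an illustrative
# name, not a function used elsewhere in this script):
def kmeans_plus_plus_multinomial(x, K):
    N, D = x.shape
    c = torch.empty(K, D, dtype=x.dtype, device=x.device)
    # First center: uniform draw among the data points.
    ind = int(torch.randint(N, (1,)))
    c[0, :] = x[ind, :]
    sq_dists = torch.full((N,), float("inf"), dtype=x.dtype, device=x.device)
    for k in range(K - 1):
        # Running minimum of squared distances to the centers chosen so far.
        sq_dists = torch.minimum(sq_dists, ((x - c[k, :]) ** 2).sum(-1))
        # Draw one index with probability proportional to the squared distances.
        ind = int(torch.multinomial(sq_dists, 1))
        c[k + 1, :] = x[ind, :]
    return c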
###############################################################
# K-means in 2D
# ----------------------
# First experiment with N=10,000 points in dimension D=2, with K=50 classes:
#
N, D, K = 10000, 2, 50
###############################################################
# Define our dataset:
x = 0.7 * torch.randn(N, D, dtype=dtype, device=device_id) + 0.3
###############################################################
# Perform the computation:
cl, c = KMeans(x, K, Niter=1000)
cl, c = KMeans(x, K, Niter=1000, init="kmeans++")
###############################################################
# Fancy display:
plt.figure(figsize=(8, 8))
plt.scatter(x[:, 0].cpu(), x[:, 1].cpu(), c=cl.cpu(), s=30000 / len(x), cmap="tab10")
plt.scatter(c[:, 0].cpu(), c[:, 1].cpu(), c="black", s=50, alpha=0.8)
plt.axis([-2, 2, -2, 2])
plt.tight_layout()
plt.show()
####################################################################
# K-means in dimension 100
# -------------------------
# Second experiment with N=1,000,000 points in dimension D=100, with K=1,000 classes:
if use_cuda:
    N, D, K = 1000000, 100, 1000
    x = torch.randn(N, D, dtype=dtype, device=device_id)
    cl, c = KMeans(x, K, Niter=1000)
    cl, c = KMeans(x, K, Niter=1000, init="kmeans++")
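####################################################################
# K-means on clustered synthetic data (sketch)
# --------------------------------------------
# A rough way to probe the question above (does K-means++ pay off on
# non-uniform data?) is to draw the points from well-separated Gaussian
# blobs instead of a single normal distribution. This is only a sketch,
# not something I have benchmarked:
if use_cuda:
    N, D, K = 1000000, 100, 1000
    # K random blob centers, spread out so that the clusters are well separated:
    blob_centers = 10.0 * torch.randn(K, D, dtype=dtype, device=device_id)
    labels = torch.randint(K, (N,), device=device_id)
    x = blob_centers[labels] + 0.1 * torch.randn(N, D, dtype=dtype, device=device_id)
    cl, c = KMeans(x, K, Niter=1000)
    cl, c = KMeans(x, K, Niter=1000, init="kmeans++")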
Hi!
I was trying to apply kmeans to big data and I was really surprised by the capability of your library for this task!
I was trying to implement kmeans with more advanced initializations. Random initialization is easy to write without running into memory overflows. However, kmeans++ seems more challenging. Probably a random choice function that follows a specific distribution needs to be implemented first (e.g. something similar to torch.multinomial).
Do you think it is possible to add this feature to your library? Thanks in advance!
Cheers, Panagiotis Misiakos.