cvxgrp / pymde

Minimum-distortion embedding with PyTorch
https://pymde.org
Apache License 2.0
526 stars 27 forks source link

Using pre-computed knn with preserve_neighbors #65

Open jlmelville opened 1 year ago

jlmelville commented 1 year ago

Hello, is it possible to use preserve_neighbors with pre-computed k-nearest neighbors data in the distance-matrix-like form used by the likes of UMAP and t-SNE? In this format, the indices and distances are stored as separate matrices of shape n_objects, n_neighbors. The element (i, j) in the index matrix is the index of the jth nearest neighbor of i, and the equivalent element in the distance matrix is the distance between those two objects.

I (think) I can work out how to manually convert that into the edges/weights form used by pymde.Graph (various acts of melting and raveling), but I don't get the results I expect compared to passing the data directly. I think that's because when preprocess.generic.k_nearest_neighbors is called, for the graph code path, graph.k_nearest_neighbors has graph_distances=True. So another way to put it is: is there a way to pass a Graph so that it is interpreted as a distance matrix? This would be helpful because calculating the k-nearest neighbors is usually the slowest part of this sort of dimensionality reduction method.

akshayka commented 1 year ago

@jlmelville, sorry for the late response. Yes, with PyMDE, you can use pre-computed k-nearest neighbor data, with the caveat that you'll need to manually wire up the MDE problem (instead of calling preserve_neighbors).

This notebook has an example that shows how to make a neighborhood preserving embedding from scratch. Start reading from the section called "Embeddings, from scratch", which shows how to create the k-nearest neighbor graph:

https://github.com/cvxgrp/pymde/blob/main/examples/mnist.ipynb

Then, the code for creating a neighborhood-preserving embedding starts in the section called "Including dissimilar pairs".

For more in-depth documentation:

Hope that helps!

jlmelville commented 1 year ago

@akshayka thanks for pointing me to notebook. I think from looking at the functions called there, if I want to make my own Graph from a precomputed knn, and ignoring any complications due to maximum distances or custom weights, I would do something like the following for k = 15 (all of this stolen shamelessly from the pymde source code):

import numpy as np
import pymde
import pynndescent
import scipy.sparse as sp
from pymde.preprocess.graph import Graph

mnist = pymde.datasets.MNIST()

# assume we use pynndescent for the purposes of this example
# we just need a precomputed n_items x n_neighbors numpy array of indices from somewhere
mnist_nnd = pynndescent.NNDescent(
    mnist.data,
    n_neighbors=16,
    max_candidates=60,
).neighbor_graph

# ignore the "self" neighbors
idx = mnist_nnd[0][:, 1:]

# create the edge list
n_items = idx.shape[0]
n_neighbors = idx.shape[1]
items = np.arange(n_items)
items = np.repeat(items, n_neighbors)
edges = np.stack([items, idx.flatten()], axis=1)

# canonicalize edges
flip_idx = edges[:, 0] > edges[:, 1]
edges[flip_idx] = np.stack([edges[flip_idx][:, 1], edges[flip_idx][:, 0]], axis=1)

# create a sparse matrix
rows = edges[:, 0]
cols = edges[:, 1]
weights = np.ones(edges.shape[0], dtype=np.float32)
graph = sp.coo_matrix((weights, (rows, cols)), shape=(n_items, n_items))

# symmetrize
graph = graph + graph.T
graph = graph.tocsr()

graph = Graph(graph)

then I can use graph.edges to pass to e.g. the edges parameter of pymde.MDE. Does that seem right?

akshayka commented 1 year ago

Something like that does look right! Were you able to get it working?

This is probably a common enough use case that we should consider including a helper function in the library to automate it.

jlmelville commented 1 year ago

Yes thanks! I was able to make it work, although on top of a helper function to generate a Graph from nearest neighbors indices, preserve_neighbors itself has some useful functionality around e.g. initialization so refactoring that function to allow a precomputed Graph would make a big difference. Looking at it, it might involve a fair bit of extra logic or extra parameters to account for that, and I can imagine testing for any corner cases might not be worth the effort if no-one else has ever needed it.

akshayka commented 1 year ago

Great! Thanks for reporting back.

At the very least, I will keep this issue open so others can chime in if they'd like this functionality as well.