Open jlmelville opened 2 years ago
@jlmelville, sorry for the late response. Yes, with PyMDE, you can use pre-computed k-nearest neighbor data, with the caveat that you'll need to manually wire up the MDE problem (instead of calling preserve_neighbors
).
This notebook has an example that shows how to make a neighborhood preserving embedding from scratch. Start reading from the section called "Embeddings, from scratch", which shows how to create the k-nearest neighbor graph:
https://github.com/cvxgrp/pymde/blob/main/examples/mnist.ipynb
Then, the code for creating a neighborhood-preserving embedding starts in the section called "Including dissimilar pairs".
For more in-depth documentation:
Hope that helps!
@akshayka thanks for pointing me to notebook. I think from looking at the functions called there, if I want to make my own Graph
from a precomputed knn, and ignoring any complications due to maximum distances or custom weights, I would do something like the following for k = 15 (all of this stolen shamelessly from the pymde source code):
import numpy as np
import pymde
import pynndescent
import scipy.sparse as sp
from pymde.preprocess.graph import Graph
mnist = pymde.datasets.MNIST()
# assume we use pynndescent for the purposes of this example
# we just need a precomputed n_items x n_neighbors numpy array of indices from somewhere
mnist_nnd = pynndescent.NNDescent(
mnist.data,
n_neighbors=16,
max_candidates=60,
).neighbor_graph
# ignore the "self" neighbors
idx = mnist_nnd[0][:, 1:]
# create the edge list
n_items = idx.shape[0]
n_neighbors = idx.shape[1]
items = np.arange(n_items)
items = np.repeat(items, n_neighbors)
edges = np.stack([items, idx.flatten()], axis=1)
# canonicalize edges
flip_idx = edges[:, 0] > edges[:, 1]
edges[flip_idx] = np.stack([edges[flip_idx][:, 1], edges[flip_idx][:, 0]], axis=1)
# create a sparse matrix
rows = edges[:, 0]
cols = edges[:, 1]
weights = np.ones(edges.shape[0], dtype=np.float32)
graph = sp.coo_matrix((weights, (rows, cols)), shape=(n_items, n_items))
# symmetrize
graph = graph + graph.T
graph = graph.tocsr()
graph = Graph(graph)
then I can use graph.edges
to pass to e.g. the edges
parameter of pymde.MDE
. Does that seem right?
Something like that does look right! Were you able to get it working?
This is probably a common enough use case that we should consider including a helper function in the library to automate it.
Yes thanks! I was able to make it work, although on top of a helper function to generate a Graph from nearest neighbors indices, preserve_neighbors
itself has some useful functionality around e.g. initialization so refactoring that function to allow a precomputed Graph would make a big difference. Looking at it, it might involve a fair bit of extra logic or extra parameters to account for that, and I can imagine testing for any corner cases might not be worth the effort if no-one else has ever needed it.
Great! Thanks for reporting back.
At the very least, I will keep this issue open so others can chime in if they'd like this functionality as well.
Hello, is it possible to use
preserve_neighbors
with pre-computed k-nearest neighbors data in the distance-matrix-like form used by the likes of UMAP and t-SNE? In this format, the indices and distances are stored as separate matrices of shapen_objects, n_neighbors
. The element (i, j) in the index matrix is the index of thej
th nearest neighbor ofi
, and the equivalent element in the distance matrix is the distance between those two objects.I (think) I can work out how to manually convert that into the edges/weights form used by
pymde.Graph
(various acts ofmelt
ing andravel
ing), but I don't get the results I expect compared to passing the data directly. I think that's because whenpreprocess.generic.k_nearest_neighbors
is called, for the graph code path,graph.k_nearest_neighbors
hasgraph_distances=True
. So another way to put it is: is there a way to pass aGraph
so that it is interpreted as a distance matrix? This would be helpful because calculating the k-nearest neighbors is usually the slowest part of this sort of dimensionality reduction method.