Input argument dimensions and calculating DiffusionEMD between vectors in a feature matrix

sgbaird commented 3 years ago

# TL;DR how to compute the earth mover's distance between arbitrary vectors using DiffusionEMD? # I'm going through the two notebooks (Line and Swiss) as well as testing on my own data. Part of this relates to #4 and defining a function with similar input/output to e.g. sklearn distance metrics.

Let's say you have 10 points embedded in 3 dimensions. We'll call this data, and say that it has 10 rows and 3 columns. data = np.random.randn(10, 3) Based on the Jupyter notebooks, it seems like the first argument of DiffusionCheb() and DiffusionTree() are the adjacency matrices (I couldn't find any documentation on this otherwise).

The size of the adjacency matrix is the same as the number of vertices on a graph, so I believe this should be done by e.g. adj = graphtools.Graph(data, use_pygsp=True).W which produces a 10 x 10 matrix.

The Line and Swiss examples show ds.labels as the 2nd argument to DiffusionCheb(). For Line, I notice that self.labels = np.eye(N), whereas for the Swiss Roll self.labels = np.repeat(np.eye(n_distributions), n_points_per_distribution, axis=0). For the Line example, if there are 100 points, then labels is 100 x 100, which is just an identity matrix.

Assuming 5 points per distribution, this looks like:

1 0 0 ... 0
1 0 0 ... 0
1 0 0 ... 0
1 0 0 ... 0
1 0 0 ... 0
0 1 0 ... 0
0 1 0 ... 0
0 1 0 ... 0
0 1 0 ... 0
0 1 0 ... 0
0 0 1 ... 0
etc.

So as the name implies, it's labeling each point by which distribution it came from. Assuming there are 50 distributions, naturally there will be 50 columns.

For the 10 x 3 data I mentioned, I think I have the adjacency matrix correct, but what about the label matrix? Assuming it's unlabeled data, is it just np.eye(data.shape[0])? Or is it something else, such as np.ones([data.shape[0],1])? If I'm understanding correctly, these two cases would suggest that each point comes from a separate distribution and that each point comes from the same distribution, respectively. What I'm hoping to calculate is the earth mover's distance between data[0] and data[1], etc.

Any help would be appreciated. Thanks!

sgbaird commented 3 years ago

Based on some simple tests, I think this is the general workflow I'm going for, but still not sure exactly how to handle labels:

# Modules
import DiffusionEMD
from DiffusionEMD import DiffusionCheb
import numpy as np
from scipy.spatial import distance_matrix

#Setup
npts = 10
data = np.random.randn(npts, 3)
adj = graphtools.Graph(data, use_pygsp=True).W
labels = np.ones([npts,1]) # ???

#DiffusionEMD
D = DiffusionCheb()
embeddings = D.fit_transform(adj, labels)
dm = distance_matrix(embeddings, embeddings, p=1)

At the very least, I'm getting the output size that I would expect. It does make me wonder about how labeling points as part of a single vs. several distributions affects the computation. Is there somewhere in the paper that addresses this? I haven't dug into the details too much yet.

atong01 commented 3 years ago

If you are trying to calculate the EMD between diracs (treating each point as a separate distribution), then you would want labels = np.eye(npts).

In general labels should be a npts x ndistributions matrix where the column sums are all 1 i.e. np.all(labels.sum(axis=1) == 1) == True. Each column represents a distribution over the points.

I'm surprised this example is definitely bugged, I would expect embeddings to be 1 x 60 by default instead of the 10 x 60 it returns. Thank you for finding this! Looks like this is a numpy broadcasting problem.

To me I would interpret labels = np.ones([npts,1]) as a single distribution with equal weight on each point. I'll probably fix and allow this input in case someone wants to run code which separately embeds distributions (possibly online) and wants to compare later.

Feel free to email me at alexander.tong@yale.edu and we can setup a talk more about this or any other questions you have.

sgbaird commented 3 years ago

Curious if you figured out where the np broadcasting issue might be (if that's the case). The shapes were a bit confusing to me as well. Thanks for the suggestions!

To make sure we're on the same page (and if you have time), do you know how these parameters would correlate with scipy.stats.wasserstein_distance inputs: u_values, v_values, u_weights, and v_weights? I know how my data fits in with these, so if you had a "translation", then I could have some more confidence that I'm inputting my data correctly.

KrishnaswamyLab / DiffusionEMD

Input argument dimensions and calculating DiffusionEMD between vectors in a feature matrix #5