Open sgbaird opened 3 years ago
Based on some simple tests, I think this is the general workflow I'm going for, but I'm still not sure exactly how to handle labels:
# Modules
import graphtools
import numpy as np
from DiffusionEMD import DiffusionCheb
from scipy.spatial import distance_matrix
# Setup
npts = 10
data = np.random.randn(npts, 3)
adj = graphtools.Graph(data, use_pygsp=True).W  # npts x npts adjacency matrix
labels = np.ones([npts, 1])  # ???
# DiffusionEMD
D = DiffusionCheb()
embeddings = D.fit_transform(adj, labels)
dm = distance_matrix(embeddings, embeddings, p=1)  # pairwise L1 distances between embeddings
At the very least, I'm getting the output size that I would expect. It does make me wonder about how labeling points as part of a single vs. several distributions affects the computation. Is there somewhere in the paper that addresses this? I haven't dug into the details too much yet.
If you are trying to calculate the EMD between diracs (treating each point as a separate distribution), then you would want labels = np.eye(npts).
In general labels should be an npts x ndistributions matrix where the column sums are all 1, i.e. np.allclose(labels.sum(axis=0), 1) is True. Each column represents a distribution over the points.
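A minimal sketch of label matrices that follow this description, taking "each column is a distribution over the points" at face value. The variable names and the helper columns_are_distributions are hypothetical, and whether DiffusionCheb rescales un-normalized input internally is not established in this thread.

```python
import numpy as np

npts = 10

# Case 1: every point is its own distribution (a Dirac at each point).
dirac_labels = np.eye(npts)                    # shape (npts, npts)

# Case 2: a single distribution with equal weight on every point.
# Assumption: the column is normalized by hand so that it sums to 1.
uniform_labels = np.ones((npts, 1)) / npts     # shape (npts, 1)

# Case 3: several distributions with arbitrary non-negative weights,
# normalized column by column.
rng = np.random.default_rng(0)
raw = rng.random((npts, 4))
weighted_labels = raw / raw.sum(axis=0, keepdims=True)   # shape (npts, 4)


def columns_are_distributions(labels):
    """Return True if all entries are non-negative and every column sums to 1."""
    return bool(np.all(labels >= 0) and np.allclose(labels.sum(axis=0), 1.0))


for name, labels in [("dirac", dirac_labels),
                     ("uniform", uniform_labels),
                     ("weighted", weighted_labels)]:
    print(name, labels.shape, columns_are_distributions(labels))
```

Under this reading, np.ones([npts, 1]) fails the check as written because its single column sums to npts rather than 1.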
I'm surprised; this example is definitely bugged. I would expect embeddings to be 1 x 60 by default instead of the 10 x 60 it returns. Thank you for finding this! Looks like this is a numpy broadcasting problem.
I would interpret labels = np.ones([npts,1]) as a single distribution with equal weight on each point. I'll probably fix this and allow this input in case someone wants to run code which separately embeds distributions (possibly online) and wants to compare them later.
Feel free to email me at alexander.tong@yale.edu and we can set up a time to talk more about this or any other questions you have.
Curious if you figured out where the np broadcasting issue might be (if that's the case). The shapes were a bit confusing to me as well. Thanks for the suggestions!
To make sure we're on the same page (and if you have time), do you know how these parameters would correlate with the scipy.stats.wasserstein_distance inputs: u_values, v_values, u_weights, and v_weights? I know how my data fits in with these, so if you had a "translation", then I could have some more confidence that I'm inputting my data correctly.
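One possible reading of that translation, sketched for 1-D data only and not confirmed anywhere in this thread: the point locations play the role of both u_values and v_values (a shared support), and two columns of labels play the role of u_weights and v_weights. The positions array and the helper emd_between_columns are hypothetical. DiffusionEMD works from a graph adjacency matrix rather than raw coordinates, so its values need not match scipy's.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical 1-D support shared by every distribution.
positions = np.array([0.0, 1.0, 2.0, 3.0])

# One Dirac per point: column i puts all of its mass on positions[i].
labels = np.eye(len(positions))


def emd_between_columns(positions, labels, i, j):
    """1-D EMD between the distributions stored in columns i and j of labels."""
    return wasserstein_distance(
        positions, positions,      # u_values, v_values: the same support
        u_weights=labels[:, i],    # mass that distribution i places on each point
        v_weights=labels[:, j],    # mass that distribution j places on each point
    )


d01 = emd_between_columns(positions, labels, 0, 1)
d03 = emd_between_columns(positions, labels, 0, 3)
print(d01, d03)
```

With Dirac columns the result reduces to the distance between the two points, which makes it a convenient sanity check before moving to weighted columns.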
# TL;DR how to compute the earth mover's distance between arbitrary vectors using DiffusionEMD?
I'm going through the two notebooks (Line and Swiss) as well as testing on my own data. Part of this relates to #4 and defining a function with similar input/output to e.g. sklearn distance metrics.
Let's say you have 10 points embedded in 3 dimensions. We'll call this data, and say that it has 10 rows and 3 columns: data = np.random.randn(10, 3)
Based on the Jupyter notebooks, it seems like the first argument of DiffusionCheb() and DiffusionTree() is the adjacency matrix (I couldn't find any documentation on this otherwise). The adjacency matrix has one row and one column per vertex of the graph, so I believe this should be done by e.g. adj = graphtools.Graph(data, use_pygsp=True).W, which produces a 10 x 10 matrix.
The Line and Swiss examples show ds.labels as the 2nd argument to DiffusionCheb(). For Line, I notice that self.labels = np.eye(N), whereas for the Swiss Roll self.labels = np.repeat(np.eye(n_distributions), n_points_per_distribution, axis=0). For the Line example, if there are 100 points, then labels is 100 x 100, which is just an identity matrix. For the Swiss Roll, assuming 5 points per distribution, this looks like:
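A small sketch of that Swiss Roll label layout, using hypothetical sizes (3 distributions instead of 50) so the matrix is short enough to print. It relies only on the np.repeat expression quoted from the notebook.

```python
import numpy as np

n_distributions = 3             # hypothetical; kept small for display
n_points_per_distribution = 5

# Each row of the identity matrix is repeated once per point in that
# distribution, so consecutive blocks of 5 rows share the same one-hot label.
labels = np.repeat(np.eye(n_distributions), n_points_per_distribution, axis=0)

print(labels.shape)             # (15, 3)
print(labels.astype(int))

# Every row has a single 1 (each point belongs to exactly one distribution),
# while every column sums to the number of points in that distribution.
row_sums = labels.sum(axis=1)
col_sums = labels.sum(axis=0)
```

Note that the columns of this matrix sum to 5 rather than 1, so it is a membership indicator and not a set of normalized distributions.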
So as the name implies, it's labeling each point by which distribution it came from. Assuming there are 50 distributions, naturally there will be 50 columns.
For the 10 x 3 data I mentioned, I think I have the adjacency matrix correct, but what about the label matrix? Assuming it's unlabeled data, is it just np.eye(data.shape[0])? Or is it something else, such as np.ones([data.shape[0],1])? If I'm understanding correctly, these two cases would suggest that each point comes from a separate distribution and that each point comes from the same distribution, respectively. What I'm hoping to calculate is the earth mover's distance between data[0] and data[1], etc.
Any help would be appreciated. Thanks!