KlugerLab / FIt-SNE

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

Precomputed distance matrix (take 2) #82

Closed sg-s closed 4 years ago

sg-s commented 4 years ago

Hi, I had previously asked about using FIt-SNE with a precomputed distance matrix and I was pointed to an earlier version which did support it. Thanks for that, but I now want something else.

I see that there is now a "degrees of freedom" parameter that makes the heavy tails heavier. I'm interested in using it if possible, and I wonder whether I can do so with a precomputed distance matrix.

Thanks for any help!

dkobak commented 4 years ago

This is definitely not implemented in FIt-SNE in any branch.

Out of curiosity, what is your use case? Why do you have a precomputed distance matrix?

sg-s commented 4 years ago

Hi @dkobak, thanks for your response. This query was motivated by your recent paper.

According to @linqiaozhi, an old version of FIt-SNE had this. Here's the link to the code, and the link to discussion about this

I'm trying to use t-SNE to embed a large dataset where I have the pairwise distance matrix and nothing else. I know this is weird: I don't have the "primary data" itself, not because I lack access to it, but because in my use case it is fundamentally hard to define. What I can define is a function that maps pairs of "data" onto a distance, so I can create a distance matrix. My goal is to use this distance matrix to reveal structure in my high-dimensional data set.

dkobak commented 4 years ago

So your N must be relatively small if you can hold the whole NxN distance matrix in memory? Around 10k max I guess?

sg-s commented 4 years ago

Yes, my N right now is ~15K (I plan on going bigger, and I'm essentially limited by RAM)

linqiaozhi commented 4 years ago

Sorry I have not had time to implement this.

The way it should be implemented is based on a sparse matrix, so RAM will not be a limitation. That is, the wrapper would just write the three vectors defining the sparse matrix to files, and the binary would then read those vectors. Here is an example of writing the vectors in MATLAB.
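
For reference, the same three-vector write can be sketched in Python, where scipy's CSR arrays map directly onto what the binary expects. This is only an illustration (not tested against the FIt-SNE binary); the file names just mirror the ones used later in this thread:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy sparse P; in practice this would be your N-by-N affinity matrix.
P = csr_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.2, 0.0, 0.3],
                         [0.0, 0.4, 0.0]]))

# scipy's CSR arrays correspond directly to the three vectors FIt-SNE reads:
# data -> non-zero values, indices -> column indices, indptr -> row offsets.
P.data.astype(np.float64).tofile("P_val.dat")
P.indices.astype(np.int32).tofile("P_col.dat")
P.indptr.astype(np.int32).tofile("P_row.dat")
```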

I'm afraid I'm just not able to work on this for now. Maybe in January I'd have more time. @sg-s, if you want to try your hand at it, then I'd be more than happy to help you if you get stuck.

sg-s commented 4 years ago

Hi @linqiaozhi, sure, I can take a crack at it. I can write the data into a sparse matrix, no problem. Can you point me to the part of your code that reads it? I can then probably put the pieces together and send you a PR.

Thanks for the hint!

dkobak commented 4 years ago

If you only have a black-box function to compute pairwise distances, how are you going to construct the sparse kNN matrix without first computing the full dense one?

sg-s commented 4 years ago

@dkobak I don't think it's possible to avoid computing the full distance matrix (which I have done). If I understand what's being said here, the sparsity in the matrix comes from only considering the k nearest neighbours in the full matrix, right?

linqiaozhi commented 4 years ago

Yes, it is possible to compute an (exact or approximate) sparse kNN matrix without computing all the pairwise distances. In fact, that's exactly what FIt-SNE (and BH t-SNE for that matter) does by default! Essentially any fast KNN method does that. FIt-SNE uses VP trees for exact KNN, and it only requires that the "distance function" be a metric. @dkobak , did I misunderstand your question?

The feature that we are missing is for users to provide their own sparse distance matrix and embed it with t-SNE. A user could generate this from their own nearest neighbor algorithm, or it could just be a graph that they want to embed (that does not correspond to nearest neighbors at all).

@sg-s One question is whether we want users to be able to pass a distance matrix or a fully normalized P matrix. That is, do you want to provide the distances and let FIt-SNE take care of applying the Gaussian kernel and normalizations? Or do you want to be able to pass the affinities themselves?

linqiaozhi commented 4 years ago

@sg-s thanks for being willing to take a stab at it!

As I mentioned above, there are two options:

Option 1: Pass the affinities. That is, trust the user to apply the kernel converting distances to affinities. This is nice because I bet there are lots of affinities more interesting than the perplexity-calibrated Gaussian we use!

Here is the code that loads in the P matrix. As you can see, it is three different vectors: this is the Compressed Sparse Row format.

It's easiest to just see it in an example (and take a look at the Wikipedia page linked above): col_P[row_P[n] + m] is the index of the m-th node that the n-th node connects to, and val_P[row_P[n] + m] is the value of the affinity between the n-th node and that neighbor.
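
In Python terms (using scipy's csr_matrix, whose indptr/indices/data arrays play the roles of row_P/col_P/val_P), that indexing works like this:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Node 0 connects to node 2; node 1 connects to nodes 0 and 2; node 2 to node 0.
P = csr_matrix(np.array([[0.0, 0.0, 0.7],
                         [0.1, 0.0, 0.6],
                         [0.3, 0.0, 0.0]]))
row_P, col_P, val_P = P.indptr, P.indices, P.data

n, m = 1, 1                      # the second neighbour (m = 1) of node n = 1
print(col_P[row_P[n] + m])       # index of that neighbour -> 2
print(val_P[row_P[n] + m])       # affinity between them -> 0.6
```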

So, this code is already there. If you want to pass the affinities, try setting load_affinities to true and test whether it works!

Option 2: Pass the distances, and let the algorithm apply the kernel.

You'll probably want the wrapper to write three vectors corresponding to your sparse distance matrix just like above (maybe call them val_D, col_D, row_D).

Then you should probably make a new computeGaussianPerplexity() function. See how this one computes the nearest neighbors using ANNOY and then computes the affinities from them? You'll want to make a new function that does essentially the same thing, but instead of computing the nearest neighbors here, it should just look them up in the loaded val_D, col_D, and row_D (where each of these will be similar to the three vectors defined above but they have distances).

Is that clear? Please feel free to ask any questions...I will be more responsive now that the holidays are over.

dkobak commented 4 years ago

It should be much easier to implement this in openTSNE, which is designed to be easily extensible and modifiable. It uses pynndescent for kNN search and supports a lot of metrics out of the box, and one can easily plug your own distance matrix or affinity matrix into openTSNE. In fact, I don't know if pynndescent supports a user-defined metric, but it very well might, which would let you get the sparse kNN matrix directly.

I am not sure it's worth implementing any of that in the C++ code here.

(The only thing openTSNE currently does not support is the variable degree of freedom; hopefully it will eventually get implemented there too...)

sg-s commented 4 years ago

OK, I'll take a crack at it. Will look at how fast_tsne.cpp works and will report back. Thanks!

linqiaozhi commented 4 years ago

@dkobak, yes, it might be easier to implement in openTSNE. While it may be a little slower, that is probably not prohibitive for most users. However, openTSNE is pure Python; it's possible that @sg-s also likes having an R or MATLAB interface like we have here.

@sg-s , sounds great! Just to be clear, it looks like you are going for "Option 1" I mentioned above, correct? That is, the user will provide the affinities, which will be plugged directly into the P matrix by fast_tsne.

sg-s commented 4 years ago

Hi @linqiaozhi and @dkobak, I have a workaround that achieves what I want. It turns out I did not have to modify any of your code; the functionality I was looking for already exists in your codebase.

Here are the steps to embed a distance matrix using FIt-SNE:

  1. Convert distance matrix to affinity matrix (I used van der Maaten's code and saved P at this point after symmetrizing and normalizing)
  2. Write out P into the format FIt-SNE requires using the function write_P (see below)
  3. Run fast_tsne (the wrapper for FIt-SNE) with opts.load_affinities = 'load'. The hack here is that you still have to specify the "raw data"; I just gave it zeros of the right size (same N as the distance matrix)

The results seem to make sense, and the only thing left is this:

How do I sparsify my affinity matrix? Is there some rigorous way to do this other than throwing out all but the K highest values per row?
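
For what it's worth, the simple top-K approach would look roughly like this (a Python sketch; the sparsify_topk helper is hypothetical, and whether this is "rigorous" is exactly the open question):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsify_topk(A, K):
    # Keep only the K largest off-diagonal entries in each row of a dense
    # affinity matrix, then symmetrize and renormalize as t-SNE expects.
    A = np.array(A, dtype=float)
    np.fill_diagonal(A, 0.0)                  # a point is not its own neighbour
    N = A.shape[0]
    rows, cols, vals = [], [], []
    for i in range(N):
        top = np.argpartition(A[i], -K)[-K:]  # columns of the K largest values
        for j in top:
            if A[i, j] > 0:
                rows.append(i); cols.append(j); vals.append(A[i, j])
    P = csr_matrix((vals, (rows, cols)), shape=(N, N))
    P = (P + P.T) / 2                         # symmetrize
    return P / P.sum()                        # entries sum to one

rng = np.random.default_rng(0)
P = sparsify_topk(rng.random((20, 20)), K=3)
```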

function write_P(P)
% WRITE_P  Write a sparse matrix P to the three files that FIt-SNE can read
% in lieu of computing P from nearest neighbors:
%   P_val.dat: the K non-zero values of the sparse N-by-N matrix P (double)
%   P_col.dat: K unsigned ints, the column index of each non-zero value
%   P_row.dat: N+1 unsigned ints, the offset into the two vectors above at
%              which each row "starts" (Compressed Sparse Row format)

delete('P_col.dat');
delete('P_row.dat');
delete('P_val.dat');

[N, ~] = size(P);

% Row pointers: cumulative count of non-zeros per row, with a leading zero.
row_P = zeros(N+1, 1);
row_P(2:end) = full(cumsum(sum(P ~= 0, 2)));

% find() walks a matrix in column-major order, so transpose P to emit the
% non-zeros in row-major order, as CSR requires. The row indices of P' are
% the column indices of P (shifted to 0-based for the C++ reader).
[I, ~, V] = find(P');
val_P = V;
col_P = I - 1;

h = fopen('P_val.dat', 'w', 'n');
fwrite(h, val_P, 'double');
fclose(h);

h = fopen('P_col.dat', 'w', 'n');
fwrite(h, col_P, 'integer*4');
fclose(h);

h = fopen('P_row.dat', 'w', 'n');
fwrite(h, row_P, 'integer*4');
fclose(h);

end

dkobak commented 4 years ago

The standard way to compute sparse similarities in t-SNE, given a desired perplexity P, is to find the 3P nearest neighbors of each point i, use those 3P distances to search for a kernel width (sigma) that yields perplexity P, convert the 3P distances into affinities, and set the other N-3P values to zero. This heuristic was suggested in the Barnes-Hut t-SNE paper, and everybody has been using it since.
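
That heuristic can be sketched as follows (a simplified Python illustration of the bandwidth search, not FIt-SNE's actual C++ code; the function name is mine):

```python
import numpy as np

def affinities_from_knn_distances(d, perplexity, tol=1e-5, max_iter=200):
    # Binary-search beta = 1/(2*sigma^2) so that the conditional Gaussian
    # distribution over these neighbour distances has entropy log(perplexity).
    target = np.log(perplexity)
    beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
    p = np.full_like(d, 1.0 / len(d))
    for _ in range(max_iter):
        p = np.exp(-beta * d**2)
        p /= p.sum()
        entropy = -(p * np.log(p)).sum()
        if abs(entropy - target) < tol:
            break
        if entropy > target:               # kernel too wide: increase beta
            beta_lo = beta
            beta = 2 * beta if np.isinf(beta_hi) else (beta_lo + beta_hi) / 2
        else:                              # kernel too narrow: decrease beta
            beta_hi = beta
            beta = (beta_lo + beta_hi) / 2
    return p                               # affinities over the 3P neighbours

# Example: perplexity 10, so calibrate over the 3 * 10 nearest distances.
d = np.sort(np.random.default_rng(1).random(30))
p = affinities_from_knn_distances(d, perplexity=10)
```

The remaining N-3P entries of that row stay zero, which is where the sparsity comes from.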

sg-s commented 4 years ago

Thanks, that's clear! Closing this for now.