clementfarabet / manifold

A package to manipulate manifolds.
141 stars 31 forks source link

Own pairwise distances #9

Open mlajtos opened 9 years ago

mlajtos commented 9 years ago

Would it be possible to add function which computes Barnes-Hut SNE embedings from a precomputed pairwise distance matrix?

lvdmaaten commented 9 years ago

When you say "pairwise distance matrix", do you mean a full NxN matrix? That does not make much sense to me: the whole point of Barnes-Hut SNE is that it is O(N log N) in computation and O(N) in memory, whereas a full pairwise distance matrix uses O(N^2) memory.

If you mean a sparse adjacency matrix that is O(|E|) with E the set of edges in the graph: yes, that would be possible and relatively straightforward to achieve.

mlajtos commented 9 years ago

Thank you for your reply.

Yes, I have a sparse non-symetric NxN matrix (N~=10E4) where each element represents similarity (scaled to <0, 1>) between all the pairs. So plugging this matrix would therefore skip all SNE magic and only performed embedding as force-directed layout algorithms do?

I tried to use this NxN matrix directly into the BH SNE, but performing PCA on this matrix is taking and an eternity and awful amount of memory. I was able to do so only with N~=5000 within reasonable time. Disabling the PCA in tsne.lua won't help – there was an error on line 107: local data_char = ffi.new('char[?]', nchars) as it could not determine the size of the array or it was too big.

Any hints how to modify source code (or possibly data) to get something out of it?

lvdmaaten commented 9 years ago

https://github.com/clementfarabet/manifold/pull/4/files#diff-febf6147a17c3712ede775e407101a27R114

This is where the similarity matrix P is constructed based on the input NxD matrix X. The similarity matrix is stored in row-compressed sparse format: row_P is a N+1 int array containing the row indices, col_P is a int NxK array containing the column indices, and val_P a double NxK array containing the similarity values. The call to computeGaussianPerplexitySparse() here should thus be replaced by a call to a function that copies the data from your sparse similarity matrix into this row-compressed sparse format.

This would imply making some changes in how the function run() is called by the FFI wrapper: you would need to adapt it to reflect the changes in your input format. Disable the PCA step altogether, but let run() and work directly on your sparse similarity matrix.