claczny / VizBin

Repository of our application for human-augmented binning

Display learning rate parameter #38

Closed jolespin closed 7 years ago

jolespin commented 7 years ago

Is there a way to adjust the learning rate parameter? Also, what is the default learning rate parameter? I checked the publication but I didn't see it in there.

claczny commented 7 years ago

@jolespin VizBin uses the default learning rate of BH-SNE. However, I did not find any details about this parameter in the BH-SNE preprint @ https://arxiv.org/pdf/1301.3342v2.pdf. Thus, I suspect that BH-SNE follows t-SNE in this respect, and the t-SNE paper (https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf) says:

The learning rate η is initially set to 100 and it is updated after every iteration by means of the adaptive learning rate scheme described by Jacobs (1988).
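For reference, that adaptive scheme boils down to per-parameter "gains" combined with momentum. Paraphrased from van der Maaten's reference Python implementation, a single update step looks roughly like the sketch below; treat it as an illustration, not the exact VizBin backend code:

```python
import numpy as np

def tsne_update_step(Y, dY, iY, gains, eta=100.0, momentum=0.8, min_gain=0.01):
    """One gradient-descent step with Jacobs-style adaptive gains (sketch)."""
    # Grow the gain where the gradient direction flipped w.r.t. the last update,
    # shrink it where the direction stayed the same.
    gains = (gains + 0.2) * ((dY > 0.0) != (iY > 0.0)) \
          + (gains * 0.8) * ((dY > 0.0) == (iY > 0.0))
    gains[gains < min_gain] = min_gain
    # Momentum update, scaled per parameter by the gains and the base learning rate eta.
    iY = momentum * iY - eta * (gains * dY)
    return Y + iY, iY, gains
```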

Does this answer your question? Kindly let me know should you have further questions and/or comments.

Best,

Cedric

jolespin commented 7 years ago

Hey @claczny, that definitely gives some insight! Thanks for the references, I will check them out ASAP. I'm new to this type of informatics, but it seems really useful after seeing what VizBin can do. I am trying to learn how to use BH-SNE and thought that reproducing VizBin plots would be a good sanity check, since VizBin does such an amazing job binning things out.

I've been trying to implement it in Python using scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). In http://www.nature.com/articles/srep04516 I found that the following default parameters were used for BH-SNE: a perplexity of 30, initial dimensions of 50, and a theta of 0.5. If the learning_rate parameter is dynamic, then that may be the source of the difference. I tried looking at the source code of VizBin for the default settings, but Java syntax is vastly different from Python's.
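For reference, this is roughly what I've been running (the variable names and the data loading are just placeholders for my actual pipeline):

```python
import numpy as np
from sklearn.manifold import TSNE

# X: n_contigs x n_features matrix of k-mer frequencies (placeholder loading).
X = np.loadtxt("kmer_profiles.tsv", delimiter="\t")

# Matching the reported BH-SNE defaults where scikit-learn exposes them:
# perplexity 30 and theta 0.5 (called `angle` here). I did not find an
# equivalent of the "initial dimensions of 50" parameter.
embedding = TSNE(n_components=2, perplexity=30, angle=0.5,
                 random_state=0).fit_transform(X)
```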
Thanks again! Your tool is the gold standard for binning. Best, -Josh

[Screenshot attached (2016-08-09): the VizBin embedding next to the scikit-learn t-SNE embedding]
claczny commented 7 years ago

@jolespin Thank you very much for the kind words about VizBin; we are very glad that you like it so much!

Coincidentally, I am currently also using scikit-learn and the BH-/t-SNE implementation it offers, in a different context. I realized that there appear to be some notable differences from the original implementation by van der Maaten as well as from other implementations I have seen, e.g., in R. The learning rate is one of them. Another is that most of the other implementations I have seen have the initial PCA step (by default, a reduction to 50D) directly integrated. That does not seem to be the case for manifold.TSNE:

It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high.

Depending on how you have approached this in your implementation, this may have a more or less pronounced effect on the results and might explain the differences.
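Concretely, a minimal sketch of doing that prereduction explicitly before handing the data to manifold.TSNE would be something like the following (the 50 dimensions and the other values are just the usual defaults, not necessarily exactly what VizBin does):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# X: the feature matrix (e.g., k-mer frequencies), as in your snippet above.
# Reduce to 50 dimensions first, as the original BH-SNE implementations do by
# default, then run Barnes-Hut t-SNE on the reduced matrix.
X_reduced = PCA(n_components=50).fit_transform(X)
embedding = TSNE(n_components=2, perplexity=30, angle=0.5,
                 random_state=0).fit_transform(X_reduced)
```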

Moreover, VizBin uses an adapted version of the BH-SNE code from van der Maaten as its backend. This means that the actual embedding is not done in Java but in C/C++. This code is cross-compiled to enable running VizBin on Windows, Mac, and Linux. We tried hard to get the results to be identical across platforms, and I have to admit that we did not fully achieve that goal. This is, again, partly due to the initial PCA: in order to make it run quickly, we decided to use the MTJ library (https://github.com/fommil/matrix-toolkits-java), which is itself based on BLAS and LAPACK. However, these need to be available on the different operating systems, and we found that the results vary slightly depending on the underlying system libraries.

Another point to consider in all this is that there is some randomness in the initialization of BH-SNE (it's a non-convex optimization problem, after all). For the sake of reproducibility, we keep the seed fixed by default, as the results depend on the seed value. This, as well as how the random numbers are generated, is likely to lead to differences between implementations.
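To illustrate: the low-dimensional map is initialized from random noise, so which RNG is used and how it is seeded already changes the starting point. The 1e-4 scale below is what I recall from the BH-SNE C++ code, so take it as an assumption:

```python
import numpy as np

n_points, out_dims = 1000, 2                    # example sizes
rng = np.random.RandomState(0)                  # fixed seed, as VizBin does by default
Y_init = rng.randn(n_points, out_dims) * 1e-4   # small Gaussian initialization (assumed scale)
# A different RNG or seed gives a different starting map and hence, in general,
# a (slightly) different final embedding.
```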

Long story short, I would suggest that you create an independent sanity check and rely on VizBin qualitatively rather than quantitatively, as it is very unlikely that a completely different implementation will produce the exact same results.

Regarding the plots you have attached, do they depict the same data? If so, ignoring differences in shape and position, I am surprised to see roughly four clusters in the VizBin plot but pretty much two in your implementation. Moreover, if it is the same data, is it simulated or real-world data?

Best,

Cedric

P.S. I am closing this issue as it appears to be solved. This does not affect the ongoing discussion; I am more than happy to keep discussing this with you.

jolespin commented 7 years ago

@claczny Thanks for the explanation! Interesting that the scikit-learn implementation differs from the R versions and from the original. I noticed that the PCA step wasn't included, but, luckily, a few tutorials I saw online (e.g. http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/AlexanderFabisch/1a0c648de22eff4a2a3e/raw/59d5bc5ed8f8bfd9ff1f7faa749d1b095aa97d5a/t-SNE.ipynb) pointed it out.

Regarding the random_state, I set it to 0, but I doubt the random number generation is identical across platforms and implementations. The plots do depict the same data, and it is real-world data. As for the sanity check in VizBin, the annotations for the bins have been pretty accurate, which is why I was interested in the methods!

It looks like the van der Maaten method works a lot better at separating the samples. From your experience, do you think the biggest difference is the adaptive learning_rate?

Thanks again, Josh

claczny commented 7 years ago

@jolespin

From your experience, do you think the biggest difference is the adaptive learning_rate?

Honestly, I am not sure. One thing I have not yet fully gotten into is why the scikit-learn implementation typically terminates relatively early, whereas other implementations typically run for the full 1,000 iterations. I do not want to say that running that many iterations is necessary in every case, but I find the termination somewhat "early" in scikit-learn. What I would recommend you try is the Rtsne implementation in R: basically prepare the data in Python, then feed it into R to see what it returns and whether that is closer to the results from VizBin (see the sketch below). That should be quite fast to do.
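Something along these lines; the file name is arbitrary, and on the R side you would simply read the table into a matrix and hand it to Rtsne with matching perplexity/theta:

```python
import numpy as np

# X_reduced: the (optionally PCA-reduced) feature matrix from the snippets above.
# Write it to a plain TSV so it can be read from R and embedded there with the
# Rtsne package.
np.savetxt("features_for_rtsne.tsv", X_reduced, delimiter="\t")
# In R, roughly: read the TSV into a matrix, call Rtsne() on it with
# perplexity = 30 and theta = 0.5, and compare the returned coordinates with
# the VizBin and scikit-learn embeddings.
```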

I have to admit that this "bow"-like shape in the VizBin plot is quite fascinating. I have seen this kind of structure from time to time in various metagenomic datasets, but I never really had the time to look into them more closely, e.g., whether these are viral sequences or what is "up" with them :)

Hope this helps.

Best,

Cedric

jolespin commented 7 years ago

I've been wondering the same thing! I tried mapping the contigs to 16S sequences: the bins in the top left, along with the bottom bin, map to bacteria, but the crescent-shaped one didn't produce any hits in the 16S database. I wonder what type of k-mer variation would produce something like that or, for that matter, what kind of variation in general could produce a curve like that in this space.

Regarding the sklearn t-SNE implementation, these could be the culprits:

n_iter_without_progress : int, optional (default: 30)
Maximum number of iterations without progress before we abort the optimization. New in version 0.17: parameter n_iter_without_progress to control stopping criteria.

min_grad_norm : float, optional (default: 1E-7)
If the gradient norm is below this threshold, the optimization will be aborted.

metric : string or callable, optional
The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. The default is “euclidean”, which is interpreted as squared euclidean distance.

I hadn't noticed that there was a metric parameter in there. I could look into what the original uses for this.
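If it is the early stopping, something like this should force scikit-learn to behave more like a fixed 1,000-iteration run (parameter names may differ slightly between scikit-learn versions, so treat this as a sketch):

```python
from sklearn.manifold import TSNE

# X_reduced: the PCA-reduced feature matrix from the earlier snippets.
tsne = TSNE(
    n_components=2,
    perplexity=30,
    angle=0.5,
    n_iter=1000,                   # run the full schedule
    n_iter_without_progress=1000,  # effectively disable the "no progress" abort
    min_grad_norm=0.0,             # effectively disable the gradient-norm abort
    metric="euclidean",            # the documented default
    random_state=0,
)
embedding = tsne.fit_transform(X_reduced)
```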

jolespin commented 7 years ago

Hey, about the horseshoe effect in the k-mers: http://stats.stackexchange.com/questions/158552/what-is-the-horseshoe-effect-and-or-the-arch-effect-in-pca-correspondence

jolespin commented 7 years ago

So your implementation is insanely fast. I was checking out the source code and noticed that you added parallelization. Is there a way to use the t-SNE hack you've implemented, either on the command line or in any language where you can give it a dataframe/TSV/CSV? I tried running the sklearn implementation on this dataset with ~40k points and my computer was not happy. I found this, but it causes my kernel to die sometimes: https://github.com/DmitryUlyanov/Multicore-TSNE. Any advice would be greatly appreciated. Thanks!
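For completeness, this is roughly how I've been calling it, assuming its scikit-learn-like interface; PCA-reducing to 50 dimensions first should also take some pressure off with the ~40k points:

```python
import numpy as np
from sklearn.decomposition import PCA
from MulticoreTSNE import MulticoreTSNE as TSNE  # https://github.com/DmitryUlyanov/Multicore-TSNE

# X: the k-mer feature matrix, as in the earlier snippets.
X_50 = PCA(n_components=50).fit_transform(X)
X_50 = np.ascontiguousarray(X_50, dtype=np.float64)  # contiguous float64, to be on the safe side
embedding = TSNE(n_jobs=4, perplexity=30, angle=0.5,
                 random_state=0).fit_transform(X_50)
```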

claczny commented 7 years ago

The parallelization is the work of @tomekster (https://www.researchgate.net/profile/Tomasz_Sternal).

VizBin wraps the compiled binary of the parallelized t-SNE, i.e., similar to the original implementation by Laurens, it writes an input file containing some control parameters (number of points, dimensionality, threads to use, etc.) in the first lines, followed by the actual (PCA-reduced) feature matrix, and then runs the parallelized t-SNE version on that. An output file is then returned and read back into VizBin for visualization etc.

The easiest way to get the compiled binary for your platform is probably by downloading the VizBin binary, executing it, and then looking in your $HOME/.vizbin folder for a pbh_tsne_* binary. Alternatively, you can also look into the files under https://github.com/claczny/VizBin/tree/devel/src/backend/bh_tsne for more details.
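If you want to drive that binary directly from Python, the overall pattern is simply: write the input file in the expected layout, run the binary, read the result back. A rough sketch is below; write_pbh_tsne_input and read_pbh_tsne_output are hypothetical helpers that you would implement against the file layout defined in the backend sources linked above:

```python
import subprocess

def write_pbh_tsne_input(workdir, X, perplexity, theta, threads):
    # Hypothetical helper: serialize the control parameters and the feature
    # matrix in the layout the pbh_tsne binary expects (see the backend sources).
    raise NotImplementedError

def read_pbh_tsne_output(workdir):
    # Hypothetical helper: parse the result file the binary writes back.
    raise NotImplementedError

def run_pbh_tsne(binary_path, X_reduced, workdir, perplexity=30, theta=0.5, threads=4):
    # The binary reads its input from, and writes its result to, the working directory.
    write_pbh_tsne_input(workdir, X_reduced, perplexity, theta, threads)
    subprocess.run([binary_path], cwd=workdir, check=True)
    return read_pbh_tsne_output(workdir)
```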

N.B. The speedup is far from linear, but it is nice to have nevertheless, IMO :)

Hope that helps.

Best,

Cedric