Distance matrix for t-sne (Rtsne)

MarioniLab / MNN2017

Code for the MNN manuscript figures

Distance matrix for t-sne (Rtsne) #7

Closed: yeuyeuh closed this issue 6 years ago

yeuyeuh commented 6 years ago

Hi,

Thanks for providing the R code for your manuscript. I'm using the latest version of the scran package (version 1.7.11).

In PancreasCorrectionComparison.R, you used two different methods to generate the t-SNE of the corrected matrix:

- the conventional method (gene-cell matrix as input, with a PCA calculation)
- a distance matrix as input

In the figures of your manuscript, do you plot the "distance matrix t-SNE"?

For my data, I provide a UMI count matrix to mnnCorrect() and I use cos.norm.in=TRUE, cos.norm.out=TRUE. The "distance matrix t-SNE" of the corrected data looks pretty good (the batches are merged together), but the "conventional t-SNE" doesn't merge the different batches. However, when I run a PCA on the corrected matrix, the batch effect seems to be removed.

Can you explain why there is such a difference between the two t-SNE methods? Which one should we choose?

Thanks, Inaki Cervera-Marzal

LalehHaghverdi commented 6 years ago

Hi Inaki,

In our manuscript we plot the "distance matrix t-SNE". In a conventional t-SNE run (Rtsne library), a PCA step precedes the actual t-SNE algorithm. It is meant to amplify the "interesting" signal in the data, on the assumption that this signal lies in the first (let's say 30) PCs. For batch-corrected data, however, these first 30 PCs capture a residual batch effect that has not been completely removed (even though the plot of PC1 against PC2 still looks fine). Using the distance matrix t-SNE skips the preceding PCA step and thus avoids such unwanted partial amplification of the data.

yeuyeuh commented 6 years ago

Hi Laleh,

Thanks for your clear answer.

If I understand it right: you choose to use the "distance matrix t-SNE" because you are working on a cosine-normalized matrix, so when you compute the Euclidean distance, the resulting distance matrix (cosine distances) is robust to the residual batch effect. But when you are working on a corrected matrix normalized on the log scale, you don't use the "distance matrix t-SNE" and use the "conventional t-SNE" instead. Is that right?
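For readers following the Euclidean-versus-cosine point in this question, here is a minimal sketch in plain Python (not the repository's R code). It assumes that "cosine normalization" means scaling each cell's expression vector to unit length; under that assumption the squared Euclidean distance between two normalized cells equals 2 * (1 - cosine similarity). The vectors and helper names are made up for illustration.

```python
import math

def cos_norm(v):
    """Scale a vector to unit L2 norm (assumed meaning of cosine normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical expression vectors for two cells (three genes each).
cell_a = [3.0, 0.0, 4.0]
cell_b = [1.0, 2.0, 2.0]

# Squared Euclidean distance after cosine normalization ...
sq_euclid = math.dist(cos_norm(cell_a), cos_norm(cell_b)) ** 2
# ... matches 2 * (1 - cosine similarity) of the raw vectors.
from_cosine = 2.0 * (1.0 - cosine_similarity(cell_a, cell_b))

# Rescaling a cell (e.g. deeper sequencing) leaves the distance unchanged.
cell_a_scaled = [10.0 * x for x in cell_a]
sq_euclid_scaled = math.dist(cos_norm(cell_a_scaled), cos_norm(cell_b)) ** 2

print(sq_euclid, from_cosine, sq_euclid_scaled)
```

This only shows how the two distances relate; it says nothing about which t-SNE route the authors used, which is addressed in the reply that follows.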

LalehHaghverdi commented 6 years ago

No, we never use the "conventional t-SNE". It is only included in the code for comparison and out of curiosity; it is not used anywhere in the manuscript. With the "distance matrix t-SNE" we compute the distance matrix on the whole data, whereas the "conventional t-SNE" computes the distance matrix on only the first 30 PCs.
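The contrast between the two distance matrices can be sketched in plain Python (again, not the repository's R code). The toy data, the helper names, and the choice of a single retained PC instead of 30 are all assumptions made to keep the example small. It computes pairwise distances once on all features and once on the scores of the leading principal component only.

```python
import math

def pairwise_dist(rows):
    """Euclidean distance matrix between all pairs of rows (cells)."""
    return [[math.dist(a, b) for b in rows] for a in rows]

def leading_pc(rows, iters=200):
    """Unit vector along the first principal axis, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    # Scatter matrix (the covariance matrix up to a constant factor).
    scatter = [[sum(r[i] * r[j] for r in centered) for j in range(d)]
               for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(scatter[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: 4 cells x 3 features (not from the manuscript).
cells = [
    [0.0, 0.0, 0.1],
    [0.2, 3.0, 0.0],
    [5.0, 0.1, 0.0],
    [5.1, 3.1, 0.1],
]

# Route 1 ("distance matrix"): distances computed on all features.
d_full = pairwise_dist(cells)

# Route 2 ("conventional"): keep only the leading PC, then compute distances.
axis = leading_pc(cells)
scores = [[sum(x * a for x, a in zip(cell, axis))] for cell in cells]
d_pc = pairwise_dist(scores)

print("cells 0 and 1, all features:", d_full[0][1])
print("cells 0 and 1, leading PC only:", d_pc[0][1])
```

Projecting onto a subset of PCs can only shrink distances, and it shrinks them by different amounts for different pairs of cells, so the neighbourhoods that t-SNE sees can differ between the two routes. In Rtsne, the two routes correspond to the `pca` and `is_distance` arguments.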

yeuyeuh commented 6 years ago

OK, thanks for the information.

LalehHaghverdi commented 6 years ago

Sure, thanks for the question.