MarioniLab / FurtherMNN2018

Code for further development of the mutual nearest neighbours batch correction method, as implemented in the batchelor package.

Could I export the MNN corrected dataset matrix for downstream analysis? #5

Closed sunkh1026 closed 5 years ago

sunkh1026 commented 6 years ago

I have two 10X droplet datasets: "data.1" and "data.2" and have successfully done MNN correction as below.

library(scran)   # provided fastMNN() at the time; it has since moved to batchelor
library(Rtsne)

set.seed(1000)
mnn.out2 <- fastMNN(logcounts(data.1), logcounts(data.2), k=20,
                    approximate=TRUE, cos.norm=TRUE)
t.mnn <- mnn.out2$corrected

# Generating a t-SNE plot (plotFUN and plotFUNb are my own plotting helpers).
set.seed(0)
tsne.mnn <- Rtsne(t.mnn, perplexity = 30)
plotFUN("results/tsne_mnn2_type.png", tsne.mnn$Y, main="Fast MNN", xlab="tSNE 1", ylab="tSNE 2")
plotFUNb("results/tsne_mnn2_batch.png", tsne.mnn$Y, main="Fast MNN", xlab="tSNE 1", ylab="tSNE 2")

My question: could I export this MNN-corrected dataset (could I call it scaled data?) containing the samples from both "data.1" and "data.2" using write.table()? That is, each row would be a gene, each column would be a sample from "data.1" or "data.2", and the values in the table would be the MNN-corrected scaled values (corrected from counts). I hope to use it for downstream analyses, such as some functions in the Seurat or Monocle packages. Thank you for your advice!

LTLA commented 6 years ago

Firstly, there are no longer "genes" in the corrected data. fastMNN does a PCA internally, so what you're getting is basically the corrected PC coordinates. This should be apparent when you look at the dimensions of t.mnn, which only contains 50 dimensions compared to the thousands of genes in the input.

You can of course export the matrix using write.table, but you must remember that you're not dealing with genes any more. If you want to use downstream methods, you should be using those that accept arbitrary matrices. See, for example, buildSNNGraph, which can take an existing matrix of PCs if d=NA (to avoid re-doing the PCA) and transposed=TRUE. Similarly, you can use Rtsne with pca=FALSE.
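A minimal sketch of the above, where a random matrix stands in for t.mnn (the cells-by-dimensions corrected PC matrix from fastMNN()) and the output file name is arbitrary:

```r
library(scran)
library(Rtsne)

# Placeholder for mnn.out2$corrected: 200 cells, 50 corrected dimensions.
t.mnn <- matrix(rnorm(200*50), nrow=200, ncol=50)

# Export the corrected low-dimensional coordinates (rows = cells, columns = PCs).
write.table(t.mnn, file="mnn_corrected_pcs.txt", sep="\t", quote=FALSE, col.names=NA)

# Downstream methods that accept arbitrary matrices:
g <- buildSNNGraph(t.mnn, d=NA, transposed=TRUE)  # d=NA avoids re-doing the PCA
set.seed(0)
tsne.out <- Rtsne(t.mnn, pca=FALSE)               # likewise, no extra PCA for t-SNE
```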

LTLA commented 5 years ago

You can recover "corrected expression values" by multiplying the corrected PCs with the rotation vectors for the genes of interest. The output will not be sparse and the statistical properties of the corrected values are probably unpredictable, so I would only use them for visualisation, e.g., colouring points by or showing violin plots of expression. I wouldn't use them for further gene-based analysis.
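A sketch of that multiplication, using simulated matrices since the exact slot names vary between fastMNN() versions; here "rotation" is assumed to be the genes-by-PCs rotation matrix and "corrected" the cells-by-PCs coordinates:

```r
# Simulated stand-ins for the fastMNN() output components.
n.genes <- 100; n.cells <- 50; n.pcs <- 10
rotation <- matrix(rnorm(n.genes*n.pcs), nrow=n.genes,
                   dimnames=list(paste0("Gene", seq_len(n.genes)), NULL))
corrected <- matrix(rnorm(n.cells*n.pcs), nrow=n.cells)

# Corrected "expression" for a few genes of interest: cells x genes.
genes <- c("Gene1", "Gene2")
corrected.expr <- corrected %*% t(rotation[genes, , drop=FALSE])

dim(corrected.expr)  # 50 cells by 2 genes; for visualisation only
```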

whitleyo commented 5 years ago

Hi, for the purposes of differential expression, perhaps something nonparametric like the Wilcoxon rank-sum test might be appropriate? If I understand correctly, we're assuming that whatever was subtracted out in PCA space by fastMNN is a batch effect, and likewise that any difference in gene expression accounted for by those differences in PCA space is also part of the batch effect.

LTLA commented 5 years ago

The distributional assumptions aren't the main problem. It's more about the difficulty of interpretation.

The most obvious effect is that the values are cosine-normalized, so you can't take the log-fold changes at face value. They simply don't reflect the actual log-fold changes anymore.

However, this is only a symptom of the more subtle (but much more dangerous) effect - that the correction is not obliged to retain the relative differences in expression values between cells. The correction only really cares about the distances between cells, and if it has to massively distort a given gene's expression values to reduce the distances between different batches, it will do it.

To illustrate, consider a dataset with two cell types, A and B. Consider another batch with the same cell types, denoted as A' and B'. Let's say that, for some reason, gene X is expressed in A but not in A', B or B'. Maybe that's due to some difference in how the cells were grown between batches, the exact reason doesn't matter. We then merge the batches together based on the shared cell types, which gives us a nice result where A and A' cells are nicely intermingled - so far so good.

Now, this is where the fun begins. If we corrected the second batch to the first, we must have coerced the expression values of gene X in A' to non-zero values in order to align with those of A, while leaving the expression of B' and B at zero. (Yes, I know fastMNN() works in low-dimensional space, so just imagine it in terms of undoing the rotation to recover corrected expression values.) So, lo and behold, we now have differential expression between A' and B' when it wasn't there before - magic!
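The scenario above can be reduced to a one-dimensional toy (all numbers made up) to show how aligning A' to A manufactures DE between A' and B':

```r
# Batch 1: cell type A expresses gene X; B does not.
x.batch1 <- c(A=5, B=0)

# Batch 2: neither A' nor B' expresses X (say, a growth-condition difference).
x.batch2 <- c(A.prime=0, B.prime=0)

# Aligning A' to A forces A's expression of X onto A'...
shift <- x.batch1[["A"]] - x.batch2[["A.prime"]]
x.batch2.corrected <- c(A.prime=x.batch2[["A.prime"]] + shift,
                        B.prime=x.batch2[["B.prime"]])

# ...creating DE between A' and B' that was never in the raw data.
x.batch2.corrected[["A.prime"]] - x.batch2.corrected[["B.prime"]]  # 5, was 0
```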

But maybe this doesn't matter, right? X was DE between A and B, so who cares if we introduced false DE between A' and B'? Well, it matters when you're trying to interpret the DE genes to identify the clusters. Let's say A contains activated CD8+ T cells, and A' contains their naive counterparts, and X is one of the activation markers. If we did the DE analysis on the corrected values, we would think - incorrectly - that activated cells exist in both batches. This would range from "meh" to "oh crap" depending on how much your conclusions depended on that. For example, if the first batch was drug-treated and the second batch was a control, you might think that the drug has no effect...

However, there's no need to take this risk! Just do the DE analysis on the uncorrected expression values, blocking on the batch. If there was any genuine DE between the identified clusters, it should hold up when you compare expression values within each batch. This is because the within-batch comparison is, by definition, free of batch effects; so if you still get DE there, you can be confident that the effect is not due to the batch effect or the correction procedure's attempts at removing it.
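One way to do this is via the block= argument of scran's findMarkers(); a sketch with a small simulated dataset, where in practice "logc" would be the uncorrected log-expression matrix and the cluster labels would come from the corrected space:

```r
library(scran)
set.seed(42)

# Simulated uncorrected log-expression: 100 genes, 80 cells.
logc <- matrix(rnorm(100*80), nrow=100,
               dimnames=list(paste0("Gene", 1:100), NULL))
cluster <- sample(c("A", "B"), 80, replace=TRUE)   # from clustering on corrected PCs
batch <- sample(c("b1", "b2"), 80, replace=TRUE)

# block= performs the comparisons within each batch and combines the results,
# so any DE that survives cannot be an artefact of the batch effect.
markers <- findMarkers(logc, groups=cluster, block=batch)
head(markers[["A"]])  # top candidate markers for cluster A
```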