Subsetting fastMNN integrated data

fxgaldos commented 4 years ago

Hello,

I have successfully integrated multiple datasets using fastMNN. I now need to subset the data to focus the analysis on specific clusters. My questions are as follows:

1. Is it recommended that I rerun the fastMNN integration again on the subsetted data?
2. Would it be appropriate to use the subsetted corrected PCA dimensions from the first round of integration for dimensionality reduction?

In other discussion threads about integration methods such as Seurat's CCA, the advice is not to rerun the integration when an integrated dataset is subsetted. However, Seurat performs its correction in gene expression space, whereas fastMNN corrects in PCA space, so it is unclear to me what the best approach would be.

Any advice would be greatly appreciated!

LTLA commented 4 years ago

1. Is it recommended that I rerun the fastMNN integration again on the subsetted data?

I would say that the best results would be obtained by actually repeating the entire analysis (starting, at least, from HVG detection) on the subsetted data. This gives you an opportunity to identify features that are most relevant to heterogeneity within the subset, in contrast to your previous set of features that are probably driven by heterogeneity across clusters. For example, why have immunoglobulins in your feature set when your subset of interest only contains T cells?
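To make the feature-selection argument concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the gene names, effect sizes and cell counts are made up, and ranking by raw variance is only a stand-in for proper HVG detection (it is not what scran or batchelor actually do). It just shows that the top feature on the full dataset is the one separating the big clusters, while the top feature within the subset is a different gene.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the toy data are reproducible

# Hypothetical gene names, chosen only for this illustration.
GENES = ["immunoglobulin", "t_subtype_marker", "housekeeping"]

def simulate_cell(label):
    """Return made-up log-expression values for one cell of the given label."""
    def noise():
        return random.gauss(0.0, 0.3)
    return {
        # High in B cells only: drives variation ACROSS clusters.
        "immunoglobulin": (5.0 if label == "B" else 0.0) + noise(),
        # Differs between two T cell subtypes: drives variation WITHIN the subset.
        "t_subtype_marker": (1.5 if label == "T_b" else 0.0) + noise(),
        # Flat everywhere apart from noise.
        "housekeeping": 3.0 + noise(),
    }

labels = ["B"] * 200 + ["T_a"] * 100 + ["T_b"] * 100
cells = [(label, simulate_cell(label)) for label in labels]

def top_genes(cell_list, n=1):
    """Rank genes by raw variance across the given cells and return the top n."""
    variances = {g: pvariance([expr[g] for _, expr in cell_list]) for g in GENES}
    return sorted(GENES, key=variances.get, reverse=True)[:n]

# Subset to the T cells only, then redo the feature ranking on the subset.
t_cells = [(label, expr) for label, expr in cells if label != "B"]

print("full dataset:", top_genes(cells))
print("T cell subset:", top_genes(t_cells))
```

On the full data the immunoglobulin gene dominates because it separates B cells from everything else; once only T cells remain it is essentially noise, and the subtype marker takes over.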

If you're planning on re-identifying the HVGs, then you'll need to re-run the integration on the subset as well. This should be similarly beneficial because fastMNN has a better chance of being able to see important variation within the subset. For example, the full dataset analysis might not be able to resolve all those tiny T cell subtypes that immunologists get all excited about, but if we focus in on the subset, some of that variation may become visible and be considered by fastMNN when it does its correction. This should result in more accurate mapping of those subtypes across batches.

Of course, this reasoning assumes that the original integration was reasonably correct. If some other population (e.g., the NK cells?) got merged with your subset of interest, then your analysis on the subset would be stuffed. But then again, the whole analysis would be stuffed, so subclustering can't make it any worse. FYI there are at least a few diagnostics and sanity checks for determining how sensible the correction is; see the relevant chapter of the OSCA book for details.
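As a rough idea of what such a sanity check can look like, one simple option is to tabulate cluster membership by batch and flag clusters that are made up almost entirely of one batch. This is an assumption on my part about a useful check, not a list of the specific diagnostics referred to above, and the labels, the function name and the 5% threshold below are all hypothetical.

```python
from collections import Counter

# Hypothetical per-cell cluster and batch labels after integration.
clusters = ["c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3", "c3"]
batches = ["A", "B", "A", "A", "B", "B", "B", "B", "B", "B"]

# Cross-tabulate: number of cells for each (cluster, batch) pair.
counts = Counter(zip(clusters, batches))
all_batches = sorted(set(batches))

def batch_specific_clusters(min_fraction=0.05):
    """Return clusters in which some batch contributes less than min_fraction of cells."""
    flagged = []
    for cluster in sorted(set(clusters)):
        total = sum(counts[(cluster, b)] for b in all_batches)
        fractions = [counts[(cluster, b)] / total for b in all_batches]
        if min(fractions) < min_fraction:
            flagged.append(cluster)
    return flagged

print(batch_specific_clusters())
```

A flagged cluster is not automatically an error: it could be a population that genuinely exists in only one batch, or it could be a sign of incomplete correction, so it is a prompt to look closer rather than a verdict.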

2. Would it be appropriate to use the subsetted corrected PCA dimensions from the first round of integration for dimensionality reduction?

You can certainly do so. (I assume you're referring to making t-SNEs or UMAPs.) Compared to what I said above, this is probably not the best approach but it is more convenient and that might be a good enough reason to do it, e.g., if your dataset has a million cells and you don't want to spend another few hours waiting for results.
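In terms of mechanics, the convenient route is just row selection: each row of the corrected low-dimensional matrix is a cell, so you keep the rows for the clusters of interest and hand those to the embedding method without re-running the correction. A minimal sketch in plain Python, where the coordinates, cluster labels and variable names are all made up:

```python
# Hypothetical corrected coordinates from the first integration (one row per cell).
corrected = [
    [0.9, -1.2, 0.3],
    [1.1, -0.8, 0.1],
    [-2.0, 0.5, 0.7],
    [-1.8, 0.4, 0.9],
    [0.2, 2.1, -0.6],
]
# Hypothetical cluster assignment for each row of 'corrected'.
clusters = ["T1", "T2", "B", "B", "T1"]

# Clusters to carry forward into the focused analysis.
keep = {"T1", "T2"}

# Indices of the retained cells, and their corrected coordinates, unchanged.
kept_idx = [i for i, cl in enumerate(clusters) if cl in keep]
subset_coords = [corrected[i] for i in kept_idx]

print(kept_idx)
print(len(subset_coords), "cells retained")
```

The retained coordinates are exactly the ones from the first round, which is what makes this fast, and also why it cannot pick up any structure that the original features and correction did not capture.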

LTLA commented 4 years ago

I'm not sure that has anything to do with this issue. smooth_gaussian_kernel() is used in mnnCorrect, not fastMNN. If you want to report a bug, please open a new issue.

LTLA / batchelor

Subsetting fastMNN integrated data #18