constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data
249 stars 34 forks source link

SoupX in single nuclei RNAseq #76

Closed wmacnair closed 3 years ago

wmacnair commented 3 years ago

Hi

I'm trying to use SoupX to remove contamination from single nuclei RNAseq. I've run it (with default settings) on our samples, and I thought I would check the estimated proportions against the sample median mitochondrial proportions (which we expect to be a rough proxy for contamination in snRNAseq, or at least to have some relationship with contamination). Our samples can be quite stressed/degraded, so we observe quite a range of mito values (e.g. between 0.1% to 10% mito reads across samples), and I expected to see a strong correlation between estimated contamination fraction and median mito %.

I do get a correlation, however it's pretty weak: 2.3% contamination at 0.1% mito reads, 3.2% contamination at 10% mito reads (based on a linear model fit to the contamination estimates and sample median mito props). Is this expected? I was really surprised that two orders of magnitude of mito proportion difference only made a difference of 1% contamination.

Could I be doing something wrong? Perhaps there are particular parameter values that are a better fit for snRNAseq?

Cheers Will

(Also, thanks for all the work on the package. In particular, the vignette is excellent - detailed while being clear and readable.).

constantAmateur commented 3 years ago

How to use SoupX on nuclei is something I am in two minds about. On the one hand, as you suggest, mitochondrial reads should be a good proxy for contamination as they are purely cytoplasmic. If you are happy to accept this as "the truth" then the solution is simple: manually set the contamination fraction to the MT fraction (or use the estimateNonExpressingCells and calculateContaminationFraction functions using MT genes as input). The default approach of using autoEstCont is only necessary because usually there's no a priori set of genes that can be used to estimate the contamination. So if MT is a reliable way to measure contamination then you should always just use that.

On the other hand, while it's true that MT expression is a type of contamination, it's not clear to me that it's well correlated with the broader, transcriptome wide, contamination that SoupX is trying to remove. Your own experience with a low level of correlation between the two ways of estimating the contamination (using MT fraction and autoEstCont) would seem to suggest that the correlation is far from perfect. Your data shows less of a correlation between the SoupX estimate and MT frac than I would have guessed based on other single nuclei data. However, almost all my work uses single cells so my expectations are not as thoroughly calibrated from personal experience with nuclei.

It is possible you are making some mistake in running the package, but if you're just doing the standard autoEstCont from 10X data mapped with cellranger (which is what I'm assuming you're doing), that seems unlikely. If you just want a quick practical solution, set everything to the maximum contamination that you think is present (i.e., max(MT frac, estimate from autoEstCont)). But if you want to dig a bit more into this, I'd start by adding an indicator of the MT fraction to the plots produced by autoEstCont and seeing what that shows.

wmacnair commented 3 years ago

Hi, thanks for your thoughts

I have had similar thoughts / doubts about the extent to which we can use mitochondrial reads as a proxy for contamination in snRNAseq. Thinking mechanistically, you can well imagine that the processes that lead to contamination by soup and by mitochondria could be different: perhaps mitochondria are stickier, and therefore come attached to other things with characteristic transcriptional profiles.

Since downstream I'm mainly working with pseudobulk data, what I ended up doing was using the approach recommended in OSCA (here), based on DropletUtils::maximumAmbience. This doesn't make any assumptions about which genes should be present/absent in the soup, and just finds the maximum proportion of the reads in that cluster that could plausibly be explained by soup. The estimated contamination proportions actually had a very strong correlation with mitochondrial proportion (also with # cells, library sizes etc). I was a bit surprised by how strong the relationship was and I wondered if it might be technical (e.g. samples with fewer cells have less information therefore higher plausible max proportions). However, repeating the analysis with random soup vectors gave orders of magnitude less contamination, so I think it's real in some sense.

(I also saw that the estimated contamination proportion varied quite a bit across celltypes within the same sample, which I guess is what not what SoupX assumes.)

One other observation in snRNAseq is that the ratio of spliced to unspliced reads is very highly correlated with mito %, so could be another marker of contamination (I think they note this in this DIEM paper).

Overall it seems like it could be an interesting topic to look at. Although by now you're probably happily working on something new...! ;)

Thanks again, Will