constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

Spurious DEGs introduced by SoupX? #54

Closed miscellaneousj closed 3 years ago

miscellaneousj commented 4 years ago

I introduced SoupX into my pipeline after finding that certain genes were differentially expressed between conditions in every cell type in my dataset, which I suspected was due to ambient RNA. SoupX greatly improved the consistency of clustering across different resolutions after integrating the samples. However, I have noticed that many of the new DEGs appear to be spurious. For example, Plp1, a gene that is expressed specifically in oligodendrocytes, is now DE in microglia between conditions: it is expressed in 64% and 75.5% of microglia in the two conditions respectively!

Plp1 and the other spurious DEGs are highly expressed in the soup and are among the genes most adjusted by SoupX. I suspect that these DEGs are introduced by inconsistent processing between samples: a larger fraction of the soup was (presumably) removed in some samples than in others. Comparing my results with the literature, I believe that before SoupX the DEGs were genuinely DE between conditions, but not necessarily DE in those specific cells, whereas now some of the DEGs are neither genuinely DE between conditions nor genuinely expressed in those cells.
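As a rough illustration of this mechanism, here is a back-of-the-envelope sketch in Python (it is not SoupX code). It assumes that Plp1 counts in microglia are purely leftover ambient RNA and are Poisson distributed, so the fraction of cells with at least one count is 1 - exp(-mean). The Poisson model and the function names are simplifying assumptions made for illustration; the only inputs taken from the data are the 64% and 75.5% detection rates quoted above.

```python
import math


def detection_fraction(mean_residual_counts):
    """Fraction of cells with >= 1 count if counts are Poisson with this mean."""
    return 1.0 - math.exp(-mean_residual_counts)


def implied_mean_counts(detected_fraction):
    """Invert the Poisson detection formula: mean = -ln(1 - fraction)."""
    if not 0.0 <= detected_fraction < 1.0:
        raise ValueError("detected_fraction must be in [0, 1)")
    return -math.log(1.0 - detected_fraction)


# Plp1 detection rates in microglia for the two conditions (from the post above).
mean_a = implied_mean_counts(0.64)
mean_b = implied_mean_counts(0.755)

print(f"implied mean residual Plp1 counts per cell: {mean_a:.2f} vs {mean_b:.2f}")
print(f"ratio between conditions: {mean_b / mean_a:.2f}")
```

Under these assumptions, the two detection rates correspond to roughly 1.0 and 1.4 leftover counts per cell, so a fairly modest difference in how much soup remains in each sample would be enough to produce the gap.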

I used the automated pipeline, and the estimated global rho for each of my samples was 4-6%. This is clearly an underestimate, as a substantial amount of ambient RNA remains, so I am looking into the manual estimation method. However, I'm not sure how to guarantee that all of the samples are adjusted to an equivalent degree. I could set the contamination fraction to be the same for every sample, but I'm not sure whether this would be better or worse, as some samples may contain a higher proportion of ambient RNA than others.

I suspect that part of the issue may be that some empty drops contained a similar amount of RNA to some real cells and so were not correctly filtered out by CellRanger. I usually remove these after SoupX. Will removing these cells and then using DropletUtils::write10xCounts to replace the files in the outs/filtered_feature_bc_matrix folder be sufficient to provide the correct inputs for SoupX?

I'm currently trying different processing methods to see how the output is affected, but in the meantime I would really like to know whether you have encountered the spurious differential expression problem before and if you have any advice on how to resolve it.

constantAmateur commented 3 years ago

This is not something I've encountered, but it is certainly possible if a variable amount of contamination remains in different samples. The automatic estimation procedure should provide consistent results across samples but, as with any method, it won't be accurate 100% of the time. I would check whether the samples with high remaining Plp1 expression are the ones with the lower estimated contamination fractions. If this is the case, you could consider manually setting the contamination fraction for those samples to a value more in line with the average of your other samples.
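The check suggested above can be sketched as a rank correlation between each sample's estimated rho and its remaining Plp1 detection rate. This is a minimal, self-contained Python sketch rather than anything provided by SoupX: the sample names and all numbers are hypothetical placeholders, and the helper names are made up for illustration.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_rho_of_others(rho_by_sample, sample):
    """Average estimated rho over every sample except the named one."""
    others = [r for s, r in rho_by_sample.items() if s != sample]
    return sum(others) / len(others)


# HYPOTHETICAL placeholder values, not real data:
# estimated rho per sample, and fraction of microglia with Plp1 detected.
rho_by_sample = {"s1": 0.040, "s2": 0.045, "s3": 0.055, "s4": 0.060}
plp1_by_sample = {"s1": 0.78, "s2": 0.74, "s3": 0.66, "s4": 0.63}

names = sorted(rho_by_sample)
rs = spearman([rho_by_sample[n] for n in names],
              [plp1_by_sample[n] for n in names])
print(f"Spearman correlation (rho vs remaining Plp1): {rs:.2f}")
print(f"candidate manual rho for s1: {mean_rho_of_others(rho_by_sample, 's1'):.3f}")
```

A strongly negative correlation would support the idea that under-corrected samples drive the remaining Plp1 signal. With only a handful of samples it is a sanity check rather than a test. The candidate value would then be what gets passed to SoupX's `setContaminationFraction` for that sample.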

It sounds like your estimated contamination is pretty low generally. I consider 2% basically the floor of how low the contamination is likely to get. You see this level of contamination even in highly controlled experiments with cell lines. So I would say there's probably scope to manually set the contamination to something higher (potentially even setting 5-10% for all samples) without much harm.

Keep in mind that the contamination estimate (either manual or automatic) has an uncertainty of at least 1-2% even in the best of circumstances. You can see this if you look at the plot produced by autoEstCont or at the full width at half maximum value it reports (sc$fit$rhoFWHM).
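One way to use that uncertainty is to ask whether the per-sample estimates are actually distinguishable. The sketch below is plain Python, not SoupX code. It assumes each sample's full width at half maximum has been copied out as a (lower, upper) pair of rho values; the sample names and numbers are hypothetical placeholders.

```python
def intervals_overlap(a, b):
    """True if two closed intervals (lo, hi) share at least one point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


def common_range(intervals):
    """Range of rho consistent with every interval, or None if there is none."""
    lo = max(lower for lower, _ in intervals)
    hi = min(upper for _, upper in intervals)
    return (lo, hi) if lo <= hi else None


# HYPOTHETICAL (lower, upper) FWHM bounds on rho for three samples.
fwhm_by_sample = {
    "s1": (0.030, 0.055),
    "s2": (0.040, 0.070),
    "s3": (0.045, 0.080),
}

shared = common_range(list(fwhm_by_sample.values()))
if shared is None:
    print("no single rho is consistent with every sample's interval")
else:
    print(f"a shared rho between {shared[0]:.3f} and {shared[1]:.3f} "
          "is consistent with all samples")
```

If the intervals overlap, the differences between the point estimates are within the noise, and using one common contamination fraction for all samples is defensible. If they do not, the samples probably do differ in contamination.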