Estimating the contamination fraction without biological assumptions using Cell Ranger's default clustering results or ERCC spike-ins

vertesy commented 4 years ago

Hey Matthew,

thanks for the great package. I'm looking for a simple and safe way to implement SoupX package as part of a standard single cell analysis pipeline. Steps would be:

Cell Ranger [Counting and clustering]
SoupX [Soup correction]
Seurat [Analysis]

This means all steps have to be automated, and it has to be reasonably fast. Therefore, I would like a rough (under)estimate of the contamination fraction without biological assumptions and manual tweaking. I largely read the vignette and the manuscript, and I understand your points on manual inspection and gene selection. That being said, I need automation. As you are much more experienced in this topic, I would appreciate your opinion:

Cell Ranger does clustering and differential gene expression analysis (if selected). Would it be possible to use these genes [Cell Ranger output folder]/analysis/diffexp/graphclust/differential_expression.csv and clusters [Cell Ranger output folder]/analysis/clustering/graphclust/clusters.csv to reliably estimating the contamination fraction? If so, what would be your approach?
Additionally, could the total read counts in empty droplets VS. the total read counts in cell containing droplets be used to estimate contamination? (This is probably discussed, and I overlooked it)
Alternatively, would simply adding ERCC spike-ins to the single cell suspension (right before loading it to the 10X controller) be a quick experimental fix? This would generate a set of genes that should be expressed in none of the cells.

On another note, I found that sc$isV3 checking the 10x kit version. As now all experiments run on v3+. What is the difference in processing these? v3 seems to use some information from the Soup to call cells.

Apologies for questions stemming from my ignorance & thanks for reading.

constantAmateur commented 4 years ago

I suggest you have a look at the devel branch of the code which includes a routine to automate contamination estimation. This is still in the devel branch as it has not been widely tested, so I would appreciate your feedback on its performance if you do use it.

The automated workflow would be:

sc = load10X(dat)
contEst = autoEstCont(sc)
sc = setContamination(sc,contEst$rhoEst[1])
correctedCounts = adjustCounts(sc)

With extra steps if you load your data in a different way.

The alternative if you must have automation would be to set some fixed contamination fraction for all channels using setContamination. I would actually recommend trying to set this too high rather than too low (e.g 15-20%) as the biologically important things (the marker genes) are highly expressed relative to the background and basically never removed. This is a risky approach though, so if you're doing this you should make sure you see how your results depend on the constant value you use.

ERCC spike-ins are a potential solution to this, but in practice it is too hard to ensure that you load the right spike-in concentration. Too little spike in and you get no benefit, too much and you spend all your money sequencing spike-ins.

The isV3 tag doesn't do anything after the data is loaded. It's just used to use the file format approapriate to the cellranger version within load10X

mesnger commented 4 years ago

Hello, all. I have quite a similar question as with vertesy.

I tried multiple gene list as "single list" or "combination of lists" for CalculateContaminationFraction and got quite a broad range of outputs (from 1.1% contamination rate to 10.44%). I selected gene lists by differential_expression.csv provided by Cell Ranger and/or Seurat FindAllMarkers. But as the output varies greatly by each gene lists, I could not settle on one result.

Do you have any criteria for the best input gene? Following your tutorial, I assume that "Known Gene Marker " is your go to choice, but when it is not known, what gene should I prioritize on?

A. Genes "exclusively expressed" in a subset of clusters(other clusters have near 0 value) B. Genes "highly expressed" in a subset of clusters C. Genes with "highest difference" in expression value compared to other cluster D. Genes with "high p-value" calculated by some DEG method

or is by considering all of above? (which might not lead to a sound result with real life data)

Thanks for making a great tool. Hope to get a reply soon.

constantAmateur / SoupX

Estimating the contamination fraction without biological assumptions using Cell Ranger's default clustering results or ERCC spike-ins #32