constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

When using different marker genes to start, contamination ratios are quite different #86

Closed peachgong closed 2 years ago

peachgong commented 2 years ago

I have samples containing both neurons and non-neurons (the two major cell types here). When I estimate the contamination fraction manually, I use two sets of marker genes, one with markers for non-neurons and the other with markers for neurons, for example:

igGenes = c("Cldn5", "Opalin", "Siglech", "Aqp4", "C1qc", "Gja1") ## non-neuron markers

or

igGenes = c("Syp", "Rbfox3", "Elavl2") ## neural markers

and I got the following marker maps, respectively: [image: Rplot] and [image: Rplot01]

When I then ran calculateContaminationFraction, I got very different results: one is close to 17%, while the other is as high as 45%. I understand that in my dataset, neural markers are widely contaminated across clusters. But my question is: should I correct the expression profiles using 45% instead of 17%?

constantAmateur commented 2 years ago

If I had to guess, I would say 17%. When a set of manually chosen markers is not well suited to estimating the contamination, the result will be an overestimate of the contamination fraction. I suspect that is what is happening with your 45% result.
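
To make the overestimation argument concrete, here is a toy calculation in base R. It is only a sketch: every number is made up, and the simple ratio estimator (observed marker counts divided by the counts expected from pure soup) is a simplification, not SoupX's exact implementation. It shows that if even a minority of the cells used for estimation truly express the marker genes, their endogenous counts are attributed to the soup and the estimate is pushed upward.

```r
# Toy illustration with made-up numbers; not SoupX code.
set.seed(1)
nCells   <- 200
nUMI     <- rep(5000, nCells)  # total UMIs per cell (assumed)
soupFrac <- 0.002              # share of the soup profile taken by the marker set (assumed)
rhoTrue  <- 0.17               # true contamination fraction (assumed)

# Marker counts that come purely from the soup
soupCounts <- rpois(nCells, rhoTrue * nUMI * soupFrac)

# Simplified estimator: observed marker counts / counts expected if a cell were 100% soup
estimateRho <- function(obs) sum(obs) / sum(nUMI * soupFrac)

# Well-suited markers: none of the cells used for estimation express them
rhoGood <- estimateRho(soupCounts)

# Ill-suited markers: a quarter of the cells genuinely express them at a low level
endogenous <- c(rpois(nCells / 4, 10), rep(0, nCells - nCells / 4))
rhoBad <- estimateRho(soupCounts + endogenous)

print(c(good = rhoGood, bad = rhoBad))
```

With well-suited markers the estimate stays near the assumed true value, while the second estimate is inflated by the endogenous expression. The error can only go in one direction, which is why the lower of two manual estimates is usually the more trustworthy one.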

I would also recommend trying the automated contamination estimation method and seeing whether it agrees with either estimate.
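
A rough sketch of how that comparison might be scripted is below. The helper name compareContamination is hypothetical and not part of SoupX. It assumes sc is a SoupChannel that already has clusters attached (for example, one created with load10X or updated with setClusters), and that the estimated fraction can be read back from sc$metaData$rho.

```r
# Sketch only: `compareContamination` is a hypothetical helper, not a SoupX function.
# `sc` is assumed to be a SoupChannel with clustering information attached.
compareContamination <- function(sc, markerSets) {
  # One manual estimate per named marker set
  manual <- vapply(names(markerSets), function(nm) {
    geneList <- markerSets[nm]  # single-element named list, as SoupX expects
    # Pick the cells in which this gene set can be assumed not to be expressed
    useToEst <- SoupX::estimateNonExpressingCells(sc, nonExpressedGeneList = geneList)
    fit <- SoupX::calculateContaminationFraction(sc, geneList, useToEst = useToEst)
    unique(fit$metaData$rho)[1]
  }, numeric(1))

  # Automated estimate, with the diagnostic plot switched off
  auto <- SoupX::autoEstCont(sc, doPlot = FALSE)

  c(manual, auto = unique(auto$metaData$rho)[1])
}

# Example call using the marker sets from this thread (requires your own `sc`):
# markerSets <- list(
#   nonNeuron = c("Cldn5", "Opalin", "Siglech", "Aqp4", "C1qc", "Gja1"),
#   neuron    = c("Syp", "Rbfox3", "Elavl2")
# )
# compareContamination(sc, markerSets)
```

If the automated value lands close to one of the manual estimates, that is reasonable evidence for using it. The chosen fraction can then be fixed with setContaminationFraction before running adjustCounts.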