constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

How to proceed when "Extremely high contamination estimated"? #60

Closed: gene-drive closed this issue 3 years ago

gene-drive commented 3 years ago

I've tried running the automated workflow on my dataset and am getting the below message.

sc = load10X("/path/to/output") 
sc = autoEstCont(sc)
# 127 genes passed tf-idf cut-off and 35 soup quantile filter.  Taking the top 35.
# Using 252 independent estimates of rho.
# Estimated global rho of 0.75
# Error in setContaminationFraction(sc, contEst, forceAccept = forceAccept) : 
# Extremely high contamination estimated (0.75).  This likely represents a failure in estimating the contamination fraction.  Set forceAccept=TRUE to proceed with this value.

I'm very new to bioinformatics and scRNA-seq analysis and am wondering how to proceed. What should I do to check whether this estimate is "real" before moving on and correcting the expression profile?

I've been trying to do some of the visual sanity checks mentioned in the vignette, but it seems I first need to use the "manual method" to estimate the contamination fraction. However, after reading through the vignette several times, I'm still confused about the exact code I need to run. I keep running into the error "'x' must be an array of at least two dimensions".

constantAmateur commented 3 years ago

If your data really does have 75% contamination, you probably shouldn't use it as that level of contamination likely indicates something went very wrong in the experiment.

To check, I would look at the plot generated by autoEstCont. Does it have two peaks of roughly equal height, with one being around 0.75? If so, try setting your contamination fraction to the location of the lower peak and proceeding with your analysis.
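A minimal sketch of that step, assuming `sc` is the SoupChannel returned by `load10X` as in the original post. The helper name `applyManualRho` and the value 0.10 are placeholders for illustration only; they are not part of SoupX, and the real value has to be read off your own plot.

```r
library(SoupX)

# Hypothetical helper (not part of SoupX): override the automatic estimate
# with a contamination fraction chosen by eye from the autoEstCont plot.
applyManualRho <- function(sc, rho) {
  # setContaminationFraction() stores a single global rho for all cells
  sc <- setContaminationFraction(sc, rho)
  # adjustCounts() returns the corrected genes-by-cells count matrix
  adjustCounts(sc)
}

# Usage (0.10 is a placeholder for the location of the lower peak):
# sc  <- load10X("/path/to/output")
# out <- applyManualRho(sc, 0.10)
```

The corrected matrix `out` can then be passed to downstream tools in place of the raw counts.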

The other thing I'd do is make extensive use of plotMarkerMap to see what the expression ratio to the soup looks like for a few genes that are common contaminants. Without knowing your experiment it's hard to say what these are likely to be, but haemoglobin (HB) and immunoglobulin (IG) genes usually work.
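As a rough sketch of that check, assuming `sc` already carries reduced-dimension coordinates (recent SoupX versions pick up the cellranger tSNE in `load10X` when it is available). The gene symbols are illustrative HB and IG examples, and `markerMaps` is a hypothetical helper, not a SoupX function.

```r
library(SoupX)

# Illustrative gene sets; swap in genes that make sense for your tissue.
contaminantSets <- list(
  HB = c("HBB", "HBA1", "HBA2"),
  IG = c("IGKC", "IGHA1", "JCHAIN")
)

# Hypothetical helper: build one marker map (a ggplot object) per gene set.
markerMaps <- function(sc, geneSets) {
  lapply(geneSets, function(genes) plotMarkerMap(sc, geneSet = genes))
}

# Usage:
# maps <- markerMaps(sc, contaminantSets)
# print(maps$HB)
```

Each map shows, per cell, how the observed expression of the gene set compares with what pure soup would give, so cells that genuinely express the genes should stand out from those that only carry contamination.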

mibdx-dev commented 3 months ago

Hi,

I am getting the same error with one of my samples: "Error in setContaminationFraction(sc, exp(coef(sc$fit)), forceAccept = forceAccept) : Extremely high contamination estimated (0.61). This likely represents a failure in estimating the contamination fraction. Set forceAccept=TRUE to proceed with this value."

The autoEstCont plot looks very different. I am using only one gene set: nonExpressedGeneList = list(Hep=c("CYP1A2","CYP2E1","CYP3A4","GLUL","DCXR","FTL","GPX2","GSTA1","CYP2A7","FABP1","HAL","AGT","ALDOB","SDS"))

Interestingly, the same sample works when I use a larger collection of gene sets that includes the Hep gene set (see the list below). However, with this larger list some other samples fail.

nonExpressedGeneList = list(
  AntiB=c("IGKC","JCHAIN","IGHA1","IGLC1","IGLC2","IGLC3"),
  MatB=c("CD22","CD37","CD79B","FCRL1","LTB","DERL3","IGHG4"),
  CD3T=c("CD8A","CD8B","CD3D","CD3G","TRAC","IL32","TRBC1","TRBC2"),
  Hep=c("CYP1A2","CYP2E1","CYP3A4","GLUL","DCXR","FTL","GPX2","GSTA1","CYP2A7","FABP1","HAL","AGT","ALDOB","SDS"),
  LSEC=c("FCN2","CLEC1B","CLEC4G","PVALB","S100A13","GJA5","SPARCL1","CLEC14A","PLVAP","EGR3"),
  Eryth=c("HBB","HBA1","HBA2"),
  NKT=c("CSTW","IL7R","GZMB","GZMH","TBX21","HOPX","PRF1","S100B","TRDC","TRGC1","TRGC2","IL2RB","KLRB1","NCR1","NKG7","NCAM1","XCL2","XCL1","CD160","KLRC1"),
  Mac=c("VCAN","S100A8","MNDA","LYZ","FCN1","CXCL8","VCAN","VCAM1","TTYH3","TIMD4","SLC40A1","RAB31","MARCO","HMOX1","C1QC"),
  Chol=c("PROM1","SOX9","KRT7","KRT19","CFTR","EPCAM","CLDN4","CLDN7","ANXA4","TACSTD2"),
  Stel=c("ACTA2","COL1A1","RBP1","TAGLN","ADAMTSL2","GEM","LOXL1","LUM"),
  Endo=c("PECAM1","TAGLN","VWF","FLT1","MMRN1","RSPO3","LYPD2","LTC4S","TSHZ2","IL1R1")
)

So in short, I need to solve this issue and any help will be appreciated.

Thank you

[Attached image: Liver-10_autoEstCont_Just_ToPlot]