constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

SoupX high contamination guideline #154

Open pedriniedoardo opened 2 months ago

pedriniedoardo commented 2 months ago

Hello, thank you very much for developing this great tool! I have encountered an issue while processing a sample that I knew was problematic from the start (there were loading issues during sample preparation). Specifically, autoEstCont estimates an extremely high rho.

sc1 <- autoEstCont(sc)
5380 genes passed tf-idf cut-off and 2448 soup quantile filter.  Taking the top 100.
Using 583 independent estimates of rho.
Estimated global rho of 0.63
 Error in setContaminationFraction(sc, contEst, forceAccept = forceAccept) : 
  Extremely high contamination estimated (0.63).  This likely represents a failure in estimating the contamination fraction.  Set forceAccept=TRUE to proceed with this value.

The run fails. This is the distribution of the estimated rho values: (image)

I then tried to follow the recommendation given by @constantAmateur here: https://github.com/constantAmateur/SoupX/issues/60#issuecomment-717049653

I have two options, and I am not sure which one is more appropriate.

  1. I used setContaminationFraction to set rho manually to 0.1, which is roughly the center of the first peak in the distribution from the failed autoEstCont call.

    sc3 <- setContaminationFraction(sc,contFrac = 0.1)
  2. I used the contaminationRange argument of autoEstCont to narrow the allowed range for the estimates.

    sc4 <- autoEstCont(sc,contaminationRange = c(0.01,0.5))
    5380 genes passed tf-idf cut-off and 2448 soup quantile filter.  Taking the top 100.
    Using 267 independent estimates of rho.
    Estimated global rho of 0.06

    The run does not fail. This is the resulting distribution of the estimated rho values: (image)
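
For concreteness, the two approaches could be compared side by side roughly as follows. This is only a sketch: it assumes `sc` is the SoupChannel from above with clustering information already attached, and `compare_rho_strategies` is a hypothetical helper name of mine, not part of SoupX. The fraction of counts removed is included purely as a sanity check on what each rho does to the data.

```r
# Sketch only. `compare_rho_strategies` is a hypothetical helper, not a SoupX function.
# Assumes `sc` is a SoupChannel with clusters set (as required by autoEstCont).
compare_rho_strategies <- function(sc, manual_rho = 0.1, range = c(0.01, 0.5)) {
  # Approach 1: fix the contamination fraction by hand.
  sc_manual <- SoupX::setContaminationFraction(sc, contFrac = manual_rho)

  # Approach 2: re-estimate rho, restricting which estimates are allowed.
  sc_auto <- SoupX::autoEstCont(sc, contaminationRange = range, doPlot = FALSE)

  # Correct the counts under each choice of rho.
  out_manual <- SoupX::adjustCounts(sc_manual)
  out_auto <- SoupX::adjustCounts(sc_auto)

  # Total counts in the uncorrected table of counts.
  raw_total <- sum(sc$toc)

  data.frame(
    approach = c("manual", "restricted_range"),
    rho = c(unique(sc_manual$metaData$rho), unique(sc_auto$metaData$rho)),
    frac_counts_removed = c(1 - sum(out_manual) / raw_total,
                            1 - sum(out_auto) / raw_total)
  )
}

# Usage, with `sc` as in the post:
# compare_rho_strategies(sc, manual_rho = 0.1, range = c(0.01, 0.5))
```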

I was wondering:

a) Does it make sense to force a manual rho in this case, or would you discard the sample?
b) Would you suggest approach 1 or approach 2? I noticed that the distribution from approach 2 differs from the one produced by the failed autoEstCont run.
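
If a manual rho turns out to be acceptable, the "center of the first peak" could be picked reproducibly instead of by eye. Below is a minimal base R sketch. `first_density_peak` is a hypothetical helper of mine. It takes the individual rho estimates as a plain numeric vector; I believe these end up in `sc$fit$dd$rhoEst` after a run with forceAccept=TRUE, but that is an assumption about SoupX internals that I have not verified. The toy vector at the end is synthetic and only exercises the function; it is not my data.

```r
# Hypothetical helper: return the lowest-rho local maximum of a kernel
# density estimate, ignoring bumps below `min_rel_height` of the tallest peak.
first_density_peak <- function(rho, min_rel_height = 0.1) {
  rho <- rho[is.finite(rho)]
  stopifnot(is.numeric(rho), length(rho) >= 2)

  # Evaluate the density on a grid over the valid range of rho.
  d <- density(rho, from = 0, to = 1, n = 512)
  y <- d$y

  # Interior grid points that are strictly higher than both neighbours.
  peaks <- which(diff(sign(diff(y))) == -2) + 1

  # Discard negligible bumps.
  peaks <- peaks[y[peaks] >= min_rel_height * max(y)]
  if (length(peaks) == 0) return(NA_real_)

  d$x[peaks[1]]
}

# Synthetic two-peak example (made-up values, for illustration only).
set.seed(1)
toy_rho <- c(rnorm(300, mean = 0.1, sd = 0.03), rnorm(280, mean = 0.7, sd = 0.08))
manual_rho <- first_density_peak(toy_rho)
```

The result could then be passed to setContaminationFraction as contFrac, in place of the 0.1 I read off the plot.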