constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data

Setting optimum threshold for contamination #42

Closed: tomthomas300 closed this issue 3 years ago

tomthomas300 commented 4 years ago

Firstly, I wanted to say thank you for creating such an excellent package, and a great paper. The principle of using the empty droplets to build a profile of ambient RNA expression is very intuitive!

My question is around what to do when the estimation genes are present in every cell in certain samples. In my case, I am working with gut tissue, and using IG genes to assess contamination. In some gut samples, I keep getting the message from SoupX that there are no non-expressing cells. This is biologically not possible because I have all cell types in my samples (even T-cells haha) unless of course contamination is involved. The below command produces this notification:

estimateNonExpressingCells(sc, nonExpressedGeneList = list(IG = igGenes))

This happens when I provide the standard 10X cellranger clustering, and the suggestion from the SoupX package is to set clusters = FALSE. However, in other gut samples, the provided 10X cellranger clusters work well. My options for moving forward seem to be:

i. set an artificially high contamination rate for all samples (say 0.1 as per your vignette)

ii. set clusters = FALSE in all samples, treating each cell as its own cluster (in samples where providing 10X clustering works, I find that setting clusters = FALSE doubles the estimated contamination fraction, but this is still well below the 0.1 in option i.)

iii. set an artificially high contamination rate (0.1) for samples where I bump into the 'no non-expressing cells' problem, but provide the standard 10X cellranger clustering info for samples where this is not an issue

iv. set clusters = FALSE, instead of setting an artificially high contamination rate, for samples where I bump into the 'no non-expressing cells' problem, but provide the standard 10X cellranger clustering info for samples where this is not an issue

I am hesitant to pursue options iii. and iv. as these introduce artificial effects by treating a particular subset of samples differently (please correct me if I am wrong here). Part of my concern is that I believe these 'highly contaminated samples' are inflamed gut tissue, so there may be a biological reason behind this separation! So, I would rather treat all samples homogeneously. Do you think this is correct?

The second and more important question then becomes: which is the safer option, i. or ii., especially given that I need to do differential expression downstream?

Your comment in https://github.com/constantAmateur/SoupX/issues/32 suggests you might prefer an artificially high threshold, but in the setting outlined above, would you still pick an artificially high threshold over setting clusters = FALSE and treating each cell as its own cluster?

Also, in both i. and ii. is it ok to pass clustering information at the adjustCounts step despite not having given clusters for the preceding estimateNonExpressingCells step?

Thank you ever so much again!

constantAmateur commented 3 years ago

I know this issue is quite old now and you've probably settled on a solution, but I'm replying in case it is of use to others. My current suggestion would be to try the automated autoEstCont function to calculate the contamination fraction. But if issues persist, I think that:

  1. You're right to want to treat all samples consistently.
  2. Before setting clusters = FALSE, I would try changing the parameters of estimateNonExpressingCells so that it less aggressively concludes that cells genuinely express a gene set. To do this, either increase maximumContamination if it's set to something below 1, or decrease FDR.
  3. The advice to artificially set the contamination fraction high depends very much on your biological question. Setting the contamination fraction higher than the truth will over-correct the data and remove real counts. This can be a price worth paying in some cases, but that's a choice you need to make. I will say that in most (chromium 10X, 3' with V3 chemistry) experiments I've looked at contamination in the range 5-10% is the norm, so 10% is not crazily high in that context.
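
As a rough sketch of point 2, the helper below wraps the manual estimation route with a stricter FDR. It assumes `sc` is a SoupChannel that already has clusters attached and `igGenes` is a character vector of IG gene names. The helper name `estimateRhoRelaxed` and the value `FDR = 0.01` are illustrative assumptions, not recommendations.

```r
# Sketch only. Assumptions (not from the thread):
#   - `sc` is a SoupChannel, e.g. created with SoupX::load10X(), with clusters set
#   - `igGenes` is a character vector of immunoglobulin gene names
#   - FDR = 0.01 is an arbitrary illustrative value
estimateRhoRelaxed <- function(sc, igGenes, maximumContamination = 1, FDR = 0.01) {
  geneList <- list(IG = igGenes)

  # A lower FDR (and maximumContamination = 1) makes it harder to conclude
  # that a cell genuinely expresses the gene set, so more cells remain
  # available for estimating the contamination fraction.
  useToEst <- SoupX::estimateNonExpressingCells(
    sc,
    nonExpressedGeneList = geneList,
    maximumContamination = maximumContamination,
    FDR = FDR
  )

  # Returns the SoupChannel with the estimated contamination fraction stored.
  SoupX::calculateContaminationFraction(sc, geneList, useToEst = useToEst)
}
```

The automated route would be tried before this, along the lines of `sc <- SoupX::autoEstCont(sc)`, with the helper used only for samples where that does not give a usable estimate.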

In your particular case, it sounds like you have multiple samples that are biologically fairly similar and generated using the same technology/protocols. In such cases, I've found that the contamination rate tends to be fairly consistent between samples. You might just want to manually set the contamination fraction for the samples where the cluster-based estimation gives problems to the average of the ones that work well.
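
A minimal sketch of that averaging step, assuming you have collected the per-sample estimates into a named numeric vector with NA for the samples where estimation failed. The helper name `fillMissingRho`, the sample names, the `scList` object and the numbers are all hypothetical.

```r
# rhoBySample: named numeric vector of estimated contamination fractions,
# with NA marking samples where the estimation failed.
fillMissingRho <- function(rhoBySample) {
  ok <- !is.na(rhoBySample)
  if (!any(ok)) stop("No sample has a usable contamination estimate")
  # Replace failed samples with the mean of the samples that worked.
  rhoBySample[!ok] <- mean(rhoBySample[ok])
  rhoBySample
}

# Hypothetical values, for illustration only.
rho <- fillMissingRho(c(s1 = 0.04, s2 = 0.06, s3 = NA))

# The filled-in value could then be applied to the failed sample, e.g.
# (assuming a hypothetical named list `scList` of SoupChannel objects):
# scList[["s3"]] <- SoupX::setContaminationFraction(scList[["s3"]], rho[["s3"]])
```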