broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

seeking guidance for the rank to set for empty droplets #333

Open Domoun opened 4 months ago

Domoun commented 4 months ago

Hi, First, I would like to thank you for this amazing tool. I have tried several packages for ambient RNA removal and so far, CellBender is really the best in class!

Nonetheless, I am having trouble with a couple of samples. Although cell viability was high and the raw data cluster well, the barcode rank plot is unusual, with no clear cliff preceding the background:

[Image: 10x]

Consequently, CellBender has a hard time distinguishing cells from empty droplets.

I have found that setting the expected-cells argument to the number detected by CellRanger (which is consistent with the actual number of cells loaded/recovered) helps identify true cells. However, decontamination was minimal, if there was any, because by default CellBender set the threshold for empty droplets after the knee (60K).

Therefore, I tried setting the threshold myself with total-droplets-included at 27K. This really improved the removal of ambient transcripts, but some genes (which I know are not supposed to be ubiquitously expressed) were still insufficiently decontaminated. I then ran a new test using the rank at which CellRanger identified the last cell (24K). This improved the decontamination further, but it is still not as good as what I get with other samples. After integrating all datasets, the partial decontamination really impairs the clustering.

I am already using FPR = 0.05, and I would rather lower the rank for empty droplets again than increase the FPR. Would you have any recommendation about the choice of total-droplets-included? Can I set a number that is within the low-probability range according to the CellRanger output? Is there an objective and robust way to pick this number with this kind of knee plot?

Please find below the knee plots with the different thresholds tested and the corresponding outputs from the CellBender reports. Thank you so much for your guidance.

[Image: rep3 tests]
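To make the comparison between the candidate ranks (24K, 27K, 60K) concrete, here is a minimal sketch of how one could read off the UMI count at a given barcode rank. It is only an illustration, not CellBender's own method: the function name `umi_at_ranks` is hypothetical, and the counts are synthetic toy values rather than data from these samples.

```python
def umi_at_ranks(umi_counts, ranks):
    """Return {rank: UMI count of the barcode at that 1-based rank}.

    Barcodes are ranked by total UMI count, highest first, as in a
    barcode rank (knee) plot. Ranks outside the data are skipped.
    """
    ordered = sorted(umi_counts, reverse=True)
    return {r: ordered[r - 1] for r in ranks if 1 <= r <= len(ordered)}


# Synthetic toy data (an assumption for illustration only):
# 10 high-count barcodes, 10 intermediate ones, 30 low-count ones.
toy_counts = [5000] * 10 + [800] * 10 + [50] * 30

# Hypothetical candidate ranks, scaled down to the toy data.
summary = umi_at_ranks(toy_counts, [10, 20, 25, 50])
for rank, count in summary.items():
    print(f"rank {rank}: {count} UMIs")
```

With real data, the same lookup at 24K, 27K and 60K would show whether each candidate rank sits on a flat low-count stretch of the curve or still on the slope.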