Use scDblFinder - GitHub issues

LTLA commented 3 years ago

Demo scDblFinder's main function in the doublet section. Also show how to derive hard yes/no doublet calls from scores.
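A minimal sketch of the score-to-call step, written in Python with hypothetical names. It is not scDblFinder's API, and both the fixed cutoff and the expected-rate alternative are assumptions made for illustration.

```python
def call_doublets(scores, threshold=None, expected_rate=None):
    """Turn per-cell doublet scores into hard True/False calls.

    Exactly one of `threshold` (fixed score cutoff) or `expected_rate`
    (fraction of cells assumed to be doublets) must be given. Both
    parameters are hypothetical, for illustration only.
    """
    if (threshold is None) == (expected_rate is None):
        raise ValueError("give exactly one of threshold or expected_rate")
    if threshold is None:
        # Call the top-scoring fraction of cells; ties at the cutoff
        # are all called, so slightly more cells may be flagged.
        n_calls = round(expected_rate * len(scores))
        if n_calls == 0:
            return [False] * len(scores)
        threshold = sorted(scores, reverse=True)[n_calls - 1]
    return [s >= threshold for s in scores]


scores = [0.05, 0.9, 0.2, 0.7, 0.1]
fixed_calls = call_doublets(scores, threshold=0.5)
rate_calls = call_doublets(scores, expected_rate=0.2)
```

A fixed cutoff is simple but depends on how the scores are scaled. The rate-based variant instead ties the number of calls to a prior expectation of the doublet frequency.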

LTLA commented 3 years ago

@plger see whether my commentary in 046c1887d61ac2167cf0ea32e4bfae8a804d9068 is fair.

plger commented 3 years ago

Looks good to me! I made a PR that includes the co-expression score in the initial score (the code is now inside the package, and since it's computationally inexpensive and makes things a little more robust, I'll be removing the option to disable it) and fixes a typo. Also, about "interpreting these scores in the context of cluster annotation": I normally do it on both levels, i.e. remove doublets called by scDblFinder, and remove clusters where most cells were doublets. At any rate, it might be important not to give the reader the impression that doublets will necessarily form their own clusters, so I just added a little something to this effect (I know you're already warning about this somewhat). Feel free to accept or not...
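The two-level filtering described here (drop called doublets, then drop clusters that are mostly doublets) could be sketched as follows. This is Python with hypothetical names; the 0.5 cutoff is an assumed reading of "most cells".

```python
from collections import defaultdict


def cells_to_keep(is_doublet, cluster, max_cluster_fraction=0.5):
    """Return one True/False flag per cell: True means the cell is kept.

    A cell is dropped if it was called a doublet, or if it sits in a
    cluster whose doublet fraction exceeds `max_cluster_fraction`
    (a hypothetical parameter standing in for "most cells").
    """
    totals = defaultdict(int)
    doublets = defaultdict(int)
    for d, c in zip(is_doublet, cluster):
        totals[c] += 1
        doublets[c] += int(d)
    # Clusters dominated by called doublets are removed wholesale.
    bad = {c for c in totals if doublets[c] / totals[c] > max_cluster_fraction}
    return [(not d) and (c not in bad) for d, c in zip(is_doublet, cluster)]


is_doublet = [False, True, True, True, False, False]
cluster = ["A", "A", "B", "B", "B", "C"]
keep = cells_to_keep(is_doublet, cluster)
```

In this toy input, cluster B is two-thirds doublets, so its one uncalled cell is removed as well. Cluster A sits exactly at the cutoff and only loses its called doublet.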

Finally, regarding this question of mixing proportions, I noticed that removing the highest and lowest libsize percentiles when generating artificial doublets improves performance, as expected. I therefore tried to establish the mixing proportions by steering the library sizes towards the clusters' median, thinking that the latter offered a better guide to the actual RNA amount, but no matter how I did it this decreased performance, which rather puzzled me... I don't suppose you have a good dataset with spike-ins and SNP-based doublets?
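As a rough illustration of the percentile trimming mentioned above, here is a Python sketch with hypothetical names. It simply sums the counts of two randomly chosen cells after excluding the extremes of the library-size distribution. That is an assumption about the general approach, not a description of scDblFinder's actual implementation.

```python
import random


def simulate_doublets(counts, n_doublets, trim=0.01, seed=0):
    """Create artificial doublets by summing pairs of real cells.

    counts: list of per-cell count vectors (equal-length lists of ints).
    trim:   fraction of cells excluded at each end of the library-size
            distribution before sampling (hypothetical parameter).
    """
    libsizes = [sum(c) for c in counts]
    order = sorted(range(len(counts)), key=lambda i: libsizes[i])
    k = int(trim * len(counts))
    # Drop the k smallest and k largest libraries from the sampling pool.
    eligible = order[k:len(order) - k]
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        i, j = rng.sample(eligible, 2)  # two distinct source cells
        doublets.append([a + b for a, b in zip(counts[i], counts[j])])
    return doublets


toy_counts = [[i, i] for i in range(1, 11)]  # library sizes 2, 4, ..., 20
toy_doublets = simulate_doublets(toy_counts, n_doublets=5, trim=0.1)
```

With `trim=0.1` on ten cells, the smallest and largest libraries are never used as sources, so every simulated doublet is built from mid-range cells.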

LTLA commented 3 years ago

> Feel free to accept or not...

Done. Fiddled with the wording a bit, but otherwise merged.

> I don't suppose you have a good dataset with spike-ins and SNP-based doublets?

Not beyond some toy examples - as in, real datasets but those explicitly generated to look at spike-ins.

I think the biggest "problem" with the use of spike-ins for doublet removal is that spike-ins are only available for plate-based protocols where the frequency of doublets from cell sorting is much lower. I guess that this is because they have already done their own doublet exclusion based on the forward/side scatter as it comes out of the machine.

The medians idea is a sensible one, though I'm also not surprised that it didn't work as well as you hoped. I have suspected that the cluster-average library sizes are not linear functions of the population's RNA content, based on observations of the library sizes of doublet and source clusters from findDoubletClusters(). I'll guess that some reagent is being exhausted inside the droplet; at which point it doesn't matter how much extra RNA content you add, you're probably going to get the same library size.

plger commented 3 years ago

That's interesting. Under the reagent exhaustion hypothesis I suppose we'd get some kind of plateau, and since this isn't visible in the libsize distributions of populations (which do seem log-normal), I guess that means that most of the libsize variation within a population happens downstream of these initial reactions? (And yet somehow upstream of amplification, since it's in the UMIs?) Or could it be that the amount of reagent per droplet varies in the first place?

LTLA commented 3 years ago

Possibly. In the most extreme case, you could imagine that the limiting factor is the reverse transcriptase activity; all other things being equal (same activity in each droplet, all other reagents in excess), you would get the same library size regardless of the total RNA content of each cell. In practice, this is probably some complicated function of the amount of RNA, primer and free dNTPs; while I don't remember all my enzyme kinetics, it's not hard to imagine that this function is not linear with respect to the amount of RNA in the cell. Throw in the droplet-to-droplet variation in RT processivity/reagent molarity and we get what we see.
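To make the non-linearity concrete, here is a toy Python sketch. The saturating (Michaelis-Menten-like) form and both constants are pure assumptions chosen for illustration; nothing in this thread establishes the real functional form.

```python
def expected_library_size(rna, capacity=20000.0, half_saturation=10000.0):
    """Toy saturating model: library size plateaus at `capacity`.

    Both parameters are hypothetical. `half_saturation` is the RNA
    amount at which half of `capacity` is captured.
    """
    return capacity * rna / (half_saturation + rna)


single = expected_library_size(10000.0)   # one cell's worth of RNA
doublet = expected_library_size(20000.0)  # two cells' worth of RNA
ratio = doublet / single                  # falls below 2 under saturation
```

Under any model of this shape, doubling the RNA content yields less than double the library size, which is the kind of sub-additivity the reagent exhaustion idea would predict.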

plger commented 3 years ago

yeah okay, I guess the upshot is that it's probably too messy to do much more than we're already doing :)

LTLA commented 3 years ago

Agreed. Might be worth checking the ratios of the average library sizes between the doublet clusters and their putative sources in your other test datasets, to see how general this phenomenon is. (Assuming you know or are pretty confident of the sources for each doublet cluster.) Then at least we'd know what to blame when the methods don't perform as well.
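The check suggested here could look something like the following Python sketch. The names are hypothetical, and it assumes the two source clusters of each doublet cluster are already known.

```python
from statistics import mean


def doublet_to_source_ratio(libsizes, cluster, doublet_cluster, sources):
    """Ratio of a doublet cluster's mean library size to the sum of the
    mean library sizes of its putative source clusters.

    A ratio near 1 is what additive library sizes would give; a ratio
    well below 1 would be consistent with saturation.
    """
    def cluster_mean(name):
        # Raises statistics.StatisticsError if the cluster has no cells.
        return mean(s for s, c in zip(libsizes, cluster) if c == name)

    expected = sum(cluster_mean(s) for s in sources)
    return cluster_mean(doublet_cluster) / expected


libsizes = [1000, 1200, 2000, 2200, 2400, 2600]
cluster = ["A", "A", "B", "B", "D", "D"]
ratio_d = doublet_to_source_ratio(libsizes, cluster, "D", ["A", "B"])
```

Comparing this ratio across datasets would show whether the sub-additive library sizes are a general phenomenon or specific to a few experiments.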

Bioconductor / OrchestratingSingleCellAnalysis

Use scDblFinder #51