Use singleR to detect doublets?

akramdi commented 4 years ago

Hello,

I was wondering if singleR can be used to detect doublets in a single cell experiment.

SingleR returns the best label from a reference. Is there a way to get the second-best label for a given cell, to confirm a suspicion of a doublet (a suspicion based on observed gene co-expression that indicates two cell types under the same barcode)? More broadly, how can we take advantage of singleR to get a feeling for whether a cell is potentially a doublet?

This may not be what singleR was meant for but I'd love to hear your thoughts about this.

Thanks a lot, Amira

LTLA commented 4 years ago

It's possible to get the second-best label from digging around in the scores matrix of the output. However, there is no guarantee that it will give you anything meaningful with respect to doublets. I can already think of a few counterexamples off the top of my head:

If you have multiple related labels in your reference (e.g., CD4 memory, CD4 effector and so on), they will probably get similar scores during assignment. A doublet that is assigned to one of these labels will probably get its relative as the second-best score, pushing out the other contributor.
If the doublet's expression profile is "similar" to that of another reference population, you'll get that as the assigned label rather than any first-second combo of the two actual contributors. It doesn't even have to be all that similar, just more similar than either of the two contributors.
It would be hard to choose a threshold on the score to decide whether or not a cell is a doublet. If you already knew it was a doublet, perhaps SingleR() could be used to identify the contributors, but this is not of much interest because we just want to get rid of the doublets.
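To make the first counterexample concrete, here is a minimal toy sketch in plain Python. It is not the SingleR API: the score values, the cell names, and the `top_two` helper are all made up for illustration, and the nested dictionary only stands in for the scores matrix (one row per cell, one column per reference label).

```python
# Toy per-cell score table with hypothetical values. It stands in for the
# scores matrix of the output: one row per cell, one column per label.
scores = {
    "cell_1": {"CD4_memory": 0.62, "CD4_effector": 0.60, "Neurons": 0.21},
    # Imagine cell_2 is really a T cell + neuron doublet.
    "cell_2": {"CD4_memory": 0.48, "CD4_effector": 0.47, "Neurons": 0.45},
}

def top_two(cell_scores):
    """Return the (best, second-best) labels ranked by score."""
    ranked = sorted(cell_scores, key=cell_scores.get, reverse=True)
    return ranked[0], ranked[1]

for cell, cell_scores in scores.items():
    print(cell, top_two(cell_scores))
```

In this toy table the second-best label for `cell_2` is still the closely related CD4 label, so the hypothetical neuronal contributor never shows up in the top two.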

Rather, I would suggest you use dedicated tools for removing doublets. A survey of some approaches is available in the OSCA book; similarly, you can also use scDblFinder for doublet simulation.

akramdi commented 4 years ago

Thank you, I really appreciate your detailed response.

I actually tried a couple of tools dedicated to doublet detection (DoubletFinder, Scrublet), with which I'm able to remove some but not all doublets. This is why I thought of singleR to help me detect/remove the remaining doublets. I didn't know about scDblFinder, I'll give it a go.

In my case, my doublet suspicion is very precise and I'm looking to confirm/refute it with singleR. I think I have doublets made up of noradrenergic cells (tumor cells, which make up the majority of the sample) and normal cells from the microenvironment. I dug around in the scores matrix to get a feeling and I'm getting interesting results:

Most of the cells that are labelled as "Neurons" have "Neuroepithelial_cell" as second best (or the other way around), I am OK with these cells. Even if these are doublets, they are likely homotypic and these are difficult to detect anyway.
Very few (less than 1%) have "Neurons" or "Neuroepithelial_cell" as the first label and an unrelated second label (e.g., Endothelial_cells, Pro-B_cell_CD34+), or the other way around.
I've also looked at the few cells showing poor or ambiguous assignments (labelled with an NA value in the pruned.labels field). I would be tempted to consider these as potential doublets too, along with the ones found in the previous point.
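Roughly, the filtering described in the three points above could be sketched as follows. This is a toy illustration in plain Python, not SingleR code: the `RELATED` grouping of labels into lineages and the `is_suspect` helper are assumptions made for the example, and `None` stands in for an NA pruned label.

```python
# Hypothetical grouping of reference labels into broad lineages.
# The grouping itself is an assumption made for this example.
RELATED = {
    "Neurons": "neural",
    "Neuroepithelial_cell": "neural",
    "Endothelial_cells": "endothelial",
    "Pro-B_cell_CD34+": "lymphoid",
}

def is_suspect(best, second, pruned):
    """Flag a cell if its pruned label is missing (None stands in for NA)
    or if its two top labels belong to different lineage groups."""
    if pruned is None:
        return True
    return RELATED[best] != RELATED[second]

# (best label, second-best label, pruned label) for three made-up cells.
cells = [
    ("Neurons", "Neuroepithelial_cell", "Neurons"),
    ("Neurons", "Endothelial_cells", "Neurons"),
    ("Neuroepithelial_cell", "Neurons", None),
]
flags = [is_suspect(*cell) for cell in cells]
print(flags)  # [False, True, True]
```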

Does this way of exploring/interpreting the results make sense?

You've mentioned interesting points about the score threshold to consider and I'm also thinking that the results might be influenced by the diversity of the chosen reference (I am working with HumanPrimaryCellAtlasData()).

LTLA commented 4 years ago

Does this way of exploring/interpreting the results make sense?

Maybe. As I said before, I could see how it could work, but I could also see how it might not work, and so it's hard to say. I think you would be better off using dedicated doublet detection approaches.

If you've got good enough clusters that your doublets fall into their separate cluster, consider using scran::doubletCluster; this will assemble evidence that a cluster does not consist of doublets of two other populations, and if you don't have strong evidence, well, it's probably doublets. This is a lot easier to interpret and has fewer assumptions than the simulation-based methods, but it assumes that you have reasonable clusters that distinguish doublets and their parents.
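The idea behind the cluster-based check can be illustrated with a heavily simplified stand-in. This is not scran's actual test: the gene names, the mean log-expression values, the `margin` threshold, and the `n_unique_genes` helper are all hypothetical. It counts genes whose mean in the query cluster falls outside the range spanned by two candidate parent clusters; a count near zero means there is little evidence that the cluster is anything other than a mix of the two.

```python
def n_unique_genes(query, parent_a, parent_b, margin=1.0):
    """Count genes whose mean in the query cluster lies outside the range
    spanned by the two candidate parents by more than `margin`."""
    n = 0
    for gene, q in query.items():
        lo = min(parent_a[gene], parent_b[gene])
        hi = max(parent_a[gene], parent_b[gene])
        if q > hi + margin or q < lo - margin:
            n += 1
    return n

# Hypothetical mean log-expression per cluster (illustrative values only).
tumor = {"PHOX2B": 5.0, "CD3E": 0.0, "PECAM1": 0.1, "GENE_X": 0.2}
t_cell = {"PHOX2B": 0.0, "CD3E": 4.5, "PECAM1": 0.2, "GENE_X": 0.1}
mixed = {"PHOX2B": 3.0, "CD3E": 2.5, "PECAM1": 0.1, "GENE_X": 0.3}
distinct = {"PHOX2B": 0.1, "CD3E": 0.0, "PECAM1": 4.0, "GENE_X": 3.5}

print(n_unique_genes(mixed, tumor, t_cell))     # 0
print(n_unique_genes(distinct, tumor, t_cell))  # 2
```

Here `mixed` has no gene of its own, which is consistent with a doublet cluster, while `distinct` has two genes that neither parent explains.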

(Incidentally, it is not surprising that you cannot remove all doublets with simulation-based methods. They make so many assumptions about how doublets form that it's a wonder that they "work" at all.)

But honestly, if you already know the offending cluster, just look for two mutually exclusive markers for the putative contributing populations and show that they are co-expressed in the doublets. If your neurons are expressing the T cell receptor, I think that's a pretty strong case for being a doublet.
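As a rough illustration of that check, here is a toy sketch in plain Python. The counts, the marker choice, the `min_count` threshold, and the `coexpressing` helper are all made up for the example; in practice the counts would come from your own expression matrix.

```python
# Hypothetical counts per cell for two markers expected to be mutually
# exclusive between the putative contributing populations.
counts = {
    "cell_1": {"PHOX2B": 12, "CD3E": 0},
    "cell_2": {"PHOX2B": 0, "CD3E": 7},
    "cell_3": {"PHOX2B": 9, "CD3E": 5},
}

def coexpressing(counts, marker_a, marker_b, min_count=1):
    """Return the cells in which both markers reach `min_count`."""
    return [
        cell
        for cell, c in counts.items()
        if c[marker_a] >= min_count and c[marker_b] >= min_count
    ]

print(coexpressing(counts, "PHOX2B", "CD3E"))  # ['cell_3']
```

The `min_count` threshold is arbitrary here and would need to be chosen with the data in hand.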

akramdi commented 4 years ago

I think you would be better off using dedicated doublet detection approaches.

I think so too. Also, looking at co-expression patterns sounds very reasonable in my case indeed, I'll explore this.

btw, thanks a lot for the link to the OSCA book, what a gold mine!

Best,

SingleR-inc / SingleR

Use singleR to detect doublets? #131