JonathanShor / DoubletDetection

Doublet detection in single-cell RNA-seq data.
https://doubletdetection.readthedocs.io/en/stable/
MIT License

Cluster of origin for doublets and results interpretation #113

Closed danshu closed 5 years ago

danshu commented 5 years ago

Hi,

I like DoubletDetection and have run it on my datasets. For some samples, I get doublet rates close to 10% and a few (1-4) doublet clusters. I'm curious whether DoubletDetection can also output the clusters of origin for each doublet, which would be very useful. Also, the number of doublet clusters is low relative to the number of clusters in the tSNE plot. For example, if there are 5 major clusters in the tSNE plot, there can be as many as 10 types of cross-type doublets, so how many doublet clusters should I expect to be detected?

Best, Danshu

adamgayoso commented 5 years ago

Hi Danshu,

You can access the indices of the two cells that were combined to create each synthetic doublet through the parents_ attribute. From there, you can probably come up with a creative way of visualizing the distribution of parent type pairs for the cells that DoubletDetection calls doublets. Keep in mind that cells are clustered in an augmented data matrix containing both real and synthetic cells. This means that while you see 1-4 doublet clusters in the original data matrix, there are likely many more in the augmented matrix. Such a plot would therefore require the parent information of each synthetic doublet, as well as the cluster IDs of the synthetic doublets and the real cells during one iteration of DoubletDetection. All of this information is stored as attributes of our classifier. One idea: randomly assign real cells in an enriched cluster the parent information of the synthetic doublets in that same cluster, so that you have parent information in the original data matrix.
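As a rough illustration of the tallying step, here is a minimal sketch in plain Python. It assumes you have already pulled out a list of parent index pairs (one pair per synthetic doublet) and a cluster label for every real cell. The function name `parent_pair_counts` and the toy data are hypothetical and not part of DoubletDetection, and the exact layout of parents_ may differ from the simple list of pairs assumed here.

```python
from collections import Counter


def parent_pair_counts(parent_pairs, cluster_labels):
    """Count how often each unordered pair of parent clusters occurs.

    parent_pairs: iterable of (i, j) indices into the real cells
        (assumed layout; adapt to however you extract parents_).
    cluster_labels: sequence giving a cluster label for each real cell.
    """
    counts = Counter()
    for i, j in parent_pairs:
        # Sort the two labels so that (A, B) and (B, A) are counted together.
        pair = tuple(sorted((cluster_labels[i], cluster_labels[j])))
        counts[pair] += 1
    return counts


# Toy data (hypothetical): five real cells in three clusters,
# and four synthetic doublets described by their parent indices.
labels = ["A", "A", "B", "C", "B"]
pairs = [(0, 2), (4, 1), (0, 1), (3, 4)]
counts = parent_pair_counts(pairs, labels)

for pair, n in counts.most_common():
    print(pair, n)
```

The resulting counts can be fed into a bar chart or heatmap of parent cluster pairs. Pairs with the same label on both sides are same-type doublets.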

As for your example of having 5 major clusters, it's not so simple to bound how many cross-type doublet types we expect to see. It depends on the relative sizes of the 5 major clusters, as well as on the clustering algorithm used. For example, if 2 of the 5 clusters are very small, we would expect few cross-type doublets from this pairing. In the original data matrix, the doublets from this pair may therefore not form a cluster of their own, simply because there are too few of them.
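To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes doublets form by pairing two cells drawn independently at random in proportion to cluster size, which is a simplification and not a claim about real capture biology. The cluster names and sizes are made up for illustration. Under this assumption a cross-type pair (a, b) accounts for a fraction 2 * p_a * p_b of all doublets, where p_a is the fraction of cells in cluster a.

```python
from itertools import combinations
import math

# Hypothetical cluster sizes: three large clusters and two small ones.
sizes = {"c1": 4000, "c2": 3000, "c3": 2500, "c4": 300, "c5": 200}
total = sum(sizes.values())
frac = {name: n / total for name, n in sizes.items()}

# Every unordered pair of distinct clusters is one cross-type doublet type.
cross_pairs = list(combinations(sizes, 2))

# Expected share of all doublets for each cross-type pair under random pairing.
cross_share = {(a, b): 2 * frac[a] * frac[b] for a, b in cross_pairs}

# Same-type doublets make up the remainder.
same_share = {a: frac[a] ** 2 for a in sizes}

print("cross-type doublet types:", len(cross_pairs))
for pair, share in sorted(cross_share.items(), key=lambda kv: -kv[1]):
    print(pair, round(share, 4))
```

With 5 clusters there are 10 possible cross-type pairings, matching your count. In this toy setup the pairing of the two small clusters makes up only about 0.12% of all doublets, so even at a 10% doublet rate in 10,000 cells (about 1,000 doublets) you would expect roughly one such doublet, far too few to form a cluster.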

Please let me know if you have any other questions.

danshu commented 5 years ago

Thanks! I'm working on it now!