GregorySchwartz / too-many-cells

Cluster single cells and analyze cell clade relationships with colorful visualizations.
https://gregoryschwartz.github.io/too-many-cells/
GNU General Public License v3.0

Purity section in too-many-cells #35

Closed (stat-hejia closed this issue 3 years ago)

stat-hejia commented 4 years ago

Dear Gregory: Thanks for building such a nice tool. I want to compare the accuracy of clustering algorithms by measuring how closely the clusters match the true labels, which is called 'cluster purity' in your article. I am a bit confused about this part, because I cannot find a purity parameter in the help page of too-many-cells. I tried to run the source code, but I have not learned the programming language used in the 'purity' part, so it is difficult for me at present. Is there a purity parameter in the too-many-cells pipeline that I may have overlooked? Or do you have R code for 'purity' that you could provide for reference? Thanks for your time!

GregorySchwartz commented 4 years ago

The benchmarking is not included in the too-many-cells tool itself. However, you can always use the diversity entry point to get the diversity of labels for the leaf nodes and see whether they are close to 1. For the manuscript, the purity, entropy, and NMI were calculated post-clustering for all algorithms (to be consistent).
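For readers who want to compute such measures themselves, here is a minimal Python sketch of post-clustering purity, entropy, and normalized mutual information (assuming that is the third measure meant above). It is not the code used for the manuscript. The function names, the natural-log entropy, and the arithmetic-mean normalization of the mutual information are assumptions chosen for illustration. The inputs are two equal-length sequences: a cluster assignment per cell and a true label per cell.

```python
import math
from collections import Counter, defaultdict


def _group_labels(clusters, labels):
    """Map each cluster id to a Counter of the true labels it contains."""
    groups = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        groups[cluster][label] += 1
    return groups


def _shannon(counts):
    """Shannon entropy (natural log) of a collection of counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def purity(clusters, labels):
    """Fraction of cells that carry the majority label of their cluster."""
    groups = _group_labels(clusters, labels)
    return sum(max(g.values()) for g in groups.values()) / len(labels)


def cluster_entropy(clusters, labels):
    """Size-weighted mean label entropy per cluster, i.e. H(label | cluster)."""
    groups = _group_labels(clusters, labels)
    n = len(labels)
    return sum(sum(g.values()) / n * _shannon(g.values()) for g in groups.values())


def nmi(clusters, labels):
    """Mutual information divided by the mean of the two marginal entropies.

    The arithmetic-mean normalization is an assumption; other variants exist.
    """
    h_clusters = _shannon(Counter(clusters).values())
    h_labels = _shannon(Counter(labels).values())
    # I(C; L) = H(L) - H(L | C)
    mutual_info = h_labels - cluster_entropy(clusters, labels)
    denom = (h_clusters + h_labels) / 2
    return mutual_info / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical toy data: six cells, two clusters, two true labels.
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["a", "a", "b", "b", "b", "b"]
    print("purity ", purity(clusters, labels))
    print("entropy", cluster_entropy(clusters, labels))
    print("NMI    ", nmi(clusters, labels))
```

Higher purity and NMI, and lower entropy, indicate clusters that agree more closely with the true labels. Applying the same functions to the output of every algorithm keeps the comparison consistent.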

stat-hejia commented 4 years ago

It seems that 'diversity' quantifies the effective number of cell states within a population and can also be used to compare the accuracy of clustering algorithms. Is my understanding right? I read your paper and the help documentation for too-many-cells, but I don't understand how 'diversity' is used to measure the accuracy of clustering. I would appreciate it if you could tell me something about it, or point me to where I can learn more.

GregorySchwartz commented 4 years ago

Yes, diversity can be used for comparison. Diversity of order 1, for instance, is a transformation of Shannon entropy that translates it into a more biological context. I recommend reading https://onlinelibrary.wiley.com/doi/10.1111/j.2006.0030-1299.14714.x to understand the important distinction. We used more traditional comparison measures in the paper to make the results more familiar. If you want to use another measure, however, you would have to calculate it yourself from the clustering output. Keep in mind that too-many-cells is more about separating than stopping, as the visualization can guide your chosen cluster size.
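To make the relationship concrete: diversity of order 1 is the exponential of Shannon entropy, so it reads as an "effective number" of labels. The sketch below is a minimal illustration using the standard Hill-number formula, not the too-many-cells implementation. The function name `diversity` and its `order` parameter are hypothetical.

```python
import math
from collections import Counter


def diversity(labels, order=1.0):
    """Effective number of distinct labels (Hill number) of the given order.

    For order 1 the general formula is undefined, so its limit,
    exp(Shannon entropy), is used instead.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    props = [n / total for n in counts.values()]
    if math.isclose(order, 1.0):
        shannon = -sum(p * math.log(p) for p in props)
        return math.exp(shannon)
    return sum(p ** order for p in props) ** (1.0 / (1.0 - order))


if __name__ == "__main__":
    # Hypothetical leaf nodes: the labels of the cells each leaf contains.
    pure_leaf = ["T cell"] * 4
    mixed_leaf = ["T cell", "B cell"] * 2
    print(diversity(pure_leaf))   # a single label gives a diversity of 1
    print(diversity(mixed_leaf))  # two equally common labels give about 2
```

A leaf whose label diversity is close to 1 is dominated by a single label, while a leaf with k equally common labels has a diversity of k. This is how diversity can stand in for purity-style measures when comparing clusterings.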

stat-hejia commented 4 years ago

I studied the literature you recommended and got a preliminary understanding of the relationship between diversity and entropy. Thanks a lot for your help!