stat-hejia closed this issue 3 years ago
The benchmarking is not included in the too-many-cells tool itself. However, you can always use the diversity entry point to get the diversity of labels for the leaf nodes and see whether they are close to 1. For the manuscript, the purity, entropy, and NMI were calculated post-clustering for all algorithms (to be consistent).
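For anyone who wants to compute these measures post-clustering themselves, here is a minimal sketch in Python. It is not the code used for the manuscript: the function names `cluster_purity` and `cluster_entropy` are hypothetical, and it assumes you have one cluster assignment and one true label per cell, in the same order. Purity is taken here as the fraction of cells carrying the majority label of their cluster, and entropy as the size-weighted mean Shannon entropy of the labels within each cluster; the exact definitions in the paper may differ (for example in normalization).

```python
from collections import Counter, defaultdict
import math


def _label_counts_per_cluster(clusters, labels):
    """Group the true labels by cluster: {cluster: Counter({label: n})}."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        by_cluster[cluster][label] += 1
    return by_cluster


def cluster_purity(clusters, labels):
    """Fraction of cells whose label is the majority label of their cluster."""
    by_cluster = _label_counts_per_cluster(clusters, labels)
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(labels)


def cluster_entropy(clusters, labels):
    """Size-weighted mean Shannon entropy (in nats) of labels per cluster."""
    by_cluster = _label_counts_per_cluster(clusters, labels)
    n_cells = len(labels)
    total = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        h = -sum((k / size) * math.log(k / size) for k in counts.values())
        total += (size / n_cells) * h
    return total


# Toy data (made up for illustration): six cells, two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(cluster_purity(clusters, labels))
print(cluster_entropy(clusters, labels))
```

A purity close to 1 and an entropy close to 0 both indicate clusters that are dominated by a single true label.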
It seems that 'diversity' quantifies the effective number of cell states within a population and can also be used to compare the accuracy of clustering algorithms. Is my understanding right? I read your paper and the help documentation for too-many-cells, but I don't understand how 'diversity' is used to measure the accuracy of clustering. I would be grateful if you could tell me something about it, or point me to where I can learn more.
Yes, diversity can be used to compare clustering algorithms. Diversity of order 1, for instance, is a transformation of Shannon entropy that translates it into a more biological context. I recommend reading https://onlinelibrary.wiley.com/doi/10.1111/j.2006.0030-1299.14714.x to understand the important distinction. We used more traditional comparison measures in the paper to make it more familiar. If you want to use another measure, however, you would have to calculate it yourself from the clustering output, although too-many-cells is more about separating than stopping, as the visualization can guide your chosen cluster size.
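To make the link between diversity and Shannon entropy concrete, here is a minimal Python sketch of diversity of order q computed from a list of labels. It is an illustration only, not the too-many-cells implementation, and the function name `diversity` is hypothetical. For q = 1 the diversity is the exponential of the Shannon entropy, which can be read as the effective number of equally common labels: a leaf containing a single label has diversity 1, and a leaf with k equally frequent labels has diversity k.

```python
import math
from collections import Counter


def diversity(labels, q=1.0):
    """Diversity of order q (effective number of labels) for a list of labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    proportions = [c / n for c in counts.values()]
    if math.isclose(q, 1.0):
        # Order 1 is the limit q -> 1: the exponential of Shannon entropy.
        shannon = -sum(p * math.log(p) for p in proportions)
        return math.exp(shannon)
    return sum(p ** q for p in proportions) ** (1.0 / (1.0 - q))


# Toy leaves (made up for illustration).
pure_leaf = ["T cell"] * 4
mixed_leaf = ["T cell", "B cell", "NK cell", "Monocyte"]
print(diversity(pure_leaf))
print(diversity(mixed_leaf))
```

Applied to the true labels inside each leaf of a clustering, values near 1 mean the leaf is nearly pure, which is why diversity can serve as a comparison measure alongside purity and entropy.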
I studied the literature you recommended and got a preliminary understanding of the relationship between diversity and entropy. Thanks a lot for your help!
Dear Gregory: Thanks for building such a nice tool. I want to compare the accuracy of clustering algorithms and measure how closely the clusters match the true labels, which is called 'cluster purity' in your article. I am a bit confused about this part, and I cannot find this parameter in the help page of too-many-cells. I tried to run the source code, but I have not learned the programming language used in the 'purity' part, so it is difficult for me at present. I would like to ask whether the too-many-cells pipeline has a purity parameter that I may have overlooked, or whether you have R code for 'purity' that you could provide for reference. Thanks for your time!