About the true labels for calculating the F1 scores

li000678 commented 3 years ago

Hi, I am interested in testing PARC on more datasets. What is the source of the true labels? Do you use the labels provided by the original papers? For example, the labels in "clusters.csv" from the clustering analysis of the PBMC dataset provided by 10xGenomics (pbmc_68k)?

Thank you! Yijia

ShobiStassen commented 3 years ago

Hi Yijia,

The PBMC labels are based on the annotations made by the authors of the original paper. You can check out their GitHub page, which provides the R code for how they annotate the mixed PBMCs based on pure PBMC populations. This is how we got the annotations provided in PARC (the annotations can be downloaded from the link in the PARC readme, or you can run the R code by Zheng et al.). Hope that helps.

li000678 commented 3 years ago

Hi Shobi, thank you. I looked into how Zheng et al. did the clustering: they first use k-means to generate 9 clusters and then divide cluster No. 9 into two clusters. I personally think the predefined clusters produced by k-means may not be accurate (this is apparent when comparing the "ground truth" with the results from PARC; PARC seems to do a better job). What do you think? I am planning to validate this on more datasets, though the labels of many other datasets are also annotations based on clusters identified by clustering algorithms.
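
For anyone who wants to compare a clustering against reference annotations like these, here is a minimal sketch of one possible scoring scheme: assign each cluster the reference label that is most common among its cells, then compute an F1 score per reference label. This is an assumption for illustration and not necessarily how PARC computes its F1 scores. The function names and the toy labels are hypothetical and are not taken from the PBMC data.

```python
from collections import Counter, defaultdict


def majority_map(true_labels, cluster_ids):
    """Map each cluster id to the reference label most common within it."""
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


def per_label_f1(true_labels, cluster_ids):
    """F1 per reference label after relabeling clusters by majority vote."""
    mapping = majority_map(true_labels, cluster_ids)
    predicted = [mapping[c] for c in cluster_ids]
    pairs = list(zip(true_labels, predicted))
    scores = {}
    for label in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A label that no cluster maps to, and that is never recovered, scores 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy example (made-up labels): six cells, three reference types, two clusters.
true_labels = ["B", "B", "B", "T", "T", "NK"]
cluster_ids = [0, 0, 1, 1, 1, 1]

scores = per_label_f1(true_labels, cluster_ids)
mean_f1 = sum(scores.values()) / len(scores)
print(scores, mean_f1)
```

Note that a score like this only measures agreement with the reference annotation. If the reference labels themselves come from a clustering step, a low F1 does not by itself show which of the two partitions is closer to the underlying biology.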

ShobiStassen / PARC

About the true labels for calculating the F1 scores #15