Open schoobani opened 1 year ago
Consensus Clustering is an important aspect of using any clustering method in production. I don't want my users to face a new set of clusters every time the pipeline is deployed. You have suggested fixing the UMAP random seed to stabilise the clusters. But in practice, how can we trust a single randomly picked seed, when running with another seed generates a new set of clusters and assigns documents to different ones? This behaviour escalates further when other UMAP and HDBSCAN parameters are changed.

Since there are no alternative clustering measures for unsupervised topic modeling, shouldn't there be a Consensus Clustering framework, where documents that always fall together are assigned to a final cluster?
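(For context, the suggestion being referenced is to pin the seed of the stochastic dimensionality-reduction step. A minimal sketch, assuming the standard `bertopic`, `umap-learn`, and `hdbscan` packages, with illustrative parameter values:)

```python
# Sketch: fix UMAP's random_state so repeated fits on the same data
# reproduce the same reduced space, and therefore the same clusters.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)  # pinned seed
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                        cluster_selection_method="eom")

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
# topics, probs = topic_model.fit_transform(docs)  # docs: list of strings
```

Note that pinning the seed buys reproducibility across runs, not robustness across seeds, which is exactly the concern here.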
> But in practice, how can we trust a single randomly picked seed, when running with another seed generates a new set of clusters and assigns documents to different ones?

This applies not just to HDBSCAN but to essentially any algorithm that has a `random_state` parameter. Even k-Means has one, for initializing the clusters, and it can likewise generate different results.
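To make that concrete, here is a small self-contained demo (scikit-learn only; the dataset and parameters are made up for illustration) that quantifies how much two k-Means runs with different seeds disagree:

```python
# Demo: same data, same algorithm, two different seeds.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Overlapping blobs make the optimisation landscape non-trivial.
X, _ = make_blobs(n_samples=500, centers=8, cluster_std=3.0, random_state=0)

labels_a = KMeans(n_clusters=8, n_init=1, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=8, n_init=1, random_state=2).fit_predict(X)

# The adjusted Rand index is invariant to label permutation: 1.0 means the
# two partitions are identical, lower values mean documents really moved
# between clusters rather than clusters merely being renamed.
print(adjusted_rand_score(labels_a, labels_b))
```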
> This behaviour escalates further when other UMAP and HDBSCAN parameters are changed.
And it becomes even more difficult once you also implement evaluation metrics for the resulting topics. Clustering is just one component of the entire pipeline, and not the only piece of the puzzle that can be evaluated.
> Since there are no alternative clustering measures for unsupervised topic modeling, shouldn't there be a Consensus Clustering framework, where documents that always fall together are assigned to a final cluster?
Yes and no. There are clustering metrics that can be used for unsupervised topic modeling, such as the silhouette score, which gives you an idea of the "quality" of the clusters. It is a very rough proxy, but it still provides information. However, it does not tell you anything directly about the quality of the resulting topics: topic coherence and diversity, for example, are not considered.
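As a rough illustration, such a silhouette check on the reduced embeddings could look like this (a sketch; the helper name and the choice to drop HDBSCAN's outlier label `-1` are mine):

```python
# Sketch: a coarse cluster-quality score. It measures geometric separation
# only; it says nothing about topic coherence or diversity.
import numpy as np
from sklearn.metrics import silhouette_score

def clustering_silhouette(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Silhouette over clustered points, ignoring HDBSCAN outliers (label -1)."""
    mask = labels != -1
    return silhouette_score(embeddings[mask], labels[mask])
```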
So whilst consensus clustering could definitely be interesting, it is just one piece of the evaluation puzzle. For example, even if we find a set of documents that fall together most of the time, that does not tell us whether they fall together *correctly* most of the time. Relying on consensus clustering alone to stabilise the model, without evaluating its other aspects, can therefore be detrimental to performance.
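For what it's worth, the co-association flavour of consensus clustering asked about above could be sketched as follows (a hypothetical helper, not part of BERTopic; the agreement matrix is O(n²) in memory, so this only scales to modest corpora):

```python
# Hypothetical sketch: cluster the same embeddings under several seeds,
# count how often each pair of documents lands in the same cluster, then
# cut the resulting agreement matrix with a final clustering pass.
import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.cluster import AgglomerativeClustering

def consensus_labels(embeddings, seeds=(0, 1, 2, 3, 4), n_clusters=10):
    n = len(embeddings)
    co_assoc = np.zeros((n, n))
    for seed in seeds:
        reduced = UMAP(n_components=5, random_state=seed).fit_transform(embeddings)
        labels = HDBSCAN(min_cluster_size=15).fit_predict(reduced)
        # Pairs of outliers (label -1) should not count as "falling together".
        same = (labels[:, None] == labels[None, :]) & (labels[:, None] != -1)
        co_assoc += same
    co_assoc /= len(seeds)
    np.fill_diagonal(co_assoc, 1.0)  # a document always agrees with itself
    # Documents that fall together all the time end up at distance 0.
    final = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return final.fit_predict(1.0 - co_assoc)
```

Even then, per the point above, agreement across seeds measures stability, not whether the stable groupings are topically correct.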