Open shaistaDev7 opened 1 year ago
Optimizing for the Rand score can be tricky, but it should be possible by trying out different algorithms and parameters. Note that you can still use HDBSCAN to perform the clustering and then use .reduce_outliers to map all outliers to topics. That way, you get the best of both worlds. Moreover, if you are setting n_components=50, quite a lot of features are still generated, which tends to bring in the curse of dimensionality. k-Means does not work well with that many features out of the box, so a smaller n_components might be beneficial. Other than that, you could also use cosine distance in k-Means instead to account for it.
Thank you @Maarten for your kind reply. Is it possible to reduce all outliers? I don't know how to set the HDBSCAN parameters for my case. I need five topics because I have five categories. I set these HDBSCAN parameters:
from hdbscan import HDBSCAN
cluster_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
and set nr_topics=6 in BERTopic. After this I get
Please let me know how I can reduce all outliers to topics. I have also tried a smaller n_components and different clustering algorithms such as Birch and agglomerative clustering, combined with different dimensionality reduction algorithms such as PCA and SVD, but the RI score still did not improve. I have no idea how to change the metric to cosine in k-Means clustering and pass it to the BERTopic model. If possible, please share references for these questions. Thank you!
To reduce all outliers, you can use .reduce_outliers. The documentation should have a full example of how to use it.
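As a rough illustration of the step described above, here is a minimal sketch. It assumes a BERTopic model that has already been fitted with HDBSCAN, and the helper name reduce_all_outliers is made up for this example. It relies on the model's reduce_outliers and update_topics methods; the "c-tf-idf" strategy is just one of the available options, not a recommendation from this thread.

```python
def reduce_all_outliers(topic_model, docs, topics, strategy="c-tf-idf"):
    """Reassign outlier documents (topic -1) and refresh topic representations.

    `topic_model` is assumed to be a fitted BERTopic instance; `topics` is the
    list returned by `fit_transform`. The helper name is hypothetical.
    """
    # Map each document labeled -1 to a non-outlier topic.
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)

    # Count what is still labeled as an outlier, so the result can be checked.
    remaining = sum(1 for t in new_topics if t == -1)

    # Update the topic representations so they reflect the new assignments.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics, remaining


# Hypothetical usage, assuming `topic_model` and `docs` already exist:
# topics, probs = topic_model.fit_transform(docs)
# new_topics, remaining = reduce_all_outliers(topic_model, docs, topics)
```

If remaining is not zero, trying another strategy or checking the threshold argument of reduce_outliers may help.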
I believe you cannot change the distance metric in k-Means, but if you normalize the input, the Euclidean distance should behave as a relatively similar measure. Having said that, I would advise keeping n_components small and using HDBSCAN instead, since those generally work best.
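The normalization idea can be sketched with scikit-learn alone. For unit-length vectors, the squared Euclidean distance equals 2 - 2·cos, so k-Means on L2-normalized input ranks points in a way that closely follows cosine similarity. The function name cosine_like_kmeans and the parameter values are assumptions for illustration; how to wire the normalization step into BERTopic is not covered here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def cosine_like_kmeans(embeddings, n_clusters=5, random_state=42):
    """Cluster L2-normalized vectors with k-Means (hypothetical helper).

    After normalization every row has unit length, so Euclidean distance
    between rows depends only on the angle between them.
    """
    unit_vectors = normalize(np.asarray(embeddings, dtype=float), norm="l2")
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return model.fit_predict(unit_vectors)


# Hypothetical usage on reduced embeddings (for example UMAP output):
# labels = cosine_like_kmeans(reduced_embeddings, n_clusters=5)
```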
Hi Maarten! Thank you for this awesome library that makes topic modelling so much easier. I am really impressed by it; it is the best library I have seen compared to other topic modelling techniques. My focus is on evaluation metrics (cluster Rand index). My documents fall into five categories, and each document is assigned to one category. Details of the documents and categories are:
Because my focus is on improving the Rand score, I selected k-Means clustering with n_clusters=5 over HDBSCAN to avoid the creation of outliers. After running the experiment, I obtained the following counts for each topic or cluster:
Here's my code
I tried different clustering techniques and UMAP parameters. I want to improve the clustering for the Rand score. Maybe it's a stupid question, but I'm a beginner and want to learn more. Any suggestions, please? Thank you again.
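For comparing experiments like the ones described above, a small scoring helper can make each run easier to judge. This is a sketch using scikit-learn's rand_score and adjusted_rand_score; the function name evaluate_clustering is an assumption, and true_labels stands for the five known categories, one label per document.

```python
from sklearn.metrics import adjusted_rand_score, rand_score


def evaluate_clustering(true_labels, predicted_topics):
    """Return the Rand index and adjusted Rand index for one clustering run.

    Both metrics ignore the actual label values and only compare which
    documents are grouped together.
    """
    return {
        "rand": rand_score(true_labels, predicted_topics),
        "adjusted_rand": adjusted_rand_score(true_labels, predicted_topics),
    }


# Hypothetical usage, assuming `categories` holds the known category of each document:
# scores = evaluate_clustering(categories, topics)
```

The adjusted variant corrects for chance agreement, which can make comparisons between runs with different cluster sizes more informative.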