MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Improve Cluster Rand Score in BERTopic #1445

Open shaistaDev7 opened 1 year ago

shaistaDev7 commented 1 year ago

Hi Maarten! Thank you for this awesome library that makes topic modelling so much easier. I am really impressed with it; it is the best library I have found compared to other topic modelling techniques. My focus is on evaluation metrics (Cluster Rand Index). My documents fall into five categories, and each document is assigned to exactly one category. The document counts per category are:

  1. Business->198
  2. Entertainment->210
  3. Sports->200
  4. Health->200
  5. Weird->200

Because my focus is on improving the Rand score, I selected k-means clustering with n_clusters=5 instead of HDBSCAN to avoid the creation of outliers. After running the experiment, I obtained the following counts for each topic/cluster: [screenshot: topic counts per cluster]

Here's my code

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from umap import UMAP

# Pre-compute multilingual sentence embeddings
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
embeddings = model.encode(documents, show_progress_bar=True)

# Class-based TF-IDF with BM25 weighting
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)

# Dimensionality reduction before clustering
dim_model = UMAP(n_neighbors=100, n_components=50, min_dist=0.3,
                 metric='cosine', random_state=42)

# k-means instead of HDBSCAN so that no outliers are produced
cluster_model = KMeans(n_clusters=5)

# vectorizer_model and seed_topic_list are defined elsewhere
topic_model = BERTopic(language="urdu", low_memory=True, calculate_probabilities=True,
                       vectorizer_model=vectorizer_model, seed_topic_list=seed_topic_list,
                       top_n_words=10, hdbscan_model=cluster_model, umap_model=dim_model,
                       ctfidf_model=ctfidf_model, verbose=True)
topics, probs = topic_model.fit_transform(documents, embeddings)
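
For reference, here is a minimal sketch of how the Rand score can be computed with scikit-learn, assuming a list true_labels that holds the ground-truth category of each document:

from sklearn.metrics import adjusted_rand_score, rand_score

# true_labels: ground-truth category ids, one per document (assumed to exist)
print("Rand Index:", rand_score(true_labels, topics))
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, topics))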

I have tried different clustering techniques and UMAP parameters, and I want to improve the clustering to get a better Rand score. Maybe it's a stupid question, but I'm a beginner and want to learn more. Any suggestions, please? Thank you again!

MaartenGr commented 1 year ago

Optimizing for the Rand score can be tricky, but it should be possible by trying out different algorithms and parameters. Do note that you can still use HDBSCAN to perform the clustering and then use .reduce_outliers to map all outliers to topics. That way, you can have the best of both worlds. Moreover, if you are setting n_components=50, then there are still quite a lot of features generated, which tends to bring in the curse of dimensionality. k-Means does not work well with that many features out of the box, so setting a smaller size might be beneficial (see the sketch below). Other than that, you could also use cosine distance in k-Means instead to account for it.
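
As a rough sketch of what a smaller reduced space could look like (the values below are just starting points to tune, not recommendations):

from umap import UMAP
from sklearn.cluster import KMeans
from bertopic import BERTopic

# A much smaller reduced space; 5 components is BERTopic's default
dim_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                 metric='cosine', random_state=42)
cluster_model = KMeans(n_clusters=5, random_state=42)

topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model, verbose=True)
topics, probs = topic_model.fit_transform(documents, embeddings)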

shaistaDev7 commented 1 year ago

Thank you @MaartenGr for your kind reply. Is it possible to reduce all outliers? I don't know how to set the HDBSCAN parameters for my case. I need five topics because there are five categories. I set these HDBSCAN parameters:

from hdbscan import HDBSCAN
cluster_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

and set nr_topics=6 in BERTopic. After this I get: [screenshot: topic counts per cluster]. Please let me know how I can reduce all outliers to topics. I have also tried setting a smaller n_components and using different clustering algorithms such as Birch and agglomerative clustering, with different dimensionality reduction algorithms such as PCA and SVD, but the RI score still does not improve. I have no idea how to change the metric to cosine in k-means clustering and pass it to the BERTopic model. If possible, please share references for these questions. Thank you!

MaartenGr commented 1 year ago

To reduce all outliers, you can use .reduce_outliers. The documentation should have a full example of how to use it.
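
In short, it boils down to something like this (using the default strategy; the documentation lists the alternatives):

# Map all outliers (topic -1) to their closest topics
new_topics = topic_model.reduce_outliers(documents, topics)

# Re-calculate the topic representations with the new assignments
topic_model.update_topics(documents, topics=new_topics)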

You cannot change the distance metric in k-Means, I believe, but if you normalize the input then Euclidean distance becomes a relatively similar measure: for unit-length vectors, squared Euclidean distance is proportional to cosine distance. Having said that, I would advise keeping n_components small instead and using HDBSCAN, since those generally work best.
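
As a sketch of the normalization idea, here is a hypothetical wrapper that L2-normalizes the input before a regular k-Means fit; BERTopic only needs the cluster model to expose .fit and .labels_, with .predict used when transforming new documents:

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

class NormalizedKMeans:
    # For unit-length vectors, squared Euclidean distance is proportional
    # to cosine distance, so normalizing first approximates cosine k-Means
    def __init__(self, n_clusters=5, random_state=42):
        self.kmeans = KMeans(n_clusters=n_clusters, random_state=random_state)

    def fit(self, X):
        self.kmeans.fit(normalize(X))
        self.labels_ = self.kmeans.labels_
        return self

    def predict(self, X):
        return self.kmeans.predict(normalize(X))

# Pass it to BERTopic just like a regular KMeans instance
cluster_model = NormalizedKMeans(n_clusters=5)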