MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

How to merge topics automatically after getting the potential hierarchy of all topics #1896

Open syGOAT opened 4 months ago

syGOAT commented 4 months ago

I have read this part of the official documentation: https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#visualizations:~:text=Merge%20topics,-%C2%B6 It is really a great way to create the potential hierarchical structure of topics and to merge topics! I have a further question: how can I merge topics automatically after getting the potential hierarchy of all topics? For example, when I ran:

from scipy.cluster import hierarchy as sch

linkage_function = lambda x: sch.linkage(x, 'single', optimal_ordering=True)
hierarchical_topics = model.hierarchical_topics(abstracts, linkage_function=linkage_function)
model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

The figure is below: (image) I think the horizontal axis is the 'distance' between topics. The merge method in the official documentation requires specifying the indices of the topics. How can I merge topics automatically if the 'distance' between two topics is less than a certain number, such as 0.3? My model is defined like this:

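from umap import UMAP
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

umap_model = UMAP(n_neighbors=20, n_components=15, min_dist=0.0, metric='cosine', random_state=42)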
cluster_model = KMeans(n_clusters=100, random_state=42)  
vectorizer_model = CountVectorizer(stop_words="english")

seed_words = [
    'materials','physical', ...
]
ctfidf_model = ClassTfidfTransformer(
    seed_words=seed_words, 
    seed_multiplier=5
)
model = BERTopic(embedding_model='./multilingual-e5-large-instruct', 
                 umap_model=umap_model,
                 min_topic_size=50,
                 ctfidf_model=ctfidf_model,        
                 hdbscan_model=cluster_model,
                 vectorizer_model=vectorizer_model, 
)

Version:

Name: bertopic
Version: 0.16.0
Summary: BERTopic performs topic Modeling with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/BERTopic
Author: Maarten P. Grootendorst
Author-email: maartengrootendorst@gmail.com

MaartenGr commented 4 months ago

When you run .hierarchical_topics, you get a dataframe that specifies the potential merging of topics and their distances, namely the hierarchical_topics variable in your code.

You can use this to select a threshold that you think works best for your use case. Based on the filtered dataframe, you can then extract the sets of topics that should be merged and merge them with .merge_topics.
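
The filtering step described above could look roughly like the sketch below. It is not an official BERTopic helper: `topics_to_merge_below` is a hypothetical function name, and it assumes each row of the hierarchy gives the list of topics under a potential merge together with its distance (in the dataframe these are assumed to be the "Topics" and "Distance" columns). For every merge below the threshold it keeps only the largest group, so nested smaller groups are not merged twice.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def topics_to_merge_below(
    rows: Iterable[Tuple[Sequence[int], float]],
    threshold: float,
) -> List[List[int]]:
    """Return the largest groups of topics whose merge distance is below threshold.

    rows: (topics, distance) pairs, one per potential merge in the hierarchy.
    """
    # Keep only the merges that happen below the threshold.
    candidates = [set(topics) for topics, distance in rows if distance < threshold]
    # Visit larger groups first so that nested smaller groups can be skipped.
    candidates.sort(key=len, reverse=True)

    groups: List[Set[int]] = []
    for candidate in candidates:
        # Skip a group that is already contained in a larger kept group.
        if not any(candidate <= kept for kept in groups):
            groups.append(candidate)
    return [sorted(group) for group in groups]


# Toy hierarchy (made-up numbers): topics 0 and 1 merge at 0.1, topic 2 joins
# them at 0.2, topics 3 and 4 merge at 0.25, and everything merges at 0.9.
example_rows = [
    ([0, 1], 0.1),
    ([0, 1, 2], 0.2),
    ([3, 4], 0.25),
    ([0, 1, 2, 3, 4], 0.9),
]
merge_groups = topics_to_merge_below(example_rows, threshold=0.3)
print(merge_groups)

# With the objects from this thread (column names are an assumption):
# rows = zip(hierarchical_topics["Topics"], hierarchical_topics["Distance"])
# model.merge_topics(abstracts, topics_to_merge_below(rows, threshold=0.3))
```

Checking the resulting groups before calling .merge_topics is probably worthwhile, since merging changes the fitted model and a threshold such as 0.3 may behave differently depending on the linkage function used.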