MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Some topics have topic words that start with the same letters or end with the same letters #1385


jihyunmd commented 1 year ago

Hi @MaartenGr, I want to thank you for developing BERTopic; it has been instrumental in the smooth progress of our project, and I am truly grateful for the active discussion and solutions you provide here.

I am dealing with 2 million documents and below is the main code:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# Embed the documents
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Dimensionality reduction
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42, low_memory=True)

# Topic representation and topic model
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)
model = BERTopic(vectorizer_model=vectorizer_model, umap_model=umap_model,
                 calculate_probabilities=False, nr_topics=500)

# Fit, reduce outliers, and update the topic representations
topics, probabilities = model.fit_transform(docs)
red_topics = model.reduce_outliers(docs, topics, strategy='c-tf-idf')
model.update_topics(docs, topics=red_topics, vectorizer_model=vectorizer_model)
```

I found that some of the topics are just groups of topic words that start with the same letters, such as (made-up examples) 'xabvd, xser, xwesd, xrfde' or 'jfrsd, jresa, jliok, joiun' or 'dau', 'dan', 'daud', or that end with the same letters, such as 'calcium', 'valium', 'xxxxium'.

Does this come from CountVectorizer? And is there any way to fix this? Thank you so much!

Respectfully, Ji Hyun

MaartenGr commented 1 year ago

I think this is a result of the embedding model and not so much the vectorizer. What is likely happening is that there are clusters of documents that are semantically difficult to combine, and the embedding model has found those documents to be similar based on their character-level n-grams. It would be worthwhile to explore the documents in those topics to see whether that is indeed the case. If it is, then it might be worth trying out an embedding model that is more accurate. For example, the MTEB Leaderboard is a nice place to start exploring embedding techniques for clustering purposes.
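For example, a minimal sketch for eyeballing one of those topics (this assumes a recent BERTopic version where `get_document_info` is available; the topic id below is just a placeholder):

```python
# Inspect the raw documents assigned to a suspicious topic to check whether
# they only share character-level patterns (same prefixes/suffixes).
doc_info = model.get_document_info(docs)               # one row per document
suspect_topic = 42                                     # placeholder: use one of your odd topic ids
subset = doc_info[doc_info["Topic"] == suspect_topic]

# Print a handful of raw documents from that topic
for text in subset["Document"].head(10):
    print(text[:200], "\n---")

# BERTopic also keeps a few representative documents per topic
print(model.get_representative_docs(suspect_topic))
```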

jihyunmd commented 1 year ago

Thanks so much! If I understand you correctly, for NLP on medical notes I should use a more appropriate embedding model instead of all-MiniLM-L6-v2?

And I would like to ask one more question: I ran `info = model.get_topic_info()` and the topics are not ordered by the number of documents; is that because I've used reduce_outliers?

MaartenGr commented 1 year ago

It depends on the documents themselves, but trying out several models and choosing the one that works best for your data is preferred.
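For instance, a sketch of swapping in a different embedding model (the model name below is only an illustration; any SentenceTransformer-compatible model, e.g. one from the MTEB leaderboard or a biomedical one, can be dropped in):

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Placeholder model name; replace with a model suited to your (medical) domain
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

model = BERTopic(embedding_model=embedding_model,
                 vectorizer_model=vectorizer_model,
                 umap_model=umap_model,
                 calculate_probabilities=False,
                 nr_topics=500)

# Passing the precomputed embeddings avoids embedding the 2M documents twice
topics, probs = model.fit_transform(docs, embeddings)
```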

> I ran `info = model.get_topic_info()` and the topics are not ordered by the number of documents; is that because I've used reduce_outliers?

Yes, that is indeed the case if additional updates were done.
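If you want the overview ordered by topic size regardless of the topic ids, you can sort the DataFrame returned by `get_topic_info()` yourself (a small sketch; `model` is the fitted topic model from above):

```python
# get_topic_info() returns a pandas DataFrame with a "Count" column,
# so it can simply be re-sorted by topic size after the updates.
info = model.get_topic_info()
info_sorted = info.sort_values("Count", ascending=False).reset_index(drop=True)
print(info_sorted.head(10))
```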