jihyunmd opened this issue 1 year ago (status: Open)
I think this is a result of the embedding model and not so much the vectorizer. What is likely happening is that there are clusters of documents that are semantically difficult to combine, and the embedding model has found those documents to be similar based on their character-level n-grams. It would be worthwhile to explore the documents in those topics to see whether that is indeed the case. If it is, then it might be worthwhile to try out an embedding technique that is more accurate. For example, the MTEB Leaderboard is a nice place to start exploring embedding techniques for clustering purposes.
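For example, a minimal sketch of how the documents in one of those topics could be inspected, assuming `model` is the fitted BERTopic instance and 42 is a placeholder topic id:

```python
# Look at a few representative documents of a suspect topic to check whether
# they are only similar on a character level rather than semantically.
for doc in model.get_representative_docs(topic=42):
    print(doc[:300])
```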
Thanks so much! If I understand you correctly, then for NLP on medical notes I should use an embedding model suited to that domain instead of all-MiniLM-L6-v2?
And I would like to ask one more question: I got `info = model.get_topic_info()` and the topics are not ordered by the number of documents. Is it because I've used reduce_outliers?
It depends on the documents themselves, but trying out models and seeing which works best for the data that you have is preferred.
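As a rough sketch of what that could look like, assuming `docs` holds your documents; the model names are only illustrative examples, not specific recommendations:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Candidate embedding models to compare (names are illustrative only)
candidates = ["all-MiniLM-L6-v2", "all-mpnet-base-v2", "BAAI/bge-small-en-v1.5"]

for name in candidates:
    # Pre-compute embeddings once per model so they can be reused later
    embeddings = SentenceTransformer(name).encode(docs, show_progress_bar=False)
    topic_model = BERTopic(embedding_model=name)
    topics, _ = topic_model.fit_transform(docs, embeddings)
    # Inspect the resulting topics, or plug in whatever evaluation fits your use case
    print(name)
    print(topic_model.get_topic_info().head())
```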
> I got `info = model.get_topic_info()` and the topics are not ordered by the number of documents. Is it because I've used reduce_outliers?
Yes, that is indeed the case if additional updates were done.
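If you want the overview sorted by size again, a small sketch (assuming `model` is the fitted BERTopic instance):

```python
# get_topic_info() returns a DataFrame with a "Count" column; after updates
# such as reduce_outliers the rows may no longer be sorted by it.
info = model.get_topic_info()
print(info.sort_values("Count", ascending=False).head())
```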
Hi @MaartenGr, I want to thank you for developing BERTopic, as it has been instrumental in the smooth progress of our project. I am also truly grateful for such an active and helpful discussion and for the solutions you provide.
I am dealing with 2 million documents and below is the main code:
```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# Pre-compute the document embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Dimensionality reduction and term vectorization
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42, low_memory=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)

model = BERTopic(vectorizer_model=vectorizer_model, umap_model=umap_model,
                 calculate_probabilities=False, nr_topics=500)

# Note: the pre-computed embeddings are not passed to fit_transform here,
# so the documents are embedded again inside BERTopic
topics, probabilities = model.fit_transform(docs)

# Reassign outliers and update the topic representations
red_topics = model.reduce_outliers(docs, topics, strategy='c-tf-idf')
model.update_topics(docs, topics=red_topics, vectorizer_model=vectorizer_model)
```

And I found that some of the topics are just groups of topic words starting with the same letters, for example (made up to illustrate) 'xabvd, xser, xwesd, xrfde', 'jfrsd, jresa, jliok, joiun', 'dau', 'dan', 'daud', or ending with the same letters, such as 'calcium', 'valium', 'xxxxium'.
Does this come from CountVectorizer? And is there any way to fix this? Thank you so much!
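One way to check this, sketched under the assumption that `model`, `docs`, and `red_topics` are the objects from the code above: update only the topic representations with a different CountVectorizer and see whether the same character-level groupings remain; the parameter values below are placeholders rather than recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Swap only the topic-word representation; the underlying clusters stay the same.
# If the odd groupings persist, the cause is more likely the embeddings/clustering
# than the vectorizer.
alt_vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words="english",
                                 min_df=10, max_df=0.5)
model.update_topics(docs, topics=red_topics, vectorizer_model=alt_vectorizer)
print(model.get_topic(0))  # inspect the new top words of a suspicious topic
```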
Respectfully, Ji Hyun