FahriBilici opened 1 year ago
Most likely, it is a result of using `nr_topics`. I believe it is iteratively aggregating topics. Generally, I would advise skipping that parameter and controlling the number of topics with `min_topic_size` instead.
I was using "auto"; otherwise it was almost 1,000 different topics. How can I solve this if I don't use `nr_topics`?
The `min_topic_size` describes the minimum size a topic can take. If you increase this value, fewer topics can be created; if you decrease it, more and smaller topics will be created. In other words, set `min_topic_size` to a large value, like 100, and test it out without OpenAI to see whether the number of topics you get makes sense for your use case, then adjust `min_topic_size` accordingly.
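For example, a minimal sketch of that workflow (the value 100 is just an illustrative starting point, and `docs` is assumed to be your list of documents):

```python
from bertopic import BERTopic

# Let min_topic_size control the granularity instead of nr_topics;
# a larger value leads to fewer, larger topics. 100 is illustrative.
topic_model = BERTopic(min_topic_size=100, verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Check whether the resulting number of topics is reasonable
# before adding any (OpenAI) representation model on top.
print(len(topic_model.get_topic_info()) - 1)  # exclude the outlier topic
```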
By the way, I just released a page and a Google Colab notebook where you will find a bunch of best practices for using BERTopic. It contains guidelines that generally result in great performance and usability.
If I use `nr_topics="auto"`, it takes around 15 minutes to generate topics, but once I add a representation model it doesn't stop. I will check `min_topic_size` and your best practices page.
I've got the same issue. I am copying my model specifics here. The thing is, the dimensionality reduction and clustering steps finish in less than 15 seconds, but the representation model, which is a combination of MMR and KeyBERTInspired, is a nightmare now:
```python
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance, KeyBERTInspired

random_state = 14

# UMAP model used only to create the 2D embeddings for visualization
um2 = UMAP(n_neighbors=7,
           n_components=2,
           metric='cosine',
           low_memory=False,
           angular_rp_forest=True,
           random_state=random_state)
reduced_embeddings = um2.fit_transform(embeddings)

# UMAP model used inside BERTopic for dimensionality reduction
umap_model_tr = UMAP(n_neighbors=7,
                     n_components=50,  # 15 was good
                     metric='cosine',
                     low_memory=False,
                     angular_rp_forest=True,
                     random_state=random_state)

# Set prediction_data to True as it is needed to predict new points later on
hdbscan_model_tr = HDBSCAN(
    # min_cluster_size=20,
    # max_cluster_size=100,
    min_samples=1,
    metric='manhattan',
    cluster_selection_method='eom',
    prediction_data=True)

topic_model = BERTopic(
    embedding_model=sentence_model,
    verbose=True,
    n_gram_range=(1, 2),
    ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True),
    vectorizer_model=TfidfVectorizer(stop_words=SWV, ngram_range=(1, 2), vocabulary=vocabulary, min_df=2),
    umap_model=umap_model_tr,
    hdbscan_model=hdbscan_model_tr,
    calculate_probabilities=True,
    representation_model=[MaximalMarginalRelevance(diversity=0.1), KeyBERTInspired()],
)

topics, probs = topic_model.fit_transform(docs, embeddings)

# Number of documents assigned to the outlier topic (-1)
topic_info = topic_model.get_topic_info()
n_outlier = topic_info.loc[topic_info["Topic"] == -1, "Count"].iloc[0]
print(f"Number of Outliers: {n_outlier}")

topic_model.visualize_documents(docs,
                                topics=topic_model.topics_,
                                embeddings=embeddings,
                                reduced_embeddings=reduced_embeddings,
                                sample=1,
                                hide_annotations=True,
                                hide_document_hover=False,
                                custom_labels=False,
                                title="<b>Documents and Topics</b>",
                                width=1500,
                                height=750)
```
I have access to a server with 80 cores, but I don't know how I can parallelize the representation for each topic so that it takes less time.
@mohammadm1985
> The thing is, the dimensionality reduction and clustering steps finish in less than 15 seconds, but the representation model, which is a combination of MMR and KeyBERTInspired, is a nightmare now.
What do you mean by "nightmare"? Is it that it takes too long now? If so, how long?
Could you try it without the additional topic representations? Also, by setting `min_samples=1`, you are likely generating a very large number of topics, which might explain why it slows down for you. How many topics do you create? Lastly, what exactly is in `sentence_model`?
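For instance, a rough sketch with stricter clustering settings (the values here are purely illustrative and would need tuning on your data):

```python
from hdbscan import HDBSCAN

# Illustrative values only: a larger min_samples and an explicit
# min_cluster_size typically cut down the number of micro-clusters.
hdbscan_model_tr = HDBSCAN(min_cluster_size=20,
                           min_samples=10,
                           metric='manhattan',
                           cluster_selection_method='eom',
                           prediction_data=True)
```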
@MaartenGr I've been running the model for 2 hours now and I don't have any results yet. I am using a Jupyter notebook on JupyterHub and we are assigned about 80 cores, though that does not matter as the module is not coded for parallel analysis.
I've not seen the results for this model yet. The same model with half the number of documents and a min_cluster_size of 20 took 10 minutes to run. The number of docs is not high, just 4K docs. I am using all-mpnet-base-v2 for embeddings as I found it more reliable for my use case. With min_samples greater than one I get too much noise, and reducing the noise results in distorted distributions. I'll try higher numbers and update you regarding the speed.
The run just finished. Yeah, I got 291 topics, which does not seem to be good. I'll work on min_samples. Also, does it make sense for 291 topics to take 2 hours to generate? Am I on the right track and just need to optimize it?
Also, is there any way that I can run KeyBERTInspired or any other representation model after I do the clustering? Like manually change it to MMR and see the results? I saw you can change the vectorizer, but how about the representation model itself?
> The run just finished. Yeah, I got 291 topics, which does not seem to be good. I'll work on min_samples. Also, does it make sense for 291 topics to take 2 hours to generate? Am I on the right track and just need to optimize it?
With 4k documents, it generally should not take that long. Do you have a GPU enabled? Both MMR and KeyBERTInspired generate word embeddings and as such will need a GPU in order to quickly generate embeddings with sentence-transformers. You could also try to increase `min_df` if there happens to be too large a vocabulary.
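As a rough sketch of both suggestions (assuming torch and sentence-transformers are installed; the `min_df` value is only an example):

```python
import torch
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# If this prints False, MMR/KeyBERTInspired will embed all candidate
# keywords on the CPU, which is usually the bottleneck here.
print(torch.cuda.is_available())

# Place the embedding model on the GPU explicitly when one is available.
sentence_model = SentenceTransformer(
    "all-mpnet-base-v2",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# A stricter min_df shrinks the vocabulary that has to be embedded.
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=10)
```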
> Also, is there any way that I can run KeyBERTInspired or any other representation model after I do the clustering? Like manually change it to MMR and see the results? I saw you can change the vectorizer, but how about the representation model itself?
Yes, you can use `.update_topics` for that. You can update the topic representations without needing to rerun the entire topic model.
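A minimal sketch, assuming the model above has already been fitted on `docs` (the diversity value is illustrative):

```python
from bertopic.representation import MaximalMarginalRelevance

# Recompute only the topic representations; the clustering itself is untouched.
topic_model.update_topics(
    docs,
    representation_model=MaximalMarginalRelevance(diversity=0.3),
)
```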
@MaartenGr
I use a tedious approach to filter my vocabulary with gensim word2vec dictionaries to limit it to meaningful words, and I use a KeyBERT-inspired approach to build the vocabulary and the n-grams associated with it. That reduced my vocab size to 8,000, which is not a lot. I have a thought, though: I am using `keyphrase_ngram_range=(1, 2)` and also `ngram_range=(1, 2)` in my vectorizer. This makes me think maybe it is considering the combinations of the phrases in my vocabulary...
Unfortunately, I don't have access to GPU computational resources right now. Instead, the server provides 80 cores to parallelize processes. I think the representation model can work independently on each topic-document pair and can be parallelized. Is that something you might consider adding to the package?
> Unfortunately, I don't have access to GPU computational resources right now.
I believe that is the main issue here. Both MMR and KeyBERTInspired create embeddings from your vocabulary and specific documents, which is sped up with a GPU. Generally, it is not advised to use embedding models without a GPU.
> I think the representation model can work independently on each topic-document pair and can be parallelized. Is that something you might consider adding to the package?
Seeing as documents/keywords are embedded with models that generally use torch, parallelization can be an issue. These are quite complex to parallelize, especially across the many backends that can be found in BERTopic.
I'm currently working on an example, but it's taking longer than expected. I've noticed that it's using up some of my OpenAI credit, even though it's not finished yet. Could you help me troubleshoot this issue? Additionally, I was wondering if there's a way to print logs to track the progress of each step in BERTopic.