MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

openai topic representations are not finishing #1391

Open FahriBilici opened 1 year ago

FahriBilici commented 1 year ago

I'm currently working on an example, but it's taking longer than expected. I've noticed that it's using up some of my OpenAI credit, even though it hasn't finished yet. Could you help me troubleshoot this issue? Additionally, I was wondering if there's a way to print logs to track the progress of each step in BERTopic.

prompt="""
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]
Based on the information above, please give a description of this topic in the following format:
topic: <description>
"""

representation_model = OpenAI(model="gpt-3.5-turbo", delay_in_seconds=20,prompt=prompt, chat=True)

topic_model = BERTopic('english',nr_topics=5,representation_model=representation_model)
MaartenGr commented 1 year ago

Most likely, it is a result of using nr_topics. I believe it is iteratively aggregating topics. Generally, I would advise skipping over that parameter and controlling the number of topics with min_topic_size instead.

FahriBilici commented 1 year ago

I was using "auto"; otherwise I got almost 1,000 different topics. How can I solve this if I don't use nr_topics?

MaartenGr commented 1 year ago

The min_topic_size describes the minimum size a topic can take. If you increase this value, fewer topics will be created. If you decrease it, more and smaller topics will be created. In other words, set min_topic_size to a large value, like 100, and test it out without OpenAI to see whether the number of topics you get makes sense for your use case, then adjust min_topic_size accordingly.
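
For example, as a rough sketch (assuming docs is your list of documents; the value 100 is just a starting point, not a recommendation):

from bertopic import BERTopic

# Sketch: control the number of topics via min_topic_size instead of
# nr_topics, and check the topic count before attaching the OpenAI model.
topic_model = BERTopic(language="english", min_topic_size=100)
topics, probs = topic_model.fit_transform(docs)

# Inspect how many topics were created
print(topic_model.get_topic_info())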

By the way, I just released a page and a Google Colab notebook with a collection of best practices for using BERTopic. It contains guidelines that generally result in good performance and usability.

FahriBilici commented 1 year ago

If I use nr_topics="auto", it takes around 15 minutes to generate the topics, but once I add the representation model it never finishes. I will check min_topic_size and your best practices guide.

mohammadm1985 commented 1 year ago

I've got the same issue. I am copying my model specifics here. The thing is, the dimensionality reduction and clustering steps finish in less than 15 seconds, but the representation model, which is a combination of MMR and KeyBERTInspired, is a nightmare now:

from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance, KeyBERTInspired

random_state = 14

# 2D UMAP used only for visualization
um2 = UMAP(n_neighbors=7,
           n_components=2,
           metric='cosine',
           low_memory=False,
           angular_rp_forest=True,
           random_state=random_state)
reduced_embeddings = um2.fit_transform(embeddings)

# Higher-dimensional UMAP used inside BERTopic
umap_model_tr = UMAP(n_neighbors=7,
                     n_components=50,  # 15 was good
                     metric='cosine',
                     low_memory=False,
                     angular_rp_forest=True,
                     random_state=random_state)

# Set prediction_data to True as it is needed to predict new points later on
hdbscan_model_tr = HDBSCAN(# min_cluster_size=20,
                           # max_cluster_size=100,
                           min_samples=1,
                           metric='manhattan',
                           cluster_selection_method='eom',
                           prediction_data=True)

topic_model = BERTopic(embedding_model=sentence_model,
                       verbose=True,
                       n_gram_range=(1, 2),
                       ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True),
                       vectorizer_model=TfidfVectorizer(stop_words=SWV, ngram_range=(1, 2), vocabulary=vocabulary, min_df=2),
                       umap_model=umap_model_tr,
                       hdbscan_model=hdbscan_model_tr,
                       calculate_probabilities=True,
                       representation_model=[MaximalMarginalRelevance(diversity=0.1), KeyBERTInspired()]
                      )

topics, probs = topic_model.fit_transform(docs, embeddings)

# Count the documents assigned to the outlier topic (-1)
topic_info = topic_model.get_topic_info()
n_outlier = topic_info.loc[topic_info["Topic"] == -1, "Count"].iloc[0]
print(f"Number of Outliers: {n_outlier}")

topic_model.visualize_documents(docs,
                                topics=topic_model.topics_,
                                embeddings=embeddings,
                                reduced_embeddings=reduced_embeddings,
                                sample=1,
                                hide_annotations=True,
                                hide_document_hover=False,
                                custom_labels=False,
                                title="<b>Documents and Topics</b>",
                                width=1500,
                                height=750)

I have access to a server with 80 cores, but I don't know how I can parallelize the representation for each topic so it takes less time.

MaartenGr commented 1 year ago

@mohammadm1985

The thing is, the dimensionality reduction and clustering steps finish in less than 15 seconds, but the representation model, which is a combination of MMR and KeyBERTInspired, is a nightmare now:

What do you mean by "nightmare"? Is it that it takes too long now? If so, how long?

Could you try it without the additional topic representations? Also, by setting min_samples=1, you are likely generating a very large number of topics, which might explain why it slows down for you. How many topics do you create? Lastly, what exactly is in sentence_model?
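
As a rough sketch of what I mean (placeholder values, not a recommendation), increasing min_cluster_size and min_samples in the HDBSCAN model is what typically reduces the number of topics:

from hdbscan import HDBSCAN

# Sketch with placeholder values: larger min_cluster_size and min_samples
# generally yield fewer, larger topics.
hdbscan_model_tr = HDBSCAN(min_cluster_size=50,
                           min_samples=10,
                           metric='manhattan',
                           cluster_selection_method='eom',
                           prediction_data=True)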

mohammadm1985 commented 1 year ago

@MaartenGr I've been running the model for 2 hours now and I don't have any results yet. I am using a Jupyter notebook on JupyterHub and we are assigned about 80 cores, though that does not matter as the module is not written for parallel processing.

I've not seen the results for this model yet. The same model with half the number of documents and a min_cluster_size of 20 took 10 minutes to run. The number of docs is not high, just 4K docs. I am using all-mpnet-base-v2 for embeddings as I found it more reliable for my use case. With min_samples greater than one I get too much noise, and reducing the noise results in distorted distributions. I'll try higher values and update you regarding the speed.

mohammadm1985 commented 1 year ago

The run just finished. I got 291 topics, which does not seem good. I'll experiment with min_samples. Also, does it make sense for 291 topics to take 2 hours to generate? Am I on the right track and just need to optimize it?

Also, is there any way to run KeyBERTInspired or any other representation model after I do the clustering? Like manually changing it to MMR and seeing the results? I saw you can change the vectorizer, but what about the representation model itself?

MaartenGr commented 1 year ago

The run just finished. I got 291 topics, which does not seem good. I'll experiment with min_samples. Also, does it make sense for 291 topics to take 2 hours to generate? Am I on the right track and just need to optimize it?

With 4k documents, it generally should not take that long. Do you have a GPU enabled? Both MMR and KeyBERTInspired generate word embeddings and as such need a GPU to quickly generate embeddings with sentence-transformers. You could also try increasing min_df if the vocabulary happens to be too large.
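
As a quick sketch of both points (the stop-word list and the min_df value below are placeholders, not your configuration):

import torch
from sklearn.feature_extraction.text import TfidfVectorizer

# Check whether torch actually sees a GPU; MMR and KeyBERTInspired embed
# keywords and documents, which is slow on a CPU-only machine.
print(torch.cuda.is_available())

# A higher min_df shrinks the vocabulary that has to be embedded. Note that
# sklearn ignores min_df when a fixed vocabulary is passed, so this only
# takes effect without the vocabulary argument.
vectorizer_model = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=5)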

Also, is there any way to run KeyBERTInspired or any other representation model after I do the clustering? Like manually changing it to MMR and seeing the results? I saw you can change the vectorizer, but what about the representation model itself?

Yes, you can use .update_topics for that. It lets you update the topic representations without re-running the entire topic model.
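
A minimal sketch, reusing the fitted topic_model and docs from above (the diversity value is just an example):

from bertopic.representation import MaximalMarginalRelevance

# Swap in a different representation model after fitting,
# without re-running UMAP and HDBSCAN.
topic_model.update_topics(docs, representation_model=MaximalMarginalRelevance(diversity=0.3))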

mohammadm1985 commented 1 year ago

@MaartenGr I use a tedious approach to filter my vocabulary with gensim word2vec dictionaries to limit it to meaningful words, and I use a KeyBERT-inspired approach to build the vocabulary and the n-grams associated with it. That reduced my vocab size to 8,000, which is not a lot. I have a thought, though: I am using keyphrase_ngram_range=(1, 2) and also ngram_range=(1, 2) in my vectorizer. This makes me think it may be considering combinations of the phrases in my vocabulary... Unfortunately, I don't have access to GPU computational resources right now. Instead, the server provides 80 cores to parallelize processes. I think the representation model could work independently on each topic-document pair and could be parallelized. Isn't that something you might consider adding to the package?

MaartenGr commented 1 year ago

Unfortunately, I don't have access to GPU computational resources right now.

I believe that is the main issue here. Both MMR and KeyBERTInspired create embeddings from your vocabulary and specific documents, which is sped up with a GPU. Generally, it is not advised to use embedding models without a GPU.

I think the representation model can work independently on each topic-document pair and can be parallelized. Isn't that something you may consider adding to the package?

Seeing as documents/keywords are embedded with models that generally use torch, parallelization can be an issue. These are quite complex to parallelize, especially across the many backends that can be found in BERTopic.