MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

import numpy as np #687

Closed: rubypnchl closed this issue 1 year ago

rubypnchl commented 2 years ago

import numpy as np

probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

This code indeed does not change the internal structure of BERTopic. You can see that the topic model was not referenced here. In order to update your model with the new topics, you will have to perform the following:

import pandas as pd
import numpy as np

# Extract new topics
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

# Update the internal topic representation of the topics
# NOTE: You can skip this step if you do not want to change the topic representations
topic_model.update_topics(docs, new_topics)

# Update topic frequencies
documents = pd.DataFrame({"Document": docs, "Topic": new_topics})
topic_model._update_topic_size(documents)

._update_topic_size() has been kept separate from .update_topics() on purpose, as updating topics was initially meant to update only the topic representations and not the corresponding frequencies. Combining the two would make it less transparent what the model initially produced as output.

Hi MaartenGr,

Thank you for the great work; I am currently using BERTopic for one of my problems. I am facing issues when updating the topics with the above code. My main aim is to reduce outliers while preserving the quality of the topics. I am also curious: could you suggest how to calculate the percentage of outliers BERTopic produces (any method, standard or otherwise)?

I am getting the following error: [screenshot of the error attached]

PS: Thank you for the time-saving library!

Best Regards,

Originally posted by @rubypnchl in https://github.com/MaartenGr/BERTopic/issues/529#issuecomment-1225986102

MaartenGr commented 2 years ago

Thank you for the great work; I am currently using BERTopic for one of my problems. I am facing issues when updating the topics with the above code.

This might be related to the number of unique topics that you have in new_topics. It is important that the set of topics in topics is the same as the set in new_topics. For example, if topics 1 through 5 appear in topics, they should also appear in new_topics.
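
For example, a quick sanity check could look like this (a minimal sketch; topics and probs are assumed to be the outputs of .fit_transform):

import numpy as np

# Re-assign documents with low topic probabilities to the outlier topic (-1)
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

# Verify that no topic disappeared during the re-assignment
missing = set(topics) - set(new_topics)
if missing:
    print(f"Topics missing from new_topics: {missing}")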

Other than that, it would help if you could share your code for doing this. Also, which version of BERTopic are you using?

I am also curious: could you suggest how to calculate the percentage of outliers BERTopic produces (any method, standard or otherwise)?

That depends on the clustering model that you use. By default, BERTopic uses HDBSCAN, which indeed generates outliers. If you want to reduce those outliers, it might be worthwhile to read through the documentation here or to use a model that does not create outliers, like k-Means.
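
As for computing the percentage itself: outliers are the documents assigned to topic -1, so a one-liner suffices (a minimal sketch; topics is the list returned by .fit_transform):

import numpy as np

# Share of documents that ended up in the outlier topic (-1)
outlier_percentage = 100 * np.mean(np.array(topics) == -1)
print(f"{outlier_percentage:.1f}% of documents are outliers")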

rubypnchl commented 2 years ago

Thank you for the quick response!

Thank you for the great work; I am currently using BERTopic for one of my problems. I am facing issues when updating the topics with the above code.

This might be related to the number of unique topics that you have in new_topics. It is important that the set of topics in topics is the same as the set in new_topics. For example, if topics 1 through 5 appear in topics, they should also appear in new_topics.

Other than that, it would help if you could share your code for doing this. Also, which version of BERTopic are you using?

import numpy as np
import pandas as pd
import hdbscan
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Fit the model (vectorizer_model, abstracts, and embeddings are defined earlier)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=5)
topic_model = BERTopic(verbose=True, min_topic_size=5, nr_topics="auto", vectorizer_model=vectorizer_model, low_memory=True, calculate_probabilities=True, diversity=0.4, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(abstracts, embeddings)
topic_model.save("topic_model")

# Reload the model and recover the original topics and probabilities
topic_model = BERTopic.load("topic_model")
topics = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)
probs = topic_model._map_probabilities(probs, original_topics=True)

# Re-assign low-probability documents to the outlier topic and update the model
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
topic_model.update_topics(abstracts, new_topics, vectorizer_model=vectorizer_model)
documents = pd.DataFrame({"Document": abstracts, "Topic": new_topics})
topic_model._update_topic_size(documents)

I am also curious: could you suggest how to calculate the percentage of outliers BERTopic produces (any method, standard or otherwise)?

That depends on the clustering model that you use. By default, BERTopic uses HDBSCAN, which indeed generates outliers. If you want to reduce those outliers, it might be worthwhile to read through the documentation here or to use a model that does not create outliers, like k-Means.

Yes, I have read the documentation and I am working on that as well. From the results, I found that HDBSCAN gives more qualitative topics than k-Means, which is why I want to reduce the outliers in HDBSCAN using a probability threshold. Meanwhile, I found that one line of code throws an error, namely topic_model.update_topics(abstracts, new_topics, vectorizer_model=vectorizer_model), and I do not understand the reason, as I am new to the BERTopic library.

Best Regards,

MaartenGr commented 2 years ago

new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
topic_model.update_topics(abstracts, new_topics, vectorizer_model=vectorizer_model)

It might indeed be the case here that new_topics does not contain all the topics that you find in topics, so making sure they have the same set of topics might resolve your issue.

Therefore, I want to reduce the outlier in hdbscan using probability threshold.

You can also significantly reduce the number of outliers by playing around with the min_samples and min_cluster_size parameters in HDBSCAN. Then, if the number of outliers is sufficiently reduced, you could use .reduce_topics to further reduce the number of clusters if too many micro-clusters are created.
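
For illustration, such a tuning pass could look like this (a minimal sketch; the parameter values are placeholders to experiment with, and .reduce_topics is shown with the signature of recent BERTopic versions):

from hdbscan import HDBSCAN
from bertopic import BERTopic

# A lower min_samples relative to min_cluster_size makes HDBSCAN less eager to declare outliers
hdbscan_model = HDBSCAN(min_cluster_size=30, min_samples=5, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(abstracts)

# If this creates too many micro-clusters, merge them afterwards
topic_model.reduce_topics(abstracts, nr_topics=50)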

rubypnchl commented 2 years ago

new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
topic_model.update_topics(abstracts, new_topics, vectorizer_model=vectorizer_model)

It might indeed be the case here that new_topics does not contain all the topics that you find in topics, so making sure they have the same set of topics might resolve your issue.

Therefore, I want to reduce the outlier in hdbscan using probability threshold.

You can also significantly reduce the number of outliers by playing around with the min_samples and min_cluster_size parameters in HDBSCAN. Then, if the number of outliers is sufficiently reduced, you could use .reduce_topics to further reduce the number of clusters if too many micro-clusters are created.

Hi MaartenGr,

Thanks a lot for your reply. I have tried this approach as well, but as you mentioned in https://maartengr.github.io/BERTopic/faq.html#how-do-i-reduce-topic-outliers, this feels like a forceful assignment of documents to topics.

My main aim is to get no or as few outliers as possible while keeping good-quality topics. I have tried tuning different parameters: HDBSCAN (min_cluster_size=3, min_samples=1), UMAP (n_neighbors=[2, 5], n_components=[2, 5, 10, 15, 75, 100, 200]), and min_topic_size (from 1 to 20), but I was only able to get the outlier percentage down to 11.5% at best, and with very low-quality topics. I am working with publication abstracts, and all of these abstracts play a crucial role in the further experimentation in my project. I have the following questions; please help me if feasible.

Q1. Each document discusses some topic, so why is it considered an outlier, and why is the percentage of outliers so high?
Q2. How can I reduce outliers as much as possible, since each document can be a crucial topic in the future?
Q3. How can I maintain the high quality of the topics?
Q4. Any suggestions on preprocessing after embedding creation?
Q5. I have tried segmenting the abstracts into sentences, but due to the high dimensionality the kernel dies with only 0.1M documents (abstracts); any suggestions? I want to use only HDBSCAN for clustering, as it does not require initializing the number of clusters.

I am really sorry for asking so many questions, and many of these might sound stupid, but this is my current requirement and I am now entirely dependent on BERTopic for the initial steps of my project. Kindly help!

Thank you

MaartenGr commented 2 years ago

My main aim is to get no or as few outliers as possible while keeping good-quality topics.

If you want minimal or no outliers, then I would suggest using a different clustering model, like k-Means. You can use HDBSCAN to figure out the number of clusters you want and use that value for k-Means. You can find more about that here.
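
A sketch of that workflow (assuming the documented option of passing any scikit-learn-style clustering model through the hdbscan_model parameter):

from sklearn.cluster import KMeans
from bertopic import BERTopic

# First pass: let HDBSCAN suggest how many clusters the data supports
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(abstracts)
n_clusters = len(set(topics)) - (1 if -1 in topics else 0)  # exclude the outlier topic

# Second pass: k-Means with that k assigns every document to a cluster, so no outliers
cluster_model = KMeans(n_clusters=n_clusters)
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, _ = topic_model.fit_transform(abstracts)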

Q1. Each document discusses some topic, so why is it considered an outlier, and why is the percentage of outliers so high? Q2. How can I reduce outliers as much as possible, since each document can be a crucial topic in the future?

Whether a document contains a topic is generally not relevant for actually extracting topics. You could argue that every document contains a topic, since it talks about at least something. Here, it is about finding enough documents that share the same topic. If there is a topic for which only one document exists, it cannot be clustered together with other documents and is therefore an outlier. The percentage of outliers depends on parameters such as min_samples and min_cluster_size, together with the content of the documents. Running .visualize_documents can give you insights into which documents are outliers and why.
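
For reference, a minimal call could look like this (assuming you kept the document embeddings from fit time; passing them avoids recomputing):

# Plot documents in 2D; outliers (topic -1) show up as unclustered points
fig = topic_model.visualize_documents(abstracts, embeddings=embeddings)
fig.show()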

Q3. How can I maintain the high quality of the topics?

Maintaining high-quality topics is a tricky subject, not so much from an algorithmic perspective but much more from an evaluation perspective. Your definition of "high quality" is likely to differ from mine or anyone else's. What constitutes high quality depends highly on the use case: it can be topic coherence, diversity, domain interpretability, accuracy, etc.

Q4. Any suggestions on preprocessing after embedding creation?

This also generally depends on your use case, but most people get better results when passing CountVectorizer(stop_words="my_language") when instantiating BERTopic, as it removes quite a number of stop words. Similarly, you can look into some c-TF-IDF parameters for improving the representations.
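
A minimal sketch of that setup (stop_words="english" and the ngram_range are illustrative choices):

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Strip English stop words from the topic representations (not from the embeddings)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
topic_model = BERTopic(vectorizer_model=vectorizer_model)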

Q5. I have tried segmenting the abstracts into sentences, but due to the high dimensionality the kernel dies with only 0.1M documents (abstracts); any suggestions?

You can find a bit more about memory issues on the FAQ page.

I want to use only HDBSCAN for clustering, as it does not require initializing the number of clusters.

Although it does not require the number of clusters at initialization directly, it does require it indirectly through its hyperparameters. Lowering min_cluster_size generally increases the number of topics that are created. With HDBSCAN, I can easily generate 1,000 topics by lowering min_cluster_size, or only 5 topics by setting it a bit higher. So although HDBSCAN does not require a set number of clusters beforehand, you are still indirectly experimenting with the number of clusters it creates. From that perspective, it might be worthwhile to also look at other clustering algorithms.

rubypnchl commented 2 years ago

Hi MaartenGr,

Thank you so much for clearing up my doubts. Using the tricks you suggested, I was able to reduce the outlier percentage while keeping good-quality topics. Thanks a lot!