Open yuanjames opened 9 months ago
It is hard for me to optimize the hyperparameters as well.
I have run thousands of experiments to get the best categorization on a multimodal dataset. The hyperparameters I tune are dim_model (PCA or UMAP), UMAP_random_state (to ensure the results are replicable), HDBSCAN_min_cluster_size, and min_topic_size. The performance metrics I record include, but are not limited to: silhouette_score, Coherence, Jaccard_Distance, Hellinger_Distance, number_topics, standard_dev_of_n_topics, max_size_topics, min_size_topics, size_topics_outliers.
The interesting thing is that PCA always performs worse than UMAP: for example, the number of topics is less than 20, or the share of documents assigned to topic -1 exceeds 80% of my whole dataset. I am not sure whether this is due to my dataset.
However, with HDBSCAN, some results are hard to explain and compare. First, to my understanding, the min_topic_size restriction is only meaningful when min_topic_size is larger than HDBSCAN_min_cluster_size. Then, when I set HDBSCAN_min_cluster_size to 10 and min_topic_size to 40, the smallest topic size in my output is actually 10. What's more, this is a pattern rather than a one-off: I have run the model hundreds of times, and the smallest topic size in my results is sometimes less than or equal to min{HDBSCAN_min_cluster_size, min_topic_size}, which suggests my restrictions are not taking effect.
Also, sometimes changing hyperparameters does not change the results. Some combinations of hyperparameters give exactly the same performance record. For example, when I use UMAP(random_state=0) and an HDBSCAN_min_cluster_size of 10, the performance record is identical no matter how I change min_topic_size (I varied it from 10 to 50).
My code is as below:
from bertopic import BERTopic
from bertopic.representation import VisualRepresentation
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from umap import UMAP


class BERTopicTrainer:
    def __init__(self):
        self.min_topic_size = 40
        self.dim_model_str_name = "None"
        self.HDBSCAN_min_cluster_size = 15
        self.random_state = 42
        self.dim_model = None
        self.topic_model = None
        self.sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
        global performance_record

    def set_model(self, dim, random_state, HDBSCAN_min_cluster_size, min_topic_size):
        visual_model = VisualRepresentation(nr_repr_images=3, image_squares=True)
        representation_model = {"Visual_Aspect": visual_model}
        self.dim_model_str_name = dim.lower()
        self.HDBSCAN_min_cluster_size = HDBSCAN_min_cluster_size
        self.min_topic_size = min_topic_size
        self.random_state = random_state
        if self.dim_model_str_name == "pca":
            self.dim_model = PCA()
        elif self.dim_model_str_name in ("default", "umap"):
            self.dim_model = UMAP(random_state=self.random_state)
        else:
            # Fail loudly instead of silently reusing the previous model.
            raise ValueError("This Model is Not Available")
        hdbscan_model = HDBSCAN(min_cluster_size=HDBSCAN_min_cluster_size, metric="euclidean", cluster_selection_method="eom", prediction_data=True)
        self.topic_model = BERTopic(
            embedding_model=self.sentence_model,
            umap_model=self.dim_model,
            hdbscan_model=hdbscan_model,
            representation_model=representation_model,
            calculate_probabilities=True,
            min_topic_size=self.min_topic_size,
        )

    # run_model() and save_performance_record() are defined elsewhere (not shown in this post).


trainer = BERTopicTrainer()
docs, images = get_docs_images(valid_data)
for random_state in range(0, 300, 21):
    for min_topic_size in range(10, 60, 10):
        for HDBSCAN_min_cluster_size in range(4, 50, 2):
            trainer.set_model(dim="default", random_state=random_state,
                              HDBSCAN_min_cluster_size=HDBSCAN_min_cluster_size,
                              min_topic_size=min_topic_size)
            topics, probs, topic_distr, info, frame = trainer.run_model(docs, images)
            store_output(topics, probs, topic_distr, info, frame)
trainer.save_performance_record()
@yuanjames @Steven-1124 Thanks for sharing your thoughts and use cases! To start with, I think the best practices and parameter tuning pages are a great place to begin. If you have already seen those, let me go into the things you mentioned in a bit more detail.
@yuanjames
During hyperparameter optimization, I've identified four hyperparameters: min_topic_size, nr_topics, min_cluster_size, and min_samples, all of which potentially influence the determination of the number of topics. I'm wondering if these four hyperparameters have any priorities, meaning that if one is set at a certain value, the others will be ignored.
These four indeed all potentially affect the number of topics created. Do note, though, that if you are using the min_cluster_size of HDBSCAN, you can skip min_topic_size. The min_topic_size parameter is exactly the same parameter as min_cluster_size; it is merely a convenient way of controlling min_cluster_size without the need for custom cluster models. This was chosen because many users are not that familiar with clustering techniques but still want to control their parameters.
If my understanding is correct, it seems that HDBSCAN runs first, and topics are generated within each cluster. If so, what role does min_topic_size play? Is it used to filter out topics if the number of documents within a cluster is insufficient?
As mentioned above, min_topic_size is exactly the same as min_cluster_size, so it takes on the role of min_cluster_size if you are not using a custom HDBSCAN model. To make this a bit more explicit, the following happens when you initialize BERTopic:
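In simplified form, that fallback can be sketched as below. This is an illustration with hypothetical names (`resolve_cluster_model`, and a plain dict standing in for an HDBSCAN instance), not BERTopic's actual source.

```python
def resolve_cluster_model(min_topic_size=10, hdbscan_model=None):
    """Return the cluster model that would be used for fitting.

    Hypothetical helper: a dict stands in for an HDBSCAN instance so the
    sketch runs without any dependencies.
    """
    if hdbscan_model is not None:
        # A custom model wins; min_topic_size is never consulted.
        return hdbscan_model
    # No custom model: min_topic_size becomes HDBSCAN's min_cluster_size.
    return {"model": "HDBSCAN", "min_cluster_size": min_topic_size}


custom = {"model": "HDBSCAN", "min_cluster_size": 10}
print(resolve_cluster_model(min_topic_size=40, hdbscan_model=custom))
print(resolve_cluster_model(min_topic_size=40))
```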
As shown above, when you do not use a custom HDBSCAN model, it creates one for you but still allows you to choose min_cluster_size with the min_topic_size parameter.
Another issue I'm encountering is related to outliers. Is there any way to handle or avoid outliers? It appears that outliers are not clustered. Consequently, in hyperparameter optimization, comparing BERTopic with other methods might not be fair as the number of documents could vary significantly due to the presence of outliers.
Outliers are an inherent part of HDBSCAN. If you want to reduce or entirely remove them, I would highly advise reading through this FAQ and the main method for removing outliers, namely .reduce_outliers.
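As a rough sketch of that last step: `reduce_outliers` returns a new list of topic assignments, and `update_topics` then refreshes the topic representations. The wrapper name `reassign_outliers` and the choice of the `"c-tf-idf"` strategy are assumptions for illustration, and `topic_model` is taken to be an already fitted BERTopic instance.

```python
def reassign_outliers(topic_model, docs, topics, strategy="c-tf-idf"):
    """Move documents labelled -1 into existing topics on a fitted model."""
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
    # Topic representations are not recomputed automatically after reassignment.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics
```

Depending on the strategy and its threshold, some documents may still remain in topic -1.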
@Steven-1124
I have run thousands of experiments to get the best categorization on a multimodal dataset. The hyperparameters I tune are dim_model (PCA or UMAP), UMAP_random_state (to ensure the results are replicable), HDBSCAN_min_cluster_size, and min_topic_size. The performance metrics I record include, but are not limited to: silhouette_score, Coherence, Jaccard_Distance, Hellinger_Distance, number_topics, standard_dev_of_n_topics, max_size_topics, min_size_topics, size_topics_outliers.
As mentioned in one of the answers above, min_cluster_size and min_topic_size are the same parameter. If you set min_cluster_size, whatever is set in min_topic_size will not be used.
The interesting thing is that PCA always performs worse than UMAP: for example, the number of topics is less than 20, or the share of documents assigned to topic -1 exceeds 80% of my whole dataset. I am not sure whether this is due to my dataset.
They are quite different algorithms, so it is not surprising that one performs significantly worse than the other. There could be many reasons for that, including the parameters you set for PCA (e.g., the number of dimensions) as well as HDBSCAN (e.g., min_samples).
However, with HDBSCAN, some results are hard to explain and compare. First, to my understanding, the min_topic_size restriction is only meaningful when min_topic_size is larger than HDBSCAN_min_cluster_size.
No, min_topic_size will not be used if you are using HDBSCAN with min_cluster_size. They are the same parameter, but the former exists to let users skip custom cluster models, which often require in-depth expertise about the model. I also explained this above in a bit more detail, which hopefully helps the intuition behind this.
Then, when I set HDBSCAN_min_cluster_size to 10 and min_topic_size to 40, the smallest topic size in my output is actually 10. What's more, this is a pattern rather than a one-off: I have run the model hundreds of times, and the smallest topic size in my results is sometimes less than or equal to min{HDBSCAN_min_cluster_size, min_topic_size}, which suggests my restrictions are not taking effect.
This relates to the above: min_topic_size will not be used if you use min_cluster_size (or any custom cluster model).
Also, sometimes changing hyperparameters does not change the results. Some combinations of hyperparameters give exactly the same performance record. For example, when I use UMAP(random_state=0) and an HDBSCAN_min_cluster_size of 10, the performance record is identical no matter how I change min_topic_size (I varied it from 10 to 50).
See above.
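Since min_topic_size has no effect once a custom HDBSCAN model is passed, the search grid from the posted loop could drop it. A minimal sketch using the ranges from that code (the variable names are illustrative):

```python
from itertools import product

random_states = range(0, 300, 21)
min_cluster_sizes = range(4, 50, 2)
min_topic_sizes = range(10, 60, 10)  # redundant with a custom HDBSCAN model

# Grid as posted: each min_topic_size value should repeat the same fit.
full_grid = list(product(random_states, min_topic_sizes, min_cluster_sizes))
# Grid that only varies parameters that actually reach the models.
effective_grid = list(product(random_states, min_cluster_sizes))

print(len(full_grid), len(effective_grid))
```

Under these ranges the reduced grid is one fifth the size of the posted one, which would also explain the identical performance records across min_topic_size values.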
Thanks so much for your quick response! Very clear! May I ask two more questions, one of which is similar to the former post here? Besides the issues mentioned in the former post, I found one more specific issue related to the probability distribution:
(1) There are docs that are assigned a topic but whose distribution over topics is all 0. (First, I want to confirm that the distribution matrix gives the distribution of all docs over all topics except the -1 topic, right?)
As you can see from the images, I have obtained 1603 outliers, but 388 rows in the distribution matrix have a sum equal to 0. What's more, the rows with sum_value == 0 do not match the docs that are assigned to -1, and I do not know how to obtain a doc's probability for the -1 topic.
(2) I want to give all the docs a topic. Instead of using reduce_topics and the other methods you mentioned, I want to run a second BERTopic on the outliers from my first BERTopic results, to avoid degrading my first results (I would like to preserve the best performance of the first BERTopic). For the second run I chose OnlineCountVectorizer, which should always give no outliers. However, this error constantly occurs: AttributeError: 'NoneType' object has no attribute 'split'. When I then run it with my original full dataset (which is processed successfully by other BERTopic configurations), the error persists. Would you help check whether there is any problem with my parameter settings?
dim_model = PCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model, vectorizer_model=vectorizer_model, representation_model=representation_model, calculate_probabilities=True, min_topic_size=40, nr_topics=40)
topics, probs = topic_model.fit_transform(documents=docs, images=images)
@Steven-1124
As you can see from the images, I have obtained 1603 outliers, but 388 rows in the distribution matrix have a sum equal to 0. What's more, the rows with sum_value == 0 do not match the docs that are assigned to -1, and I do not know how to obtain a doc's probability for the -1 topic.
The probabilities calculated through .approximate_distribution are calculated differently from the assignment process in .fit or .fit_transform. It was named approximate for exactly that reason, since it is merely an approximation and not an exact representation of the fitting procedure.
With respect to the probabilities, that might depend on the parameters that you used. I would advise reading through the documentation to get a feeling for the different parameters. There is also the min_similarity parameter that might explain the differences that you get. Setting it to 0 will allow any selection to be made. Setting it higher than 0 will result in fewer topics being assigned. The docstrings should give you more information about that.
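To check how much min_similarity contributes to the all-zero rows, one could count them at a few thresholds. This is a sketch: `count_zero_rows` is a hypothetical helper, and `topic_model` is assumed to be a fitted BERTopic instance.

```python
def count_zero_rows(topic_model, docs, min_similarity):
    """Count documents whose approximate topic distribution sums to zero."""
    # approximate_distribution returns (document-topic matrix, token-level output).
    topic_distr, _ = topic_model.approximate_distribution(
        docs, min_similarity=min_similarity
    )
    return sum(1 for row in topic_distr if sum(row) == 0)
```

Comparing the count at `min_similarity=0` with the count at a higher threshold would indicate whether the threshold accounts for the empty rows.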
dim_model = PCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model, vectorizer_model=vectorizer_model, representation_model=representation_model, calculate_probabilities=True, min_topic_size=40, nr_topics=40)
topics, probs = topic_model.fit_transform(documents=docs, images=images)
There are a number of things that should be changed here.
First, you are not performing online/incremental topic modeling here, since you are using .fit_transform instead of the suggested .partial_fit function. I would highly advise reading through the documentation, as it contains a number of examples of how to use online topic modeling.
Second, if you do not intend to use online/incremental topic modeling, then I would highly advise using the default models instead, namely UMAP, HDBSCAN, and CountVectorizer.
Third, I would advise skipping nr_topics here, since you are already trying to control the number of clusters through your clustering model. Similarly, skip min_topic_size, since it is not actually doing anything.
Lastly, and based on your code, definitely go through the best practices first. It seems you are mixing models/methods in a way that is generally not advised (using OnlineCountVectorizer, which indicates online learning, but then using neither .partial_fit nor IncrementalPCA). I think it is best to take a step back first.
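For reference, the incremental pattern feeds the model one batch at a time through .partial_fit. A minimal sketch, assuming `topic_model` was built from incremental components (for example IncrementalPCA, MiniBatchKMeans, and OnlineCountVectorizer); the helper names and the chunk size are illustrative.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fit_incrementally(topic_model, docs, chunk_size=1000):
    """Train an online topic model batch by batch instead of calling fit_transform."""
    for batch in chunked(docs, chunk_size):
        topic_model.partial_fit(batch)
    return topic_model
```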
@MaartenGr It is always helpful and nice to discuss with you, thank you so much. All clear.
@yuanjames Thanks, that is kind of you to say! If you ever run into any other issue, please let me know 😄
Thanks for the helpful discussion here!
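I have a related question @Steven-1124 (but also the others if they have an opinion): to compute the silhouette scores, should I use (1) the original document embeddings or (2) the reduced embeddings from the dimensionality reduction step (e.g., with n_components=5)?
Option 1 would be like: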
embeddings = topic_model.embedding_model.embed(docs)
silhouette_score(X=embeddings, labels=topic_model.topics_)
Option 2 would be like:
silhouette_score(X=topic_model.umap_model.embedding_, labels=topic_model.topics_)
And why?
Thanks a lot in advance!
@haukelicht There isn't a "right" solution to your problem. Both representations are used in the topic modeling process of BERTopic. The unreduced embeddings are generally the best representations of the documents themselves. However, the reduced embeddings are used during the clustering process and might be more representative of the evaluation metric.
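Whichever representation is scored, documents labelled -1 would otherwise be treated by the metric as one more cluster. One option, which is an assumption of this sketch and not something prescribed in this thread, is to drop them before scoring; `without_outliers` is a hypothetical helper.

```python
def without_outliers(vectors, topics, outlier_label=-1):
    """Keep only the (vector, topic) pairs that were assigned to a real topic."""
    kept = [(v, t) for v, t in zip(vectors, topics) if t != outlier_label]
    return [v for v, _ in kept], [t for _, t in kept]
```

The filtered pair can then be passed to `silhouette_score(X, labels)` for either the full or the reduced embeddings.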
Thank you, @MaartenGr. This makes a lot of sense!
Hi @MaartenGr, I am using KMeans as the clustering model. I was wondering whether min_topic_size is determined by n_clusters?
@yuanjames: For k-means, how many documents are assigned to a cluster ("topic") depends on n_clusters and the distribution of your documents in the dimensionality-reduced embedding space. If you want to constrain topics' sizes with k-means, maybe Bradley et al. 2000 is useful (see also this Towards Data Science post).
@MaartenGr May I ask: if we use KMeans instead of HDBSCAN, do we still need fit_transform() to fit our topic_model?
from bertopic import BERTopic
from sklearn.cluster import KMeans
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
@Rqcker Yes, you always need to run either .fit or .fit_transform regardless of the model. The only exception is when you load a pre-trained BERTopic model.
@yuanjames: For k-means, how many documents are assigned to a cluster ("topic") depends on n_clusters and the distribution of your documents in the dimensionality-reduced embedding space. If you want to constrain topics' sizes with k-means, maybe Bradley et al. 2000 is useful (see also this Towards Data Science post).
Hi, thanks for your reply, and sorry for the late reply as well. I checked the docs and ran some experiments, and I found that the n_clusters parameter in KMeans is actually the same as the number of topics. Additionally, I am not sure whether min_topic_size is then ignored.
Yes, min_topic_size is ignored whenever you pass a custom model through hdbscan_model.
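To see which topic sizes a fitted model actually produced, whatever the cluster model, the assignments can be counted directly. A small sketch; `topic_sizes` is a hypothetical helper and the example labels are made up.

```python
from collections import Counter


def topic_sizes(topics, outlier_label=-1):
    """Return {topic: number of documents}, leaving out the outlier label."""
    counts = Counter(topics)
    counts.pop(outlier_label, None)
    return dict(counts)


sizes = topic_sizes([0, 0, 1, -1, 1, 1, 2])
print(min(sizes.values()), max(sizes.values()))
```

Applied to the topics returned by fitting, the smallest value is the quantity to compare against min_cluster_size (for HDBSCAN) rather than min_topic_size.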
Hello, I am preparing a research paper, and I've come across a somewhat tricky issue.
During hyperparameter optimization, I've identified four hyperparameters: min_topic_size, nr_topics, min_cluster_size, and min_samples, all of which potentially influence the determination of the number of topics. I'm wondering if these four hyperparameters have any priorities, meaning that if one is set at a certain value, the others will be ignored.
If my understanding is correct, it seems that HDBSCAN runs first, and topics are generated within each cluster. If so, what role does min_topic_size play? Is it used to filter out topics if the number of documents within a cluster is insufficient?
Another issue I'm encountering is related to outliers. Is there any way to handle or avoid outliers? It appears that outliers are not clustered. Consequently, in hyperparameter optimization, comparing BERTopic with other methods might not be fair as the number of documents could vary significantly due to the presence of outliers.