MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Hyperparameter Tuning #1642

Open yuanjames opened 9 months ago

yuanjames commented 9 months ago

Hello, I am preparing a research paper, and I've come across a somewhat tricky issue.

During hyperparameter optimization, I've identified four hyperparameters: min_topic_size, nr_topics, min_cluster_size, and min_samples, all of which potentially influence the determination of the number of topics. I'm wondering if these four hyperparameters have any priorities, meaning that if one is set at a certain value, the others will be ignored.

If my understanding is correct, it seems that HDBSCAN runs first, and topics are generated within each cluster. If so, what role does min_topic_size play? Is it used to filter out topics if the number of documents within a cluster is insufficient?

Another issue I'm encountering is related to outliers. Is there any way to handle or avoid outliers? It appears that outliers are not clustered. Consequently, in hyperparameter optimization, comparing BERTopic with other methods might not be fair as the number of documents could vary significantly due to the presence of outliers.

Steven-1124 commented 9 months ago

It is hard for me to optimize the hyperparameters as well.

I have run the model thousands of times to get the best categorization on a multimodal dataset. The hyperparameters I tune are dim_model (PCA or UMAP), UMAP_random_state (to ensure the results are reproducible), HDBSCAN_min_cluster_size, and min_topic_size. The performance metrics I record include, but are not limited to: silhouette_score, Coherence, Jaccard_Distance, Hellinger_Distance, number_topics, standard_dev_of_n_topics, max_size_topics, min_size_topics, and size_topics_outliers.

The interesting thing is that the performance of PCA is always less satisfying than UMAP; for example, the number of topics is less than 20, or the number of documents assigned to the -1 topic exceeds 80% of my whole dataset. I am not sure whether this is due to my dataset.

However, with HDBSCAN, some results are hard to explain and compare. First, to my understanding, the min_topic_size restriction should only be meaningful when min_topic_size is larger than HDBSCAN_min_cluster_size. Then, when I set HDBSCAN_min_cluster_size to 10 and min_topic_size to 40, the smallest topic size in my output is actually 10. What's more, this is a pattern rather than a one-off: I have run the model hundreds of times, and the smallest topic size in my results is sometimes less than or equal to min{HDBSCAN_min_cluster_size, min_topic_size}, which suggests my restrictions are not taking effect.

Also, sometimes changing hyperparameters does not change the results. There are pairs of hyperparameter settings that give me exactly the same performance recording. For example, when I use UMAP(random_state=0) and an HDBSCAN_min_cluster_size of 10, the performance recording stays the same no matter how I change min_topic_size (I vary it from 10 to 50).

My code is as below:

Definition of Trainer

from bertopic import BERTopic
from bertopic.representation import VisualRepresentation
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from umap import UMAP
from hdbscan import HDBSCAN


class BERTopicTrainer:
    def __init__(self):
        self.min_topic_size = 40
        self.dim_model_str_name = "None"
        self.HDBSCAN_min_cluster_size = 15
        self.random_state = 42
        self.dim_model = None
        self.topic_model = None
        self.sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

        # performance_record is assumed to be defined at module level elsewhere
        global performance_record

    def set_model(self, dim, random_state, HDBSCAN_min_cluster_size, min_topic_size):
        visual_model = VisualRepresentation(nr_repr_images=3, image_squares=True)
        representation_model = {"Visual_Aspect": visual_model}

        self.dim_model_str_name = dim.lower()
        self.HDBSCAN_min_cluster_size = HDBSCAN_min_cluster_size
        self.min_topic_size = min_topic_size
        self.random_state = random_state

        if self.dim_model_str_name == "pca":
            self.dim_model = PCA()
        elif self.dim_model_str_name in ("default", "umap"):
            self.dim_model = UMAP(random_state=self.random_state)
        else:
            print("This Model is Not Available")

        hdbscan_model = HDBSCAN(min_cluster_size=HDBSCAN_min_cluster_size, metric='euclidean',
                                cluster_selection_method='eom', prediction_data=True)

        self.topic_model = BERTopic(
            embedding_model=self.sentence_model,
            umap_model=self.dim_model,
            hdbscan_model=hdbscan_model,
            representation_model=representation_model,
            calculate_probabilities=True,
            min_topic_size=self.min_topic_size
        )

Training and Recording

trainer = BERTopicTrainer()
docs, images = get_docs_images(valid_data)

for random_state in range(0, 300, 21):
    for min_topic_size in range(10, 60, 10):
        for HDBSCAN_min_cluster_size in range(4, 50, 2):

            trainer.set_model(dim="default", random_state=random_state,
                              HDBSCAN_min_cluster_size=HDBSCAN_min_cluster_size,
                              min_topic_size=min_topic_size)

            # run_model and save_performance_record are defined elsewhere (not shown)
            topics, probs, topic_distr, info, frame = trainer.run_model(docs, images)
            store_output(topics, probs, topic_distr, info, frame)

            trainer.save_performance_record()
MaartenGr commented 9 months ago

@yuanjames @Steven-1124 Thanks for sharing your thoughts and use cases! To start, I think the best practices and parameter tuning pages are a great place to begin. If you have already seen those, let me go into the things you mentioned in a bit more detail.

@yuanjames

During hyperparameter optimization, I've identified four hyperparameters: min_topic_size, nr_topics, min_cluster_size, and min_samples, all of which potentially influence the determination of the number of topics. I'm wondering if these four hyperparameters have any priorities, meaning that if one is set at a certain value, the others will be ignored.

These four indeed all potentially affect the number of topics created. Do note, though, that if you are using the min_cluster_size of HDBSCAN, you can skip min_topic_size. The min_topic_size parameter is exactly the same parameter as min_cluster_size, merely a convenient way of controlling min_cluster_size without the need to use a custom cluster model. This was chosen since many users are not that familiar with clustering techniques but still want to control their parameters.

If my understanding is correct, it seems that HDBSCAN runs first, and topics are generated within each cluster. If so, what role does min_topic_size play? Is it used to filter out topics if the number of documents within a cluster is insufficient?

As mentioned above, min_topic_size is exactly the same as min_cluster_size so it takes on the role of min_cluster_size if you are not using a custom HDBSCAN model. To make this a bit more explicit, the following happens when you initialize BERTopic:

https://github.com/MaartenGr/BERTopic/blob/bcb3ca2ee0e691fe041da5db71bb076e2d5835e9/bertopic/_bertopic.py#L235-L240

As shown above, when you do not use a custom HDBSCAN model, it creates one for you but still allows you to choose min_cluster_size with the min_topic_size parameter.
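To make the equivalence concrete, here is a minimal sketch (assuming the default HDBSCAN settings shown in the linked code; the value 40 is illustrative):

from bertopic import BERTopic
from hdbscan import HDBSCAN

# Option A: no custom cluster model; BERTopic builds the HDBSCAN model itself
# and uses min_topic_size as its min_cluster_size.
topic_model_a = BERTopic(min_topic_size=40)

# Option B: a roughly equivalent custom HDBSCAN model; min_topic_size is now ignored.
hdbscan_model = HDBSCAN(min_cluster_size=40, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
topic_model_b = BERTopic(hdbscan_model=hdbscan_model)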

Another issue I'm encountering is related to outliers. Is there any way to handle or avoid outliers? It appears that outliers are not clustered. Consequently, in hyperparameter optimization, comparing BERTopic with other methods might not be fair as the number of documents could vary significantly due to the presence of outliers.

Outliers are an inherent part of HDBSCAN. If you want to reduce or entirely remove them, I would highly advise reading through this FAQ and the main method for removing outliers, namely .reduce_outliers.
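For reference, a rough sketch of that outlier-reduction workflow (docs is assumed to be your list of documents; the strategy value is illustrative):

from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Re-assign the outlier documents (topic -1) to their closest topics.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")

# Optionally refresh the topic representations with the new assignments.
topic_model.update_topics(docs, topics=new_topics)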

@Steven-1124

I have run the model thousands of times to get the best categorization on a multimodal dataset. The hyperparameters I tune are dim_model (PCA or UMAP), UMAP_random_state (to ensure the results are reproducible), HDBSCAN_min_cluster_size, and min_topic_size. The performance metrics I record include, but are not limited to: silhouette_score, Coherence, Jaccard_Distance, Hellinger_Distance, number_topics, standard_dev_of_n_topics, max_size_topics, min_size_topics, and size_topics_outliers.

As mentioned in one of the answers above, min_cluster_size and min_topic_size are the same parameters. If you set min_cluster_size you will not use whatever is set in min_topic_size.

The interesting thing is that the performance of PCA is always less satisfying than UMAP; for example, the number of topics is less than 20, or the number of documents assigned to the -1 topic exceeds 80% of my whole dataset. I am not sure whether this is due to my dataset.

They are quite different algorithms, so it is not surprising that one performs significantly worse than the other. There could be many reasons for that, including the parameters you set for PCA (e.g., the number of components) as well as for HDBSCAN (e.g., min_samples).

However, with HDBSCAN, some results are hard to explain and compare. First, to my understanding, the min_topic_size restriction should only be meaningful when min_topic_size is larger than HDBSCAN_min_cluster_size.

No, min_topic_size will not be used if you are using HDBSCAN with min_cluster_size. They are the same parameter, but the former allows users to skip using custom cluster models, which often requires in-depth expertise about the model. I also explained this above in a bit more detail, which hopefully helps with the intuition behind this.

Then, when I set HDBSCAN_min_cluster_size to 10 and min_topic_size to 40, the smallest topic size in my output is actually 10. What's more, this is a pattern rather than a one-off: I have run the model hundreds of times, and the smallest topic size in my results is sometimes less than or equal to min{HDBSCAN_min_cluster_size, min_topic_size}, which suggests my restrictions are not taking effect.

This relates to the above: min_topic_size will not be used if you use min_cluster_size (or any custom cluster model).

Also, sometimes changing hyperparameters does not change the results. There are pairs of hyperparameter settings that give me exactly the same performance recording. For example, when I use UMAP(random_state=0) and an HDBSCAN_min_cluster_size of 10, the performance recording stays the same no matter how I change min_topic_size (I vary it from 10 to 50).

See above.

Steven-1124 commented 9 months ago

Thanks so much for your quick response! Very clear! May I ask two more questions, one of which is similar to the former post here? Besides the issues mentioned in that post, I have found one more specific issue related to the probability distribution:

(1) There exists a doc that is assigned a topic but whose distribution over topics is all 0. (First, I want to confirm: the distribution matrix gives all the docs' distributions over topics except the -1 topic, right?)

[Screenshots (2023-11-24): the topic assignment output and the topic distribution matrix]

As you can see from the images, I obtained 1603 outliers, but 388 rows in the distribution matrix have a sum equal to 0. What's more, the rows with sum == 0 do not match the docs that are assigned to -1, and I do not know how to obtain a doc's probability for the -1 topic.

(2) I want to give all the docs a topic. Instead of using reduce_topics and the other methods you mentioned, I want to run another BERTopic on the outliers from my first BERTopic results, to avoid corroding the first results (I would like to preserve the best performance of the first BERTopic). For this second run I chose to use OnlineCountVectorizer, which always gives no outliers. However, the error constantly occurs: AttributeError: 'NoneType' object has no attribute 'split'. When I then run with my original full dataset (which is successfully processed by other configurations of BERTopic), the error persists. Would you help check whether there is a problem with my parameter settings?

My code:

from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

dim_model = PCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model, representation_model=representation_model,
                       calculate_probabilities=True, min_topic_size=40, nr_topics=40)

topics, probs = topic_model.fit_transform(documents=docs, images=images)

MaartenGr commented 9 months ago

@Steven-1124

As you can see from the images, I obtained 1603 outliers, but 388 rows in the distribution matrix have a sum equal to 0. What's more, the rows with sum == 0 do not match the docs that are assigned to -1, and I do not know how to obtain a doc's probability for the -1 topic.

The probabilities calculated through .approximate_distribution are calculated differently from the assignment process in .fit or .fit_transform. It was named approximate for exactly that purpose since it is merely an approximation and not an exact representation of the fitting procedure.

With respect to the probabilities, that might depend on the parameters that you used. I would advise reading through the documentation to get a feeling of the different parameters. There is also the min_similarity parameter that might explain the differences that you get. Setting that to 0 will allow for any selection to be made. Setting that higher than 0 will result in fewer topics being assigned. The docstrings should give you more information about that.
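As a small sketch of how min_similarity affects the output (docs is assumed to be the documents the model was fitted on):

# With the default min_similarity, assignments below the threshold are dropped,
# which can leave some rows of the distribution matrix summing to 0.
topic_distr, _ = topic_model.approximate_distribution(docs)

# With min_similarity=0, every document can receive some probability mass.
topic_distr, _ = topic_model.approximate_distribution(docs, min_similarity=0)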

dim_model = PCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model, representation_model=representation_model,
                       calculate_probabilities=True, min_topic_size=40, nr_topics=40)

topics, probs = topic_model.fit_transform(documents=docs, images=images)

There are a number of things that should be changed here.

First, you are not performing online/incremental topic modeling here since you are using .fit_transform instead of the suggested .partial_fit function. I would highly advise reading through the documentation, as it contains a number of examples of how to use online topic modeling (see the sketch at the end of this reply).

Second, if you do not intend to use online/incremental topic modeling, then I would highly advise using the default models instead, namely UMAP, HDBSCAN, and CountVectorizer.

Third, I would advise skipping over nr_topics here since you are already trying to control the number of clusters through your clustering model. Similarly, skip min_topic_size since it is not actually doing anything.

Lastly, and based on your code, definitely go through the best practices first. It seems you are mixing models/methods in ways that are generally not advised (using OnlineCountVectorizer, which implies online learning, but then not using .partial_fit or IncrementalPCA). I think it is best to take a step back first.
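To illustrate the first point, here is a minimal sketch of the online/incremental workflow as described in the online topic modeling documentation (the batch size is illustrative):

from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Train incrementally on batches instead of calling .fit_transform once.
doc_batches = [docs[i:i + 1000] for i in range(0, len(docs), 1000)]
for batch in doc_batches:
    topic_model.partial_fit(batch)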

yuanjames commented 9 months ago

@MaartenGr It is always helpful and nice to discuss with you, thank you so much. All clear.

MaartenGr commented 9 months ago

@yuanjames Thanks, that is kind of you to say! If you ever run into any other issue, please let me know 😄

haukelicht commented 9 months ago

Thanks for the helpful discussion here!

I have a related question for @Steven-1124 (but also for the others if they have an opinion): to compute the silhouette scores, should I use

  1. the (pre-dimensionality-reduction) document/image embeddings, or
  2. the reduced representations (e.g., the 5-dimensional PC representation after applying PCA with n_components=5)?

Option 1 would be like:

from sklearn.metrics import silhouette_score

embeddings = topic_model.embedding_model.embed(docs)
silhouette_score(X=embeddings, labels=topic_model.topics_)

Option 2 would be like:

silhouette_score(X=topic_model.umap_model.embedding_, labels=topic_model.topics_)

And why?

Thank you a lot in advance!

MaartenGr commented 9 months ago

@haukelicht There isn't a "right" solution to your problem. Both representations are used in the topic modeling process of BERTopic. The unreduced embeddings are generally the best representations of the documents themselves. However, the reduced embeddings are the ones used during the clustering process and might therefore be more appropriate for this evaluation metric.

haukelicht commented 9 months ago

Thank you, @MaartenGr. This makes a lot of sense!

yuanjames commented 9 months ago


Hi @MaartenGr, I am using KMeans as the clustering model. I was wondering: is min_topic_size decided by n_clusters?

haukelicht commented 9 months ago

@yuanjames: For k-means, how many documents are assigned to a cluster ("topic") depends on n_clusters and the distribution of your documents in the dimensionality-reduced embedding space. If you want to constrain topics' sizes with k-means, maybe Bradley et al. 2000 is useful (see also this Towards Data Science post).

Rqcker commented 9 months ago

@MaartenGr May I ask: if we use KMeans instead of HDBSCAN, do we still need fit_transform() to fit our topic_model?

from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
MaartenGr commented 9 months ago

@Rqcker Yes, you always need to run either .fit or .fit_transform regardless of the model. The only exception is when you load a pre-trained BERTopic model.
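For example, a quick sketch (paths and variable names are illustrative):

from bertopic import BERTopic

# Fit once and save the model.
topic_model = BERTopic().fit(docs)
topic_model.save("my_topic_model")

# A loaded model does not need .fit again; .transform can be used directly.
loaded_model = BERTopic.load("my_topic_model")
topics, probs = loaded_model.transform(new_docs)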

yuanjames commented 9 months ago

@yuanjames: For k-means, how many documents are assigned to a cluster ("topic") depends on n_clusters and the distribution of your documents in the dimensionality-reduced embedding space. If you want to constrain topics' sizes with k-means, maybe Bradley et al. 2000 is useful (see also this Towards Data Science post).

Hi, thanks for your reply, and sorry for the late reply as well. I checked the docs and ran some experiments, and I found that the n_clusters parameter in KMeans is effectively the same as the number of topics. Additionally, I am not sure whether min_topic_size is then ignored.

MaartenGr commented 9 months ago

Yes, min_topic_size is ignored whenever you pass a custom hdbscan_model.
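A small sketch to make that concrete (values are illustrative):

from bertopic import BERTopic
from sklearn.cluster import KMeans

# With a custom cluster model, n_clusters determines the number of topics
# (k-means produces no -1 outliers) and min_topic_size has no effect.
cluster_model = KMeans(n_clusters=50, random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model, min_topic_size=40)  # min_topic_size is ignored here

topics, _ = topic_model.fit_transform(docs)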