MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Finding optimal min_cluster size #1358

Closed bexy-b closed 11 months ago

MaartenGr commented 1 year ago

I'd love to help out but it seems that you did not provide a description. What seems to be the issue?

bexy-b commented 1 year ago

Hi there,

Apologies, that was a typo - my message didn't send! I have two queries that I was wondering if you could help with:

1) I am using MLflow to explore and log different parameter combinations related to the dimensionality reduction and clustering steps of BERTopic. At the moment I am using topic coherence and the number of outliers as metrics, as well as inspecting topic representations, to try and find a good min_cluster_size for HDBSCAN.

My problem is primarily that I have a dataset with classes of different sizes, and the parameters that produce good topics for one class might be very poor for another. For example, I have some classes with 40,000 datapoints and others with only 1,000. I appreciate the subjective nature of trying to find meaningful topics, but I was hoping to automate this as much as possible, for example by exploring whether there is an optimal min_cluster_size as a percentage of the class size. Would you recommend any particular metrics for this purpose, e.g. any of the measures in the OCTIS package you have mentioned in other discussions?

2) My second query is somewhat related to the first. For a class of 40,000 documents, for example, a smaller min_cluster_size leads to too many topics to make sense of visually or easily. Having played around with many of the settings, however, it seems to me that it might be a good idea to find fairly granular topics as a first step and then apply the hierarchical_topics function to group them into a more manageable number. That way an end user can see a general overview and drill down as needed. My question is: how is the topic representation calculated when you apply the hierarchical step? At the moment I'm using a combination of KeyBERT and MMR to fine-tune the granular representations. Essentially I am wondering whether it's better to just find larger topics to begin with, or whether the approach I have described above will yield the results I am hoping for.

Apologies for the very long questions!

Thanks again and I really appreciate how detailed all the documentation and discussion threads for BERTopic are.

MaartenGr commented 1 year ago

Would you recommend any particular metrics for this purpose, e.g. any of the measures in the OCTIS package you have mentioned in other discussions?

OCTIS mostly focuses on the output topics themselves and, to a lesser extent, on the generated clusters. Not all topic modeling techniques approach this as a clustering task, so I do not think OCTIS would be appropriate here since you are focusing on the class distribution. It might be worthwhile to also consider different clustering algorithms that do take class distributions into account or that generate fewer outliers. This also brings me to a second point, namely reducing outliers as a way to increase the number of topics in each class. Do note, though, that big differences in class distribution are not necessarily surprising; it rarely happens that all topics are equally distributed across all documents. Having said that, if you expect the same distribution, then that is something you can optimize for through MLflow by defining a custom evaluation metric.
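As a rough sketch of that last point, one could sweep min_cluster_size as a fraction of the class size and log custom metrics such as the outlier ratio per run. This is only an illustration under assumptions: `load_docs_for_class` is a hypothetical helper, and the fractions and metric names are placeholders, not recommended values.

```python
import mlflow
from hdbscan import HDBSCAN
from bertopic import BERTopic

docs = load_docs_for_class("complaints")  # hypothetical helper; substitute your own data loading

# Candidate min_cluster_size values expressed as a fraction of the class size
for fraction in [0.005, 0.01, 0.02]:
    min_cluster_size = max(10, int(len(docs) * fraction))

    with mlflow.start_run():
        mlflow.log_param("fraction_of_class", fraction)
        mlflow.log_param("min_cluster_size", min_cluster_size)

        hdbscan_model = HDBSCAN(
            min_cluster_size=min_cluster_size,
            metric="euclidean",
            cluster_selection_method="eom",
            prediction_data=True,
        )
        topic_model = BERTopic(hdbscan_model=hdbscan_model)
        topics, _ = topic_model.fit_transform(docs)

        # Custom evaluation metrics: share of outliers and number of topics found
        outlier_ratio = sum(t == -1 for t in topics) / len(topics)
        nr_topics = len(set(topics)) - (1 if -1 in topics else 0)
        mlflow.log_metric("outlier_ratio", outlier_ratio)
        mlflow.log_metric("nr_topics", nr_topics)
```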

My question is: how is the topic representation calculated when you apply the hierarchical step?

They are calculated by leveraging the pre-trained c-TF-IDF representations and applying them to the aggregated topics. In other words, the documents of the merged topics are combined and c-TF-IDF is re-calculated for that combined set, so the representation is a re-calculation based on the combined topic distributions.

At the moment I'm using a combination of KeyBERT and MMR to fine-tune the granular representations. Essentially I am wondering whether it's better to just find larger topics to begin with, or whether the approach I have described above will yield the results I am hoping for.

Depending on the granularity, it might indeed be helpful to increase the minimum cluster size. If a topic only has a couple of documents, it is to be expected that a solid representation is difficult, and perhaps even impossible, to extract.
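For reference, a minimal sketch of the "granular first, then hierarchical" workflow described above, assuming KeyBERTInspired and MaximalMarginalRelevance as the fine-tuning representations; `load_docs_for_class` is a hypothetical helper and the parameter values are placeholders.

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from hdbscan import HDBSCAN

docs = load_docs_for_class("complaints")  # hypothetical helper; use your own data

# A small min_cluster_size produces many granular topics
hdbscan_model = HDBSCAN(min_cluster_size=25, prediction_data=True)

# Chain KeyBERT-inspired keywords with MMR to diversify the keyword lists
representation_model = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]

topic_model = BERTopic(
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
)
topics, _ = topic_model.fit_transform(docs)

# Build the topic hierarchy; merged topics get their c-TF-IDF re-calculated
hierarchical_topics = topic_model.hierarchical_topics(docs)

# Inspect the hierarchy so end users can drill down from coarse to granular
print(topic_model.get_topic_tree(hierarchical_topics))
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
```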

bexy-b commented 1 year ago

Thanks, that makes sense! I did have another query about the online vs. dynamic topic modelling capabilities.

What I'm not sure I understand is the difference these approaches produce when looking at topics over time. For example, with dynamic topic modelling we train on the entire dataset and can then adjust the evolution_tuning parameter to focus on how topics evolved over time rather than on their global representation. How would this representation differ from, say, using online topic modelling where the batches correspond to different time slices, i.e. each batch of docs is from the same month? Would incrementally training on each batch produce a better representation of topic evolution over time, or would they essentially yield the same results?

Thanks again.

MaartenGr commented 1 year ago

Dynamic and online topic modeling in BERTopic are very different procedures.

In dynamic topic modeling, we train on the entire dataset and then generate local representations of the topics. Local refers to the timestamps themselves.

In online topic modeling, we also train on the entire dataset, but in a sequential manner by giving the model batches of data one at a time. As a result, the model slowly becomes more accurate and learns a better clustering. In other words, documents in the very first batch can be assigned quite differently from documents in the very last batch, even if they were the same.

Moreover, since the two use different underlying algorithms, they tend to generate very different results. Generally, I would advise going with dynamic topic modeling since you model everything at once.
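To make the distinction concrete, here is a rough sketch of both set-ups; `load_docs_and_timestamps` and `load_monthly_batches` are hypothetical helpers, and the online components shown (IncrementalPCA, MiniBatchKMeans, OnlineCountVectorizer) are just one possible combination that supports partial_fit.

```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans

# --- Dynamic topic modeling: fit once on everything, then slice by timestamp ---
docs, timestamps = load_docs_and_timestamps()  # hypothetical helper
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)
topics_over_time = topic_model.topics_over_time(
    docs, timestamps, nr_bins=20, evolution_tuning=True
)
topic_model.visualize_topics_over_time(topics_over_time)

# --- Online topic modeling: feed batches (e.g. one per month) sequentially ---
online_model = BERTopic(
    umap_model=IncrementalPCA(n_components=5),
    hdbscan_model=MiniBatchKMeans(n_clusters=50),
    vectorizer_model=OnlineCountVectorizer(stop_words="english"),
)
for batch in load_monthly_batches():  # hypothetical helper yielding lists of docs
    online_model.partial_fit(batch)
```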

bexy-b commented 1 year ago

So training in batches that correspond to time slices doesn't mean you'd achieve a better topic representation for documents at time X than if you trained on everything at once and then looked at the local representation at X?

Sorry, I do have yet another question (I really appreciate the help!). I'm a little lost as to how to change the representation model so that it is used only to change the topic label rather than the representation. In other words, I've trained the model using KeyBERT and MMR to refine the representation and get a more interpretable list of keywords. Now I want to use that refined list to generate a topic label using the OpenAI API. So far, the only out-of-the-box option I can see is to use multi-aspect modelling and then repurpose the OpenAI representation as my labels using set_topic_labels. Is there a neater way to do this by modifying what the class returns?

Thanks again!

MaartenGr commented 1 year ago

So training in batches that correspond to time slices doesn't mean you'd achieve a better topic representation for documents at time X than if you trained on everything at once and then looked at the local representation at X?

Indeed. Typically, training with the entire dataset results in more accurate representations.

So far, the only out-of-the-box option I can see is to use multi-aspect modelling and then repurpose the OpenAI representation as my labels using set_topic_labels. Is there a neater way to do this by modifying what the class returns?

This is indeed how you would approach generating custom labels.
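For completeness, a hedged sketch of that multi-aspect set-up: the keyword-based representation stays the main one, while an LLM-generated aspect is re-used as the custom labels. The exact OpenAI representation signature depends on your BERTopic and openai versions, the aspect name "OpenAI" is arbitrary, and `load_docs_for_class` is a hypothetical helper.

```python
import openai
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI

docs = load_docs_for_class("complaints")  # hypothetical helper

client = openai.OpenAI(api_key="sk-...")  # recent BERTopic versions take a client object

representation_model = {
    # Main representation: the KeyBERT + MMR keywords used for interpretation
    "Main": [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
    # Extra aspect: a short label generated by an LLM from keywords and example docs
    "OpenAI": OpenAI(client, model="gpt-4o-mini", chat=True),
}

topic_model = BERTopic(representation_model=representation_model)
topics, _ = topic_model.fit_transform(docs)

# Re-use the LLM aspect as custom labels while keeping the keyword representation
llm_labels = {
    topic: values[0][0]
    for topic, values in topic_model.topic_aspects_["OpenAI"].items()
}
topic_model.set_topic_labels(llm_labels)
```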

bexy-b commented 1 year ago

Great, thank you again!