MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Why BERTopic instead of LDA, NMF... #486

Closed by TAsUjxnMIL 2 years ago

TAsUjxnMIL commented 2 years ago

Hi Maarten, could you tell me why BERTopic should be preferred over other topic modeling techniques like LDA and NMF? I know that these techniques are more time-intensive because of the hyperparameters that need to be set. But is there also an advantage in terms of performance? BERTopic uses BERT embeddings, so contextual information is taken into account. This is not the case for LDA, which, as far as I know, uses a bag-of-words text representation. Does this positively influence BERTopic's performance?

I'm interested in your opinion on this and look forward to hearing from you. Thank you very much.

MaartenGr commented 2 years ago

To understand a bit more about the benefits of BERTopic, I would advise you to read the paper where you can find some more information about the advantages and disadvantages of such a method.

Having said that, as with every topic model out there, there are pros and cons to BERTopic. Without going into too much detail here, there are a few advantages to BERTopic. As you mentioned, contextual embeddings can capture the contextual nature of the text. The modular structure of BERTopic (embeddings, UMAP, HDBSCAN, c-TF-IDF) makes for a very flexible algorithm that can easily adapt to new advancements in language models, clustering techniques, dimensionality reduction techniques, etc. Thus far, every time a new sentence-transformer has been released, the resulting quality of topics has improved (at least in my opinion). The same should apply to the other components. Moreover, and I believe this should not be underestimated, c-TF-IDF works quite well for extracting topic representations from clusters of documents without relying on centroid-based extraction, which has its share of problems.
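For reference, the c-TF-IDF weighting mentioned above can be written as follows. The notation follows the BERTopic paper; the library's implementation may differ in details such as normalization, so treat this as a summary rather than a specification:

$$
W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{tf_t}\right)
$$

Here $tf_{t,c}$ is the frequency of term $t$ in class (cluster) $c$, $tf_t$ is the frequency of term $t$ across all classes, and $A$ is the average number of words per class. Terms that are frequent in one cluster but rare overall receive the highest weights.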

However, that does not mean that BERTopic should always be preferred over LDA and NMF. For example, although BERTopic supports several variations of topic modeling (e.g., dynamic topic modeling, DTM), LDA has been the most used topic model for a reason: it is easy to implement and it has quite a number of variations that may suit your use case. Moreover, if you do not have a GPU, embedding documents in BERTopic can take considerably more time than fitting LDA.

Generalizing the "no free lunch theorem" a bit, I believe that, especially with topic modeling, there will not be a single technique that outperforms all others in every use case. To a certain extent, evaluating and interpreting topic modeling output is quite subjective, which is why I think there will be plenty of use cases where LDA outperforms BERTopic and vice versa.

In other words, although I definitely enjoy seeing users use BERTopic, I urge you to explore the classical methods, such as LDA and NMF, and compare how they perform in your specific use case. Although not "state-of-the-art", these models can create interesting and useful topic representations.

A bit longer than I had anticipated, and I am not sure how much structure there is to the above but I hope this helped a bit 😅

TAsUjxnMIL commented 2 years ago

Thank you very much, this helped me a lot. I have looked into your paper. Could you tell me the difference between topic representation and topic generation?

MaartenGr commented 2 years ago

No problem! I am not entirely sure there is a set definition of either, but topic representation generally refers to the words that describe a topic, whereas topic generation refers to the creation of the topics themselves, not necessarily their representation. In the case of BERTopic, the topic generation step can be seen as the clustering step, whereas the topic representation step is achieved by using c-TF-IDF.
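To make the distinction concrete, here is a small pure-Python sketch. The cluster labels are hard-coded to stand in for the topic generation step (in BERTopic they would come from clustering the embeddings), and the hypothetical function `ctfidf_top_words` performs a simplified, c-TF-IDF style topic representation step. It is an illustration under these assumptions, not the library's implementation.

```python
import math
from collections import Counter, defaultdict


def ctfidf_top_words(docs, labels, top_n=2):
    """Rank the words of each cluster with a c-TF-IDF style weight."""
    # Treat all documents of a cluster as one bag of words.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.lower().split())

    # Frequency of every term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    topics = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        # Relative term frequency in the cluster times a log-scaled
        # inverse frequency across all clusters.
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        topics[label] = ranked[:top_n]
    return topics


docs = [
    "cat chased mouse",
    "dog chased cat",
    "stock market fell sharply",
    "investors sold stock",
]

# Topic generation: which documents belong together (assumed given here).
labels = [0, 0, 1, 1]

# Topic representation: which words describe each group.
topics = ctfidf_top_words(docs, labels)
print(topics)
```

Changing `labels` changes which topics exist (generation), while changing the scoring inside `ctfidf_top_words` changes only how the same topics are described (representation).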