MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Compare LDA, NMF, LSA with BERTopic (w/ embedding: all-MiniLM-L6-v2 + dim_red: UMAP + cluster: HDBSCAN) #2009

Open abis330 opened 1 month ago

abis330 commented 1 month ago

Hi @MaartenGr ,

Given a dataset of texts, we want to extract topics using LDA, NMF, LSA and BERTopic (w/ embedding: all-MiniLM-L6-v2 + dim_red: UMAP + cluster: HDBSCAN).
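For context, the BERTopic side of the comparison is set up roughly as follows (a minimal sketch; the hyperparameters are illustrative and `docs` stands for the list of texts):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Embedding, dimensionality reduction, and clustering components of the pipeline
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, probs = topic_model.fit_transform(docs)  # docs: list of raw text documents
```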

In order to select the best algorithm for this dataset, the intuition was to optimize a mathematical combination of an applicable topic coherence measure and an applicable topic diversity measure. In one of the previous issues, #90, I observed that when calculating topic coherence you treated the concatenation of all texts belonging to a cluster as a single document.
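For reference, that calculation looks roughly like the sketch below (reconstructed from memory of #90, assuming a fitted `topic_model`, the assigned `topics`, and the original `docs`):

```python
import pandas as pd
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Concatenate all documents assigned to the same topic into a single "document"
documents = pd.DataFrame({"Document": docs, "Topic": topics})
documents_per_topic = documents.groupby("Topic", as_index=False).agg({"Document": " ".join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Tokenize with the same analyzer that BERTopic's CountVectorizer uses
analyzer = topic_model.vectorizer_model.build_analyzer()
tokens = [analyzer(doc) for doc in cleaned_docs]

dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics)) - 1)]  # skip the -1 outlier topic

coherence_model = CoherenceModel(topics=topic_words, texts=tokens, corpus=corpus,
                                 dictionary=dictionary, coherence="c_v")
coherence = coherence_model.get_coherence()
```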

However, for calculating topic coherence for LDA, LSA and NMF, we simply take the BoW representation of the given texts and calculate topic coherence from that.

To the best of my understanding, shouldn't we ensure that the corpus and dictionary passed when initializing the CoherenceModel object from gensim.models.coherencemodel are the same for BERTopic and LSA/LDA/NMF, so that the topic coherence values are actually comparable across all algorithms and we can select the one with the highest topic coherence?
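Concretely, what I have in mind is building the dictionary and corpus once from the raw tokenized texts and scoring every model's topic words against those same objects, along these lines (the whitespace tokenization and the `*_topic_words` lists are only placeholders):

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# One shared preprocessing / dictionary / corpus for all models
tokenized_docs = [doc.lower().split() for doc in docs]
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

def score(topic_words):
    # topic_words: list of lists containing the top-n words of each topic
    return CoherenceModel(topics=topic_words, texts=tokenized_docs, corpus=corpus,
                          dictionary=dictionary, coherence="c_v").get_coherence()

scores = {
    "LDA": score(lda_topic_words),
    "NMF": score(nmf_topic_words),
    "LSA": score(lsa_topic_words),
    "BERTopic": score(bertopic_topic_words),
}
```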

Apologies for such a long description.

Thanks, Abi

MaartenGr commented 1 month ago

To the best of my understanding, shouldn't we ensure that the corpus and dictionary passed when initializing the CoherenceModel object from gensim.models.coherencemodel are the same for BERTopic and LSA/LDA/NMF, so that the topic coherence values are actually comparable across all algorithms and we can select the one with the highest topic coherence?

It depends. Although we would typically like to approach it with the same corpus/dictionary, that would also mean being constrained to the same types of representations as the other models. Moreover, it means being limited to the c-TF-IDF representation, whereas you could also use other forms of representation in BERTopic. Personally, and as shown in the mentioned issue, I'm not a big fan of optimizing BERTopic for coherence/diversity, especially since it ignores all the additional representations that are integrated in the library. It is always interesting to see papers that use BERTopic and report coherence on the default pipeline without considering MMR, KeyBERTInspired, PartOfSpeech, or even LLM-based representations.
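As a rough illustration of what I mean (the parameters are just examples, the aspect names "MMR" and "POS" are arbitrary, and PartOfSpeech additionally requires a spaCy model such as en_core_web_sm):

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

# Use KeyBERTInspired as the main representation and add MMR and POS as extra aspects
representation_model = {
    "Main": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
    "POS": PartOfSpeech("en_core_web_sm"),
}
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic(0, full=True)  # inspect all representations of a topic
```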

Also, consider the following: is the model with the highest coherence actually the best model? What is the definition of the best model in your particular use case? In all honesty, I highly doubt that optimizing for coherence/diversity is the answer here, which is why I typically advise people to first find the metrics that fit their use case. That might also mean, and I hope it does, that human evaluation (for instance, with domain experts) is considered, or even your own validation.