MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Calculating Topic Diversity in BERTopic using OCTIS #628

Closed: ovi97 closed this issue 2 years ago

ovi97 commented 2 years ago

Discussed in https://github.com/MaartenGr/BERTopic/discussions/627

Originally posted by **ovi97** July 20, 2022

Hello @MaartenGr. Thank you for making BERTopic so flexible and topic modelling so fun to use. I have several implementations of BERTopic using different sentence transformers, dimensionality reduction, and clustering techniques, and I would love to evaluate these models numerically using different metrics. I have been able to calculate the coherence scores using Gensim, but I also want to calculate Topic Diversity, Pairwise Jaccard Similarity, and other diversity and similarity metrics using the OCTIS library. However, when I checked their implementation [here](https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_LDA_training_only.ipynb#scrollTo=18Ayd5ZaBrSp), these measures need a dictionary of:

1. `'test-topic-document-matrix'`: an array
2. `'topic-document-matrix'`: an array
3. `'topic-word-matrix'`: an array
4. `'topics'`: a list of the words in each topic

I can get the parameters in item 4, but cannot get the matrices in the first three. What do these arrays mean, and how can they be computed in BERTopic? Thanks

MaartenGr commented 2 years ago

Thank you for posting this question. Seeing as there is a duplicate in the discussions, I will close this issue now in favor of that discussion.
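
For readers landing on this closed issue, here is a minimal sketch of the two metrics named in the question, computed from the topic word lists alone (item 4 above). It does not call OCTIS or BERTopic. The functions `topic_diversity` and `pairwise_jaccard` and the toy `topics` list are hypothetical names and data invented for illustration. The sketch assumes the usual definitions: topic diversity as the share of unique words among the top-k words of all topics, and pairwise Jaccard similarity as the mean Jaccard index over all pairs of topics. With a fitted BERTopic model, the word lists could presumably be built from `topic_model.get_topic(topic_id)`, which returns `(word, score)` pairs. The three matrices should only matter for metrics that compare topic distributions rather than word lists.

```python
from itertools import combinations


def topic_diversity(topics, topk=10):
    """Share of unique words among the top-k words of every topic (1.0 = no overlap)."""
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


def pairwise_jaccard(topics, topk=10):
    """Mean Jaccard index over all topic pairs (0.0 = fully distinct topics)."""
    if len(topics) < 2:
        raise ValueError("need at least two topics to compare")
    similarities = []
    for first, second in combinations(topics, 2):
        a, b = set(first[:topk]), set(second[:topk])
        union = a | b
        # Two empty topics share nothing, so treat their similarity as 0.
        similarities.append(len(a & b) / len(union) if union else 0.0)
    return sum(similarities) / len(similarities)


# Hypothetical toy topics; in practice each inner list holds one topic's top words,
# e.g. [word for word, _ in topic_model.get_topic(topic_id)] for a fitted BERTopic model.
topics = [
    ["cat", "dog", "pet"],
    ["dog", "bone", "walk"],
    ["stock", "market", "trade"],
]

print(topic_diversity(topics, topk=3))
print(pairwise_jaccard(topics, topk=3))
```

Higher diversity and lower Jaccard similarity both indicate less redundant topics. When comparing several BERTopic configurations, keep `topk` fixed across models so that the scores are comparable.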