Originally posted by **ovi97** July 20, 2022
Hello @MaartenGr. Thank you for making BERTopic so flexible; it makes topic modelling fun to use.
I have several implementations of BERTopic using different sentence transformers, dimensionality reduction methods, and clustering techniques.
I would love to evaluate these models numerically using different metrics. I have been able to calculate the coherence scores using Gensim, but I also want to calculate Topic Diversity, Pairwise Jaccard Similarity, and other diversity and similarity metrics using the OCTIS library.
However, when I checked their implementation [here](https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_LDA_training_only.ipynb#scrollTo=18Ayd5ZaBrSp), these measures need a dictionary with the following keys:
1. `test-topic-document-matrix`: an array
2. `topic-document-matrix`: an array
3. `topic-word-matrix`: an array
4. `topics`: a list of the words in each topic
I can get the word lists in item 4, but I cannot get the matrices in items 1 to 3.
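
As far as I can tell, the diversity-style metrics only look at the top words of each topic, so they may be computable from the word lists alone. Here is a minimal, dependency-free sketch of what I understand topic diversity and average pairwise Jaccard similarity to be. It does not call OCTIS, the function names are my own, and the toy topics are made up for illustration:

```python
from itertools import combinations


def topic_diversity(topics, topk=10):
    """Share of unique words among the top-k words of all topics (1.0 = no overlap)."""
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


def mean_pairwise_jaccard(topics, topk=10):
    """Average Jaccard similarity over every pair of topics, using top-k words each."""
    word_sets = [set(topic[:topk]) for topic in topics]
    pairs = list(combinations(word_sets, 2))
    if not pairs:
        return 0.0  # fewer than two topics: nothing to compare
    # Assumes every topic has at least one word, so each union is non-empty.
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical toy topics (in practice: the top words of each BERTopic topic).
topics = [
    ["cat", "dog", "pet"],
    ["dog", "bone", "walk"],
    ["stock", "market", "trade"],
]

diversity = topic_diversity(topics, topk=3)
jaccard = mean_pairwise_jaccard(topics, topk=3)
print(f"diversity={diversity:.3f}, mean pairwise Jaccard={jaccard:.3f}")
```

A higher diversity and a lower Jaccard value both indicate that the topics share fewer words.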
What do these arrays mean, and how can they be computed in BERTopic? Thanks.
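
To make the question concrete, this is the layout I believe OCTIS expects, based on my reading of its README: `topic-word-matrix` has one row per topic and one column per vocabulary word, while `topic-document-matrix` has one row per topic and one column per training document (the test variant is the same for held-out documents). The sketch below is only an assumption about how such a dictionary could be assembled; the helper names and toy numbers are hypothetical, and plain lists stand in for the arrays to keep it dependency-free. In BERTopic, candidate sources might be the c-TF-IDF matrix for the topic-word weights and the per-document probabilities returned by `fit_transform` with `calculate_probabilities=True`, but I have not verified that mapping.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def build_octis_style_output(topics, topic_word, doc_topic, test_doc_topic=None):
    """Assemble a dictionary using the key names listed above.

    topics:         list of word lists, one per topic
    topic_word:     rows = topics, columns = vocabulary words
    doc_topic:      rows = documents, columns = topics (transposed below)
    test_doc_topic: same as doc_topic, for held-out documents (optional)
    """
    output = {
        "topics": topics,
        "topic-word-matrix": topic_word,
        # Assumed layout: topics as rows, documents as columns.
        "topic-document-matrix": transpose(doc_topic),
    }
    if test_doc_topic is not None:
        output["test-topic-document-matrix"] = transpose(test_doc_topic)
    return output


# Hypothetical toy data: 2 topics, 3 vocabulary words, 4 documents.
toy_topics = [["cat", "dog"], ["stock", "market"]]
toy_topic_word = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
toy_doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.3, 0.7]]

output = build_octis_style_output(toy_topics, toy_topic_word, toy_doc_topic)
print(len(output["topic-document-matrix"]), "topics x",
      len(output["topic-document-matrix"][0]), "documents")
```

With real data the lists would presumably be NumPy arrays, since the notebook describes these values as arrays.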
Discussed in https://github.com/MaartenGr/BERTopic/discussions/627