MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

How to calculate entropy with BERTopic #2186

Open bruceszq opened 1 month ago

bruceszq commented 1 month ago

Have you searched existing issues? 🔎

Describe the bug

Hi everyone, first of all I would like to thank @MaartenGr and all the contributors for this amazing project. For my project, I need to calculate the entropy of each topic. Could you help me with how to calculate entropy in BERTopic? I tried to calculate it from probs, but I got an error because probs is a 1-dimensional array, while my code requires a 2-dimensional array. Thank you very much!

Reproduction

```python
import numpy as np
import pandas as pd

# probs returned by BERTopic's fit_transform
doc_topic_matrix = np.array(probs)

# Fails here: probs is 1-dimensional, so axis=1 does not exist
normalized_doc_topic_matrix = doc_topic_matrix / doc_topic_matrix.sum(axis=1, keepdims=True)

topic_entropy = (-normalized_doc_topic_matrix * np.log2(normalized_doc_topic_matrix + 1e-9)).sum(axis=0)

entropy_df = pd.DataFrame({'Topic': range(len(topic_entropy)), 'Entropy': topic_entropy})

topic_freq['Entropy'] = sorted_entropy_df['Entropy'].values
```

BERTopic Version

0.16.4

MaartenGr commented 1 month ago

In order to get 2-dimensional probabilities, you would need to set calculate_probabilities=True when initializing BERTopic. By default, probs only holds the probability of each document's assigned topic, which is why it is a 1-dimensional array.
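A minimal sketch of how this could look: with calculate_probabilities=True, fit_transform returns a (n_docs, n_topics) probability matrix, which can then be passed to an entropy helper. The helper below (topic_entropy is a hypothetical name, not part of the BERTopic API) mirrors the computation from the reproduction above, and a synthetic matrix stands in for a fitted model so the snippet runs without training:

```python
import numpy as np

# With a real model, the 2-D probs would come from:
#   topic_model = BERTopic(calculate_probabilities=True)
#   topics, probs = topic_model.fit_transform(docs)

def topic_entropy(probs):
    """Per-topic entropy from a (n_docs, n_topics) probability matrix."""
    probs = np.asarray(probs)
    if probs.ndim != 2:
        raise ValueError(
            "probs must be 2-dimensional; "
            "initialize BERTopic with calculate_probabilities=True"
        )
    # Normalize each topic column into a distribution over documents
    p = probs / (probs.sum(axis=0, keepdims=True) + 1e-9)
    # Shannon entropy (base 2) per topic
    return (-p * np.log2(p + 1e-9)).sum(axis=0)

# Synthetic document-topic probabilities: 100 docs, 5 topics
rng = np.random.default_rng(0)
fake_probs = rng.dirichlet(np.ones(5), size=100)
print(topic_entropy(fake_probs).shape)  # one entropy value per topic: (5,)
```

Note that the columns are normalized over documents (axis=0) so that each topic's entropy is computed over a proper distribution; a topic whose probability mass is spread evenly across documents gets the maximum entropy of log2(n_docs).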