adjidieng / ETM

Topic Modeling in Embedding Spaces
MIT License

Most important topics interpretation #16

Open cuent opened 4 years ago

cuent commented 4 years ago

This may be a basic question, but I don't understand why the average of the weighted document-topic proportions is used as a metric for the most important topics.

thetaWeightedAvg = sums * theta
thetaWeightedAvg = thetaWeightedAvg  /  num_docs
print('\nThe 10 most used topics are {}'.format(thetaWeightedAvg.argsort()[::-1][:10]))

From my understanding, multiplying the document-topic probabilities theta by each document's frequency (sums) amplifies or reduces each probability according to that document's weight, and the average then gives some insight into which topics are important across the whole corpus. Is that right? Also, what would be the difference if we only averaged the document-topic proportions (no weighting)?
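To make the comparison concrete, here is a minimal sketch with toy numbers. It rests on assumptions that are not confirmed in this thread: that `sums` holds the total word count of each document with shape `(num_docs, 1)`, that `theta` has shape `(num_docs, num_topics)`, and that the weighted product is summed over documents before dividing. The data and the names `weighted_share` and `unweighted` are hypothetical.

```python
import numpy as np

# Hypothetical toy corpus: 3 documents, 2 topics. Each row of theta sums to 1.
theta = np.array([
    [0.9, 0.1],  # long document, mostly topic 0
    [0.2, 0.8],  # short document, mostly topic 1
    [0.3, 0.7],  # short document, mostly topic 1
])

# Assumption: sums is the total word count of each document.
sums = np.array([[1000.0], [10.0], [10.0]])
num_docs = theta.shape[0]

# Unweighted: every document counts equally, regardless of its length.
unweighted = theta.mean(axis=0)

# Weighted as in the snippet above (assuming a sum over documents).
weighted_over_docs = (sums * theta).sum(axis=0) / num_docs

# Same weighting, normalised by total word count instead, so the result
# reads as each topic's estimated share of all words in the corpus.
weighted_share = (sums * theta).sum(axis=0) / sums.sum()

# Rank topics from most to least used under each scheme.
ranking_unweighted = unweighted.argsort()[::-1]
ranking_weighted = weighted_share.argsort()[::-1]

print("unweighted:", unweighted, "ranking:", ranking_unweighted)
print("weighted share:", weighted_share, "ranking:", ranking_weighted)
```

In this toy case the two rankings disagree: topic 1 dominates in most documents, but topic 0 accounts for most of the words because it dominates the one long document. Dividing by `num_docs` or by the total word count changes only the scale, not the ranking.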

460176980 commented 4 years ago

I think the best topic can be selected according to the task requirements. For example, if you want the topics that are easiest to interpret, you can use the topic coherence (consistency) index; if you want to fit the data better, you can use the perplexity (confusion) index.

cuent commented 4 years ago

Could you please elaborate on the consistency/confusion index? I thought it was a way of selecting the most used topics based on doc_frequency and topic proportion.