bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License
548 stars 62 forks

HDP Model document-topic distribution and topic-word distribution does not sum to 1 #138

Open alexs131 opened 2 years ago

alexs131 commented 2 years ago

Hello, I have encountered an issue where the topic-word distribution, like the document-topic distribution, does not sum to 1. I am running version 0.12.1 with the hyperparameters tw=TermWeight.PMI, gamma=1, alpha=0.1, eta=0.001, initial_k=20, seed=1. I previously ran the HDP model on a different, larger dataset and did not encounter this issue.
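For anyone who wants to check their own model, here is a minimal sketch of how the sums could be inspected. The helper names `deviation_from_one` and `report_hdp_sums` and the tolerance of 1e-4 are made up for illustration. The sketch assumes a trained `tomotopy.HDPModel` exposing `k`, `is_live_topic`, `get_topic_word_dist`, `docs` and `Document.get_topic_dist`.

```python
from math import fsum


def deviation_from_one(dist):
    """Return |sum(dist) - 1|, using fsum so the check itself adds no rounding drift."""
    return abs(fsum(dist) - 1.0)


def report_hdp_sums(mdl, tol=1e-4):
    """List (kind, index, deviation) for every distribution that is off by more than tol.

    `mdl` is assumed to be a trained tomotopy HDPModel; `tol` is an arbitrary
    illustrative threshold, not a value taken from the library.
    """
    bad = []
    # Topic-word distributions, skipping topics that HDP has emptied out.
    for k in range(mdl.k):
        if not mdl.is_live_topic(k):
            continue
        dev = deviation_from_one(mdl.get_topic_word_dist(k))
        if dev > tol:
            bad.append(("topic-word", k, dev))
    # Document-topic distributions.
    for i, doc in enumerate(mdl.docs):
        dev = deviation_from_one(doc.get_topic_dist())
        if dev > tol:
            bad.append(("doc-topic", i, dev))
    return bad
```

With the settings from this report, the model would be built as `tp.HDPModel(tw=tp.TermWeight.PMI, gamma=1, alpha=0.1, eta=0.001, initial_k=20, seed=1)`, filled with `add_doc`, trained with `train`, and then passed to `report_hdp_sums`. An empty list means every checked distribution sums to 1 within the tolerance.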

Thanks for any help here and apologies if this is a misunderstanding on my part.

bab2min commented 2 years ago

Hi @alexs131, thank you for reporting the bug. It seems to be caused by floating-point precision errors.

https://github.com/bab2min/tomotopy/blob/926f6ff34599a19d20b322f8b1a13fe66e8c5986/src/TopicModel/HDPModel.hpp#L493-L506

Currently, the numerator (doc.numByTopic) and the denominator (doc.getSumWordWeight()) of the topic distribution are stored separately, and it seems that errors in these values accumulate during training, especially on smaller datasets.
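The effect described above can be reproduced with a toy model. This is only a sketch and not tomotopy's actual sampler: it assumes 32-bit float storage, random fractional token weights standing in for PMI weights, and random topic reassignment in place of Gibbs sampling. The names `to_f32` and `simulate_drift` are made up for illustration.

```python
import random
import struct
from math import fsum


def to_f32(x):
    """Round a Python float to the nearest 32-bit float (assumed storage precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def simulate_drift(num_topics=5, num_tokens=200, sweeps=500, seed=1):
    """Update per-topic weights incrementally while the total is stored separately."""
    rng = random.Random(seed)
    # Fractional weights, a stand-in for PMI term weighting.
    weights = [to_f32(rng.uniform(0.01, 3.0)) for _ in range(num_tokens)]
    assign = [rng.randrange(num_topics) for _ in range(num_tokens)]

    num_by_topic = [0.0] * num_topics  # numerator, updated on every reassignment
    for w, z in zip(weights, assign):
        num_by_topic[z] = to_f32(num_by_topic[z] + w)
    sum_word_weight = to_f32(fsum(weights))  # denominator, computed once

    for _ in range(sweeps):
        for i, w in enumerate(weights):
            old = assign[i]
            num_by_topic[old] = to_f32(num_by_topic[old] - w)  # remove token
            new = rng.randrange(num_topics)  # random move, not real Gibbs sampling
            assign[i] = new
            num_by_topic[new] = to_f32(num_by_topic[new] + w)  # add token back
    return num_by_topic, sum_word_weight


num_by_topic, sum_word_weight = simulate_drift()
# Dividing by the separately stored total exposes the accumulated rounding error.
raw = [n / sum_word_weight for n in num_by_topic]
# Dividing by the current sum of the numerators always yields a sum of 1.
renorm = [n / fsum(num_by_topic) for n in num_by_topic]
print("stored denominator:", fsum(raw))
print("renormalized:      ", fsum(renorm))
```

Each subtraction and addition rounds the running per-topic value, so after many sweeps the numerators no longer add up exactly to the stored total. In this toy setup, dividing by the current sum of the numerators hides the mismatch, though that says nothing about how the library itself should be fixed.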

I'll investigate this problem more.