dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Document-Topic Memberships #267

Closed fishdontfly closed 6 years ago

fishdontfly commented 6 years ago

What is the best way to determine an appropriate threshold for document-topic probabilities when tagging documents with topics? I've noticed that the probabilities returned by fit_transform can sometimes be counter-intuitive.

For example, a text-rich document that genuinely covers multiple topics might get a probability of 0.25 for each of 4 topics, while a nonsensical, sparse document could get a probability of 0.4 for a single topic, simply because the normalized probabilities must sum to 1.
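A toy illustration of the problem described above (the numbers are made up for this example): a fixed threshold skips the genuinely multi-topic document but tags the noisy one.

```r
# Hypothetical doc-topic probability rows, as returned by fit_transform():
rich   <- c(0.25, 0.25, 0.25, 0.25)  # long doc spread evenly over 4 topics
sparse <- c(0.40, 0.20, 0.20, 0.20)  # short/noisy doc, one topic dominates

threshold <- 0.3
which(rich   >= threshold)  # integer(0): the rich document gets no tags
which(sparse >= threshold)  # 1: the noisy document gets tagged
```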

dselivanov commented 6 years ago

I don't see an issue with the first case: according to the model, the document belongs to 4 topics in equal proportions. If that doesn't match human judgement, you may need to try different hyper-parameters.

The second case happens because, as you mentioned, the probabilities are normalized to sum to 1. This is where the doc-topic prior helps. By default, transform and fit_transform don't add the prior (which is not correct according to the LDA model definition, but gives much sparser doc-topic assignments and works well in practice). So if your texts are short, you may want to add the priors; this makes the model less confident about topic assignments, which is essentially regularization. Check the code here - https://github.com/dselivanov/text2vec/blob/c3196d8655709c20d82f9946cf8a041d1c7f5364/R/model_LDA.R#L32-L34
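A minimal sketch of the regularization described above, done post hoc: recover approximate per-topic token counts from the normalized doc-topic matrix, add the prior back in, and renormalize. The `dtm` variable and all prior values are assumptions for illustration, not recommendations.

```r
library(text2vec)
library(Matrix)

# `dtm` is assumed to be a sparse document-term matrix built elsewhere
# with itoken()/create_dtm(); the prior values below are illustrative.
doc_topic_prior <- 0.1
lda <- LDA$new(n_topics = 10,
               doc_topic_prior = doc_topic_prior,
               topic_word_prior = 0.01)

# fit_transform() returns rows normalized to 1 *without* the prior
doc_topic <- lda$fit_transform(dtm, n_iter = 1000, convergence_tol = 1e-3)

# Post-hoc smoothing: scale each row by the document length to get
# approximate counts, add the prior, and renormalize. Short documents
# are pulled toward the uniform distribution, so a single topic in a
# sparse document no longer looks spuriously confident.
counts   <- doc_topic * Matrix::rowSums(dtm)
smoothed <- (counts + doc_topic_prior) /
            rowSums(counts + doc_topic_prior)
```

The effect is strongest for short documents: their raw counts are small relative to the prior, so the smoothed distribution is flatter, while long documents are barely changed.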