TommyJones / tidylda

Implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).

Change algorithm for setting prior on new topics in refit.tidylda #51

Open TommyJones opened 2 years ago

TommyJones commented 2 years ago

See approximately lines 268 to 285 https://github.com/TommyJones/tidylda/blob/4b92e5a603736b9650ce58295a6b5f249d6c8b89/R/refit.tidylda.R#L268

Currently, the priors for new topics are just the means of the priors for the old topics (after reweighting and vocabulary alignment). But that overweights tokens from the base model's training data. In theory, new topics should come only from the new data: we are assuming that new topics are "new" and thus would arise only from the new data.
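
To make the contrast concrete, here is a minimal base-R sketch with toy numbers. It is not the code in `refit.tidylda.R`: the names `eta_old`, `new_counts`, `smooth`, `eta_new_mean`, and `eta_new_data` are hypothetical. The "new data only" variant is just one possible option, assuming the new-topic prior takes its shape from token frequencies in the new data and its total mass from the average old topic.

```r
# Toy aligned prior for 2 existing topics over a 4-token vocabulary
# (rows = topics, columns = tokens). All values are made up.
eta_old <- matrix(
  c(0.4, 0.3, 0.2, 0.1,
    0.1, 0.2, 0.3, 0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t1", "t2"), c("a", "b", "c", "d"))
)

# Hypothetical token counts observed in the new data only.
new_counts <- c(a = 0, b = 0, c = 30, d = 10)

# Behaviour described in this issue: the new-topic prior is the
# column-wise mean of the old-topic priors.
eta_new_mean <- colMeans(eta_old)

# Alternative sketch (an assumption, not a settled design): take the
# shape from the new data and keep the total mass of an average old topic.
smooth <- 0.01  # small additive smoothing so no token gets a zero prior
shape <- (new_counts + smooth) / sum(new_counts + smooth)
eta_new_data <- shape * mean(rowSums(eta_old))

# Side-by-side comparison of the two candidate priors for one new topic.
print(rbind(mean_of_old = eta_new_mean, from_new_data = eta_new_data))
```

In this toy case the mean-of-old prior spreads mass evenly over all four tokens, including `a` and `b`, which never appear in the new data. The data-driven variant concentrates mass on `c` and `d`. How much total mass a new topic's prior should carry is still an open question.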

This is a thorny issue with no obvious default. Might need to do more algebra/PhD research to get an opinionated solution here.