TommyJones / tidylda

Implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).

change algorithm for adding new vocab in refit.tidylda #50

Closed TommyJones closed 2 years ago

TommyJones commented 2 years ago

Proposed change: add zero weight for new vocabulary words. The logic is that since they don't appear in the old model, there should be no prior expectation of them appearing.

An alternative would be to place them in some sort of rank order based on their frequency in the new corpus. But IMO that needs more theoretical research (which I am doing in my PhD) before making a hard assertion about the prior.

TommyJones commented 2 years ago

Setting this to zero caused faults in the sampling probabilities, so it seems it should be set to a very small number instead.

My principal concern with a non-zero prior for new terms is that it distorts the prior weight. It might be better to set it to a smaller number, such as the lowest quantile or decile of the non-zero elements...
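
For illustration, here is one plausible way a zero weight could break sampling. This is a minimal sketch assuming a standard collapsed Gibbs update, not a trace of the actual tidylda failure. The function names and toy counts are hypothetical. A brand-new word has zero counts in every topic, so with a zero prior every unnormalized topic weight is zero and there is nothing to normalize.

```python
def topic_weights(doc_topic_counts, topic_word_counts, topic_totals,
                  alpha, beta_w, beta_sum):
    """Unnormalized collapsed-Gibbs weights for one token across K topics.

    Assumes the textbook update (n_dk + alpha_k) * (n_kw + beta_kw) / (n_k + sum_w beta_kw).
    All argument names are hypothetical.
    """
    return [
        (doc_topic_counts[k] + alpha[k])
        * (topic_word_counts[k] + beta_w[k])
        / (topic_totals[k] + beta_sum[k])
        for k in range(len(alpha))
    ]


def normalize(weights):
    """Turn weights into probabilities, failing loudly if they are all zero."""
    total = sum(weights)
    if total == 0.0:
        raise ZeroDivisionError("all topic weights are zero; cannot sample")
    return [w / total for w in weights]


# Toy state for a word that never appeared in the old model (made-up numbers).
doc_topic_counts = [3, 1]
topic_word_counts = [0, 0]      # new word: no counts in any topic
topic_totals = [100, 80]
alpha = [0.1, 0.1]
beta_sum = [5.0, 5.0]

zero_weights = topic_weights(doc_topic_counts, topic_word_counts, topic_totals,
                             alpha, [0.0, 0.0], beta_sum)
small_weights = topic_weights(doc_topic_counts, topic_word_counts, topic_totals,
                              alpha, [1e-3, 1e-3], beta_sum)

try:
    normalize(zero_weights)
    zero_prior_failed = False
except ZeroDivisionError:
    zero_prior_failed = True

small_probs = normalize(small_weights)
```

With any strictly positive weight the probabilities are well defined, which is why a very small number avoids the fault.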

TommyJones commented 2 years ago

Went with the lowest decile. It's still arbitrary, but at least it's closer to zero and any distortion on the prior weighting should be small (I hope).
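
A rough sketch of the idea, not the code in the commit: it reads "lowest decile" as the 10th-percentile cut point of the non-zero prior values, which is an assumption, and it treats the prior as a flat vector for simplicity. `extend_prior` and the toy vocabulary are hypothetical.

```python
from statistics import quantiles


def extend_prior(old_prior, old_vocab, new_vocab):
    """Carry old prior weights over and fill unseen terms with a small value.

    The fill value is the first decile cut point of the non-zero old weights
    (one possible reading of "lowest decile"; an assumption).
    """
    nonzero = [v for v in old_prior if v > 0]
    fill = quantiles(nonzero, n=10, method="inclusive")[0]
    lookup = dict(zip(old_vocab, old_prior))
    extended = [lookup.get(term, fill) for term in new_vocab]
    return extended, fill


# Toy example with made-up weights; "g" and "h" are new vocabulary terms.
old_vocab = ["a", "b", "c", "d", "e", "f"]
old_prior = [0.5, 0.2, 0.0, 0.1, 0.4, 0.3]
new_vocab = old_vocab + ["g", "h"]

extended, fill = extend_prior(old_prior, old_vocab, new_vocab)
```

The new terms get a weight that is strictly positive but sits at the low end of the existing non-zero weights, so the total prior mass they add stays small relative to the old vocabulary.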

ce90c9b