TommyJones / tidylda

Implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).

Commonly get "log-probabilities have to be finite" error from `create_lexicon` #36

Closed TommyJones closed 4 years ago

TommyJones commented 4 years ago

Seems like sparsity may be an issue. But I need to track down what's causing this.

While I'm at it, it would be good to put together a reprex.

TommyJones commented 4 years ago

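The problem seems to come from entries of `phi_initial[k, v]` that are exactly zero. A sparse Dirichlet parameter (very small concentration values) passed to `gtools::rdirichlet` makes some components of a draw underflow to zero, and the log of a zero probability is `-Inf`, hence the error.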

I've reproduced this behavior with three different libraries.

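```r
# Count exact zeros per draw from a sparse, high-dimensional Dirichlet
# (14843 components, alpha = 0.01 each) in three implementations.

set.seed(90210)
gt_dir <- gtools::rdirichlet(n = 1000, alpha = rep(0.01, 14843))
summary(rowSums(gt_dir == 0))  # zeros per draw, gtools

set.seed(90210)
mc_dir <- MCMCpack::rdirichlet(n = 1000, alpha = rep(0.01, 14843))
summary(rowSums(mc_dir == 0))  # zeros per draw, MCMCpack

set.seed(90210)
dr_dir <- DirichletReg::rdirichlet(n = 1000, alpha = rep(0.01, 14843))
summary(rowSums(dr_dir == 0))  # zeros per draw, DirichletReg
```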
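
The same effect can be sketched in base R, without any of the three packages. This assumes the standard construction of a Dirichlet draw as independent gamma variates divided by their sum; it is an illustration of the mechanism, not the code of any of the libraries above. R's own documentation for `rgamma` warns that small shape values can produce results that are represented as zero. The variable names and the choice of 20 draws are only for this example.

```r
# Base-R sketch (assumption: Dirichlet draw = normalized gamma variates).
set.seed(90210)

n_draws <- 20      # number of Dirichlet draws (kept small for speed)
n_terms <- 14843   # same dimension as the reprex above
alpha   <- 0.01    # same sparse concentration parameter

# One gamma variate per cell; with a shape this small, some can underflow to 0.
g <- matrix(rgamma(n_draws * n_terms, shape = alpha, rate = 1),
            nrow = n_draws, byrow = TRUE)

# Normalize each row so it sums to 1, giving the Dirichlet draws.
draws <- g / rowSums(g)

# Zeros per draw, and whether taking logs produces non-finite values.
zeros_per_draw <- rowSums(draws == 0)
summary(zeros_per_draw)
any(is.infinite(log(draws)))
```

Any entry that is exactly zero becomes `-Inf` under `log()`, which is the kind of value the "log-probabilities have to be finite" check rejects.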

TommyJones commented 4 years ago

A possible patch is to add `.Machine$double.eps` to each draw. Will explore...
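
A minimal sketch of that idea, for illustration only. `rdirichlet_eps` is a hypothetical helper written for this example (it is not a function from tidylda or gtools), and it again assumes the normalized-gamma construction of a Dirichlet draw rather than calling any package.

```r
# Hypothetical helper for illustration; not part of tidylda or gtools.
rdirichlet_eps <- function(n, alpha) {
  k <- length(alpha)
  # Gamma variates, filled row-wise so column j uses alpha[j].
  g <- matrix(rgamma(n * k, shape = alpha, rate = 1), nrow = n, byrow = TRUE)
  draws <- g / rowSums(g)
  # Shift every entry by machine epsilon so none is exactly zero.
  draws + .Machine$double.eps
}

set.seed(90210)
phi_init <- rdirichlet_eps(n = 10, alpha = rep(0.01, 14843))

min(phi_init) >= .Machine$double.eps  # no entry is below machine epsilon
all(is.finite(log(phi_init)))         # logs are now finite
```

One side effect of this sketch is that each row sums to slightly more than 1 (by the number of columns times `.Machine$double.eps`). Whether to renormalize afterwards is a design choice this example leaves open.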

TommyJones commented 4 years ago

Unit test:

```r
m <- tidylda(
    dtm = textmineR::nih_sample_dtm,
    k = 10,
    iterations = 20,
    burnin = 15,
    alpha = 0.05,
    beta = 0.01,  # sparse prior that triggers the error
    optimize_alpha = FALSE,
    calc_likelihood = TRUE,
    calc_r2 = FALSE,
    return_data = FALSE
)
```

The above fails, specifically because `beta = 0.01` is too sparse.

TommyJones commented 4 years ago

Fixed by adding machine epsilon to the Dirichlet draws used for initialization: a42d38e531cb0324151c51aacea506088f9b645e