TommyJones / tidylda

Implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).

Randomly shuffle document indices between iterations when in parallel #40

Closed TommyJones closed 3 years ago

TommyJones commented 3 years ago

My hypothesis is that this measure will help with the convergence problems and poor fit we see when running Gibbs sampling in parallel.

The issue now is that the copies of Cv and Ck diverge once they are sent out to each core, and they are only combined at the end. The result is a likelihood that shows no sign of convergence and a low R-squared. On the other hand, coherence seems OK.

But if we shuffle the document indices that go to each core between iterations, my gut says that Cv and Ck won't diverge in the same way from one iteration to the next. So I'm thinking we might recover our guarantees of convergence (or something close to them), though convergence will take more iterations. The tradeoff is that with parallelism we can rip through iterations faster.
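To make the idea concrete, here is a minimal sketch of the shuffling step in Python. It is not the tidylda implementation: the function name `shuffled_partitions` and its parameters are made up for illustration, and it only shows how document indices could be re-dealt to workers on every iteration, not the Gibbs sweep or the merge of Cv and Ck.

```python
import random


def shuffled_partitions(n_docs, n_workers, n_iters, seed=0):
    """Yield one list of per-worker document batches for each iteration.

    Hypothetical helper: the names and the round-robin split are
    assumptions for illustration, not part of tidylda.
    """
    rng = random.Random(seed)
    indices = list(range(n_docs))
    for _ in range(n_iters):
        # Reorder the documents so each worker sees a different subset
        # on every iteration.
        rng.shuffle(indices)
        # Deal indices round-robin; batch sizes differ by at most one.
        yield [indices[w::n_workers] for w in range(n_workers)]


for iteration, batches in enumerate(
    shuffled_partitions(n_docs=10, n_workers=3, n_iters=2)
):
    print(iteration, batches)
```

Because no document stays pinned to the same core, any drift a worker introduces into its local counts is spread over different documents each time, which is the effect the hypothesis relies on.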

TommyJones commented 3 years ago

The implementation I'm currently writing does not respect set.seed. Leaving the fix for that to #20.
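For reference, one common pattern for reproducible parallel sampling is to derive a private seed for every worker and iteration from a single master seed. The Python sketch below illustrates that pattern only; it is not what #20 does, and `worker_seeds` and `sample_on_worker` are hypothetical names, with a stand-in in place of the real Gibbs sweep.

```python
import random


def worker_seeds(master_seed, n_workers, n_iters):
    """Return seeds[iteration][worker], all derived from one master seed.

    Hypothetical helper for illustration only.
    """
    master = random.Random(master_seed)
    return [
        [master.getrandbits(32) for _ in range(n_workers)]
        for _ in range(n_iters)
    ]


def sample_on_worker(seed, batch):
    """Stand-in for a Gibbs sweep over one batch of documents.

    Each call builds its own generator, so the draws do not depend on
    the order in which workers happen to run.
    """
    rng = random.Random(seed)
    return [(doc, rng.random()) for doc in batch]


seeds = worker_seeds(master_seed=42, n_workers=3, n_iters=2)
draws = sample_on_worker(seeds[0][0], batch=[4, 7, 1])
```

With this layout, rerunning with the same master seed reproduces every worker's draws, regardless of how the workers are scheduled.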

TommyJones commented 3 years ago

I'm going to switch out the sampler for one that is more natively parallel. Not messing with trying to optimize this one.