TommyJones / tidylda

Implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).

Revert to sequential Gibbs sampling #48

Closed TommyJones closed 2 years ago

TommyJones commented 3 years ago

Parallel approximate Gibbs as implemented in 5549ce1 presents three problems:

  1. Parallel sampling - The current implementation calls the R API from multiple threads, which is unstable and a deal-breaker for CRAN. Most of the fixes I can think of would further increase code complexity, make it harder to respect R's set.seed(), and increase the number of dependencies.
  2. Model quality - I see a big drop-off in R-squared and coherence when fitting models with parallel Gibbs, though visual inspection of the top words in each topic seems OK.
  3. Single-threaded speed - On my (very powerful) Ubuntu 20.04 machine, the single-threaded sampler is slower than on my MacBook. Before this change, it was blazingly fast, at least for single-threaded models.
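
For context on the reproducibility point in item 1, here is a minimal sketch of what a sequential collapsed Gibbs sampler for LDA looks like. It is not tidylda's actual sampler (which is compiled code with extra features); it is plain Python written only to illustrate the idea. The function name `fit_lda_sequential`, its parameters, and the toy corpus are all hypothetical. Because every draw comes from one seeded generator in a fixed order, two runs with the same seed give identical results, which is the property that becomes hard to keep once several threads sample at the same time.

```python
import random


def fit_lda_sequential(docs, n_topics, vocab_size, alpha=0.1, beta=0.05,
                       iterations=50, seed=0):
    """Illustrative collapsed Gibbs sampler: one token at a time, one RNG."""
    rng = random.Random(seed)  # the only source of randomness in the run

    doc_topic = [[0] * n_topics for _ in docs]                 # tokens per topic, per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                               # total tokens per topic
    assignments = []                                           # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized full conditional over topics for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]

                # Draw a new topic and put the token back into the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return assignments, doc_topic, topic_word


# Toy corpus: each document is a list of integer word ids (made up for illustration).
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 2], [3, 5, 5, 4]]
run_a = fit_lda_sequential(docs, n_topics=2, vocab_size=6, seed=42)
run_b = fit_lda_sequential(docs, n_topics=2, vocab_size=6, seed=42)
print(run_a == run_b)  # same seed, same draw order, so the runs match
```

Every token update reads counts that the previous update just wrote, so the loop is inherently serial. Parallel approximations relax that dependency by letting threads work from slightly stale counts, which is one plausible place for the quality differences in item 2 to come from.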

This package has been nearly ready for a year and still isn't on CRAN. My goal is to revert and then get it on CRAN with a message that the API is still unstable. I will then (well, in parallel, no pun intended) work on the Rust implementation of WarpLDA with the unique features I've added to this Gibbs sampler. A future version will either use only the Rust implementation or offer the option to switch the engine to the WarpLDA sampler.

Note that this will effectively nullify the need to address #20 and possibly #41.

TommyJones commented 2 years ago

c39c005