TommyJones / tidylda

Implements an algorithim for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).
Other
41 stars 3 forks source link

use RcppThread to parallelize Gibbs sampler #11

Closed TommyJones closed 4 years ago

TommyJones commented 4 years ago

Make it run kind of like Mallet using Newman et al.'s algo

Also, keep the sequential algo in case someone sets to one core

TommyJones commented 4 years ago

When we get to the main Gibbs iteration for loop, we'd have an "if" statement that runs a single threaded Gibbs sampler or multi-threaded approximate Gibbs sampler, depending on how many CPUs the user chooses.

The multi-threaded sampler would also need to divide documents and Cd into batches and provide a batch copy of Cv and Ck. We'd parallelize over batches instead of documents. Then, within each batch, call the main sampling function called out in #10 passing the local objects. After each loop, we update Cv and Ck globally and copy those back to the local batches. (And we'd calc likelihood/optimize alpha/aggregate counts after burnin if the options are set.)

TommyJones commented 4 years ago

Another idea, possibly not a good one:

run parallel without creating batches or special objects. (Still need to transpose Cd for column major or make it a list.) Instead of updating a vector of topic probabilities (qz), initialize one on the fly for each thread and calculate the probabilities based on current values of Cd, Cv, and Ck.

Three problems I see with this:

  1. There is still a risk of collision. Specifically, probabilities might be slightly off based on ever fluctuating counts. That's probably not a big issue.
  2. A bigger issue might be a value in Cd or Cv going negative due to decrementing the same slot from multiple threads at once. Again, this could be due to collision
  3. The algorithm would now be well and truly stochastic regardless of any random seed one sets

The best way to handle (1) and (3) is to communicate to the user explicitly that if they need determinism to only use the single threaded version.

Best I've got for (2) is pray it doesn't happen. I could put in an "if" to say don't decrement if it's zero, but that could make the counts get off in a way that is impossible to fix. I could also put in a pause, but (a) there's no guarantee an entry would be non-zero after waiting and (b) it might never terminate if it reaches zero and just stays there. i.e. the program will keep waiting and never stop running

TommyJones commented 4 years ago

The idea above failed about as hard as it could have. First run crashed my R session. Subsequent runs pretty much always threw a negative probabilities errors, probably due to collisions where the same slot was subtracted more than once, resulting in Cv(k,v) going negative.

Trying a different approach...

TommyJones commented 4 years ago

@TommyJones, don't forget to verify that setting a random seed still works even with multithreading.

TommyJones commented 4 years ago

Gibbs is parallel as of 5549ce180054729f552e32725fe2d009ebe18b93

I'm going to open a new issue about verifying the sampler for multithreaded processes