clintpgeorge / tm

This repository has the implementations of different topic models such as Latent Dirichlet Allocation and Hierarchical Dirichlet Process

lda-hp: memory issue in storing and passing \theta, \beta, and \z #1

Open clintpgeorge opened 11 years ago

clintpgeorge commented 11 years ago

R crashed a couple of times when I ran the LDA Gibbs sampling algorithm on a large set of documents (test_lda_c.R). The crashes happened at the function call that transfers objects from C++ to R: the betas (a K x V x G array), the thetas (a K x D x G array), and Z (an N x G matrix), where K is the number of topics, V is the vocabulary size, D is the number of documents in the corpus, G is the number of saved MCMC iterations, and N is the number of word instances in the corpus.
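A rough size estimate shows how quickly these objects grow. The sketch below is a hypothetical helper, not code from the repository. It assumes 8-byte doubles for betas and thetas and 4-byte integers for Z, and it ignores any temporary copies made during the transfer to R.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the repository): rough storage needed
// for the saved samples, assuming 8-byte doubles for betas and thetas
// and 4-byte integers for Z.
struct SampleFootprint {
    std::uint64_t beta_bytes;   // K x V x G doubles
    std::uint64_t theta_bytes;  // K x D x G doubles
    std::uint64_t z_bytes;      // N x G integers
};

SampleFootprint estimate_footprint(std::uint64_t K, std::uint64_t V,
                                   std::uint64_t D, std::uint64_t N,
                                   std::uint64_t G) {
    SampleFootprint f;
    f.beta_bytes = K * V * G * 8;
    f.theta_bytes = K * D * G * 8;
    f.z_bytes = N * G * 4;
    return f;
}
```

With made-up sizes K = 50, V = 20000, D = 5000, N = 500000, and G = 1000 (not the sizes of the actual run), this gives about 8 GB for betas, 2 GB for thetas, and 2 GB for Z. If the transfer to R holds a second copy of each object, the peak would be higher still.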

One solution would be to keep only the z values (the Z matrix) and compute the betas and thetas on demand, i.e., when we compute the likelihood ratios. For example, the function

compute_thetas <- function(did, Z, K, D, base.alpha.v)

in utils.R computes thetas from the stored Z matrix.

Note: This could be an issue with the way RcppArmadillo handles the _cube_ data structure and the way Rcpp transfers it to the R environment as an _array_ data structure.

Frequency: rare

clintpgeorge commented 11 years ago

I tried to avoid passing the cube data structure. However, we still need the theta and beta cubes for the likelihood ratio computation, so I added the following functions to utils.R to compute them from _Z_:

compute_thetas_betas <- function(did, wid, Z, K, D, V, base.alpha.v, base.eta)
compute_thetas <- function(did, Z, K, D, base.alpha.v)

It seems like these functions are not computationally efficient.

[TODO]: Need to find a better solution.