Hi, thanks for writing this awesome package. It really helped me grasp the idea of the collapsed Gibbs sampler. Here's my attempt to give back to it.
The current implementation of the initial assignments in LDA iterates through the document-term matrix row by row without taking its sparsity into account, which makes it very slow in some circumstances (~50 minutes for an 800,000 x 20,000 matrix). I've modified the loop to exploit the sparse structure by iterating only over the non-zero rows of each column. This gives a substantial improvement: the 800,000 x 20,000 case goes down to ~2 minutes.
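To illustrate the idea, here is a minimal, self-contained sketch in plain Python. It is not the code in this PR: the function names `dense_to_csc` and `initial_assignments` are hypothetical, the matrix is stored as a hand-rolled compressed-sparse-column triple (`indptr`, `indices`, `data`), and picking a uniformly random initial topic per token is an assumption made for the example.

```python
import random


def dense_to_csc(rows):
    """Build a CSC-style (indptr, indices, data) triple from a list of rows.

    Hypothetical helper, used only to keep the example self-contained.
    """
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    indptr, indices, data = [0], [], []
    for w in range(n_terms):
        for d in range(n_docs):
            if rows[d][w]:
                indices.append(d)        # row (document) index of the entry
                data.append(rows[d][w])  # term count stored in the entry
        indptr.append(len(indices))      # end of column w in indices/data
    return indptr, indices, data


def initial_assignments(indptr, indices, data, n_topics, seed=0):
    """Visit only the non-zero rows of each column and assign a topic per token.

    Assumption: each token gets a uniformly random initial topic.
    Returns a list of (document, term, topic) triples.
    """
    rng = random.Random(seed)
    tokens = []
    for w in range(len(indptr) - 1):
        # Entries indptr[w] .. indptr[w + 1] - 1 are the non-zero rows of column w.
        for k in range(indptr[w], indptr[w + 1]):
            d, count = indices[k], data[k]
            for _ in range(count):
                tokens.append((d, w, rng.randrange(n_topics)))
    return tokens


# Two documents, three terms; four of the six cells are never visited twice.
dtm = [[2, 0, 1],
       [0, 0, 3]]
indptr, indices, data = dense_to_csc(dtm)
tokens = initial_assignments(indptr, indices, data, n_topics=4)
```

The work in `initial_assignments` scales with the number of non-zero entries plus the number of columns, rather than with the full documents x terms product, which is where the savings on a very sparse matrix would come from.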