JuliaText / TextAnalysis.jl

Julia package for text analysis

Modify loop in initial assignments of lda to use sparse structure. #213

Closed jmoralez closed 4 years ago

jmoralez commented 4 years ago

Hi, thanks for writing this awesome package; it really helped me grasp the idea of the collapsed Gibbs sampler. Here's my attempt to give back to it.

The current implementation of the initial assignments in LDA iterates through the document-term matrix row by row without taking its sparse structure into account, which makes it very slow in some circumstances (~50 minutes for an 800,000 x 20,000 case). I've modified the loop to exploit the sparse structure by iterating over only the non-zero rows of each column, which gives a substantial improvement (the 800,000 x 20,000 case goes down to ~2 minutes).
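
To illustrate the idea, here is a minimal sketch of column-wise iteration over the stored entries of a `SparseMatrixCSC`, using `rowvals`, `nonzeros` and `nzrange` from the `SparseArrays` standard library. The function name `initial_assignments` and the two count matrices are hypothetical and only meant to show the loop shape; this is not the package's actual code. It assumes the matrix is laid out as documents x terms and holds integer counts.

```julia
using SparseArrays

# Hypothetical helper (not part of TextAnalysis.jl): assigns a random topic to
# every token of a documents x terms count matrix, touching stored entries only.
function initial_assignments(dtm::SparseMatrixCSC{<:Integer}, ntopics::Integer)
    ndocs, nterms = size(dtm)
    doc_topic = zeros(Int, ndocs, ntopics)    # tokens per (document, topic)
    topic_term = zeros(Int, ntopics, nterms)  # tokens per (topic, term)
    rows = rowvals(dtm)    # row index of each stored entry
    vals = nonzeros(dtm)   # value of each stored entry
    for term in 1:nterms
        # nzrange gives the positions in rows/vals that belong to this column
        for k in nzrange(dtm, term)
            doc = rows[k]
            for _ in 1:vals[k]          # one draw per occurrence of the term
                topic = rand(1:ntopics)
                doc_topic[doc, topic] += 1
                topic_term[topic, term] += 1
            end
        end
    end
    return doc_topic, topic_term
end

# Small example: 3 documents, 4 terms, 4 stored entries.
dtm = sparse([1, 1, 2, 3], [1, 3, 2, 3], [2, 1, 4, 3], 3, 4)
doc_topic, topic_term = initial_assignments(dtm, 2)
```

With this shape the work is proportional to the number of stored entries plus the number of tokens, rather than to documents x terms, so empty cells are never visited.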