BIDData / BIDMach

CPU and GPU-accelerated Machine Learning Library
BSD 3-Clause "New" or "Revised" License

Topic Document matrix from LDA contains uniform distribution. #132

Open manvendratomar opened 8 years ago

manvendratomar commented 8 years ago

I was training an LDA model in BIDMach with the default parameters. When I tried to retrieve the topic-document matrix using nn.datamats(1) after setting putback to 1, it gave me a matrix with a uniform distribution. Is there some parameter tuning I am missing, or am I retrieving it the wrong way? Any help would be appreciated.
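For reference, here is roughly how I set things up (a sketch of my script, run in the bidmach interactive shell where BIDMat/BIDMach are already in scope; the data path, topic count, and exact option names such as putBack are illustrative and may differ from your version):

// load a sparse term x document matrix and build the default LDA learner
val docterm = loadSMat("data/docterm.smat.lz4")
val (nn, opts) = LDA.learner(docterm)
opts.dim = 64        // number of topics
opts.npasses = 10    // passes over the data
opts.putBack = 1     // ask the learner to write per-document topic estimates back to the data source
nn.train

val topicWord = nn.modelmats(0)   // topic x word matrix: looks sensible
val docTopic  = nn.datamats(1)    // the matrix in question: comes back uniform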

akcom commented 7 years ago

I am having the same problem. While the topic/word result matrix looks reasonable, the document/topic matrix extracted via datamats(1) has a uniform distribution for all documents.

@jcanny - I tried poking around in the code, but from what I can tell it looks like this matrix would be updated in the CUDA code, which is way above my level. Any insight you can provide would be invaluable. Thank you.

akcom commented 7 years ago

As an update: I believe I've found the issue. It is the same in all of the LDA model files, but I will use LDAgibbs.scala as an example. The function uupdate(), which updates the document/topic matrix (called user in this case), contains the following lines:

def uupdate(sdata:Mat, user:Mat, ipass: Int, pos:Long):Unit = {
   //...
   val unew = user*0

   LDAgibbsv.LDAsample(mm, user, mnew, unew, preds, nsamps)

   //...

   user ~ unew + opts.alpha
}

The matrix is first set to zero, then it is updated via the LDAsample function (which ultimately calls the CUDA code) and then has alpha added.

Looking at the LDA update function in /jni/src/Samplers.cu (__LDA_Gibbs), it does not appear as though the B matrix (which corresponds to user) is being updated. Unfortunately I do not know enough about the implementation of gibbs sampling to figure out how this matrix would be updated. At least we've gotten closer to the issue :)
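In case it helps anyone following along, here is my rough understanding of what a single collapsed Gibbs sweep over one document's tokens would do on the CPU. This is only a conceptual sketch with made-up names (gibbsSweep, nwk, nk, ndk), not the actual kernel in Samplers.cu, but it shows the document/topic increment that does not seem to be applied to the B matrix:

import scala.util.Random

def gibbsSweep(
    words: Array[Int],          // word id of each token in one document
    z: Array[Int],              // current topic assignment of each token
    nwk: Array[Array[Double]],  // word x topic counts (the model side, "mm")
    nk: Array[Double],          // per-topic totals (column sums of nwk)
    ndk: Array[Double],         // per-topic counts for this document (one column of "user")
    alpha: Double, beta: Double, vocabSize: Int, rng: Random): Unit = {
  val ktop = ndk.length
  val probs = new Array[Double](ktop)
  for (i <- words.indices) {
    val w = words(i); val old = z(i)
    // take this token out of all three count arrays
    nwk(w)(old) -= 1; nk(old) -= 1; ndk(old) -= 1
    // p(z = t | rest) is proportional to (nwk(w)(t) + beta) / (nk(t) + V*beta) * (ndk(t) + alpha)
    var total = 0.0
    var t = 0
    while (t < ktop) {
      probs(t) = (nwk(w)(t) + beta) / (nk(t) + vocabSize * beta) * (ndk(t) + alpha)
      total += probs(t)
      t += 1
    }
    // draw the token's new topic from the unnormalized distribution
    var u = rng.nextDouble() * total
    var s = 0; var acc = probs(0)
    while (s < ktop - 1 && u > acc) { s += 1; acc += probs(s) }
    z(i) = s
    // put the token back under the new topic: this ndk update is the
    // document/topic side that would have to be written back to user
    nwk(w)(s) += 1; nk(s) += 1; ndk(s) += 1
  }
}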

akcom commented 7 years ago

@DanielTakeshi I'd be more than happy to work on this, but I would need a bit of guidance on how to compute the posterior topic probabilities for the training documents. If you could provide some guidance on where to start, that would be fantastic!

I'm not a statistician by any stretch of the imagination, but I'm a pretty darn proficient C/Java/Scala programmer. Is it possible to work off David Blei's original variational code, or is that completely untranslatable to the Gibbs sampler for computing the document/topic matrix?

DanielTakeshi commented 7 years ago

Hi @akcom

I would be happy to help you if I can with generic BIDMach concepts (e.g., code flow), but I'm by no means the expert, as @jcanny implemented roughly 97% of this software. =)

I also didn't write the LDA Gibbs sampler, and I don't actually know how it works intuitively (despite having gone through David Blei's C code a few years ago).

I think Huasha Zhao's PhD thesis talked a bit about the LDA Gibbs implementation, but I did not find it helpful at all.

Your best shot is probably to start a new issue here (referencing this one), and the next time I see @jcanny (today, or next week) I can nudge him to check the issues page. He checks it every now and then, but he also gets maybe a thousand emails a day and cannot check each one in detail.

akcom commented 7 years ago

Awesome, thanks for the info @DanielTakeshi! I'll start up a new issue and get the ball rolling. I guess this weekend will be spent reading up on Gibbs samplers and LDA models :)

akcom commented 7 years ago

A pull request resolves this.

@manvendratomar, the following code pulls your predictions:

// obtain the document x topic matrix
val (mm, mopts) = LDA.predictor(nn.model, s)
mm.predict

val preds = mm.preds(0)
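If it's useful, one way to turn that into per-document topic proportions (just a sketch; I'm assuming preds is a dense topics x documents matrix, that sum gives the column sums as a row vector, and that / is BIDMat's elementwise divide):

val colsums  = sum(preds)                                // 1 x ndocs row of column sums
val docTopic = preds / (ones(preds.nrows, 1) * colsums)  // each column now sums to 1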