dongwookim-ml / python-topic-model

Implementation of various topic models
Apache License 2.0

Explanations on Stochastic (Gibbs) EM implementation for sLDA #1

Closed: judaschrist closed this issue 8 years ago

judaschrist commented 8 years ago

Hi Dongwoo: I am currently looking for a Gibbs sampling estimation method for supervised LDA, and your Stochastic (Gibbs) EM for sLDA (slda_gibbs.py) is exactly what I'm looking for. I was wondering whether there are any papers or other materials that explain the math behind it, especially the matrix calculations.

Many thanks!

dongwookim-ml commented 8 years ago

Hi Lingxiao,

Unfortunately, I didn't write down a detailed derivation, but I think this thread from the topic-models mailing list and its attachment are useful for deriving the equations: https://lists.cs.princeton.edu/pipermail/topic-models/2011-February/001177.html https://lists.cs.princeton.edu/pipermail/topic-models/attachments/20110210/89b1646c/attachment-0001.pdf

judaschrist commented 8 years ago

Hi Dongwoo: Thanks for your reply! I went through the derivation and your code, but I'm still confused about two things in slda_gibbs.py.

First, shouldn't line 79 be 'z_bar /= z_bar.sum(1)[:,np.newaxis]', as in line 92? The summation here should be computed per document (doc_topic_sum[di,:]) rather than per topic, am I right?

Second, I'm not sure whether I'm understanding the derivation correctly, but shouldn't the ratio between the normal densities of the response values involve the variance sigma? It is declared on line 29 ('self.sigma = 1') but never used in the code. What happened to the 2 * sigma^2 denominator in the exponent of the normal distribution?
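For context, here is a minimal sketch of how I understand the full conditional should look when the Gaussian response term is kept explicit, including the 2 * sigma^2 denominator (every name here is illustrative, not taken from slda_gibbs.py):

```python
import numpy as np

def topic_probs(word_topic_count, topic_count, doc_topic_count,
                w, y, eta, sigma, alpha, beta):
    """Unnormalized full conditional p(z = k | rest) for one word in sLDA.

    All counts exclude the word currently being resampled.
    word_topic_count: (K, V), topic_count: (K,), doc_topic_count: (K,).
    """
    K = len(doc_topic_count)
    V = word_topic_count.shape[1]

    # Standard collapsed-LDA term: p(w | z = k) * p(z = k | d)
    lda_part = (word_topic_count[:, w] + beta) / (topic_count + V * beta) \
             * (doc_topic_count + alpha)

    # Row k of z_bar: the document's empirical topic distribution
    # if the current word were assigned to topic k.
    z_bar = np.tile(doc_topic_count.astype(float), (K, 1)) + np.identity(K)
    z_bar /= z_bar.sum(1)[:, np.newaxis]

    # Gaussian likelihood of the response y, with the variance kept.
    response_part = np.exp(-(y - np.dot(z_bar, eta)) ** 2 / (2.0 * sigma ** 2))

    p = lda_part * response_part
    return p / p.sum()
```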

dongwookim-ml commented 8 years ago
  1. It is just a matter of which dimension you sum over. If you want to change the code that way, I think you should also change line 82 from np.dot(z_bar.T,self.eta) to np.dot(z_bar,self.eta); a quick numpy check after this list illustrates the point.
  2. Every response variable has the same variance, so ignoring it in the sampling process causes no problem: the ratio will always be the same. If you want to compute the full joint distribution, you need to consider the variance, of course.
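A standalone numpy check of the dimension point (K and the arrays are made up):

```python
import numpy as np

K = 4
z_bar = np.random.rand(K, K)   # rows: candidate assignments, columns: topics
eta = np.random.rand(K)        # per-topic regression coefficients

# Summing over axis 1 normalizes each row, which pairs with np.dot(z_bar, eta):
row_normed = z_bar / z_bar.sum(1)[:, np.newaxis]
print(row_normed.sum(1))              # all ones: each row is a distribution
print(np.dot(row_normed, eta).shape)  # (4,): one predicted response per candidate

# Summing over axis 0 normalizes each column instead; whichever convention
# you pick, the dot product with eta has to use the matching orientation.
col_normed = z_bar / z_bar.sum(0)
print(col_normed.sum(0))              # all ones: each column is a distribution
```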
judaschrist commented 8 years ago

Dongwoo:

First, actually I think you're right about line 82, and also about line 93: np.dot(z_bar.T,self.eta) should be np.dot(z_bar,self.eta). But line 79 should still be z_bar /= z_bar.sum(1)[:,np.newaxis]. According to the code, z_bar is composed like this:

[count_topic1+1, count_topic2,   ..., count_topicn]
[count_topic1,   count_topic2+1, ..., count_topicn]
...
[count_topic1,   count_topic2,   ..., count_topicn+1]

So with z_bar.sum(1) you are summing over the topics in each row, which gives you the word count of the document; that is exactly what the normalization needs. z_bar.sum(0) here simply makes no sense.
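To make this concrete, here is a standalone numpy illustration of the z_bar structure described above (the counts are made up):

```python
import numpy as np

doc_topic_counts = np.array([3.0, 1.0, 2.0])   # one document's topic counts
K = len(doc_topic_counts)

# Row k: the document's topic counts if the current word were assigned topic k.
z_bar = np.tile(doc_topic_counts, (K, 1)) + np.identity(K)
# [[4. 1. 2.]
#  [3. 2. 2.]
#  [3. 1. 3.]]

print(z_bar.sum(1))   # [7. 7. 7.]  - the document's word count, same in every row
print(z_bar.sum(0))   # [10. 4. 7.] - mixes counts across candidate assignments

z_bar /= z_bar.sum(1)[:, np.newaxis]   # each row becomes a topic distribution
```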

Second, since the probability is a Gaussian with the variance inside the exponential, you cannot ignore the variance: exp(a / (2 sigma^2)) / exp(b / (2 sigma^2)) is not equal to exp(a) / exp(b).
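A quick numeric check with illustrative values shows the difference even at sigma = 1, because of the factor of 2:

```python
import numpy as np

a, b, sigma = 0.5, 2.0, 1.0

with_variance = np.exp(-a / (2 * sigma ** 2)) / np.exp(-b / (2 * sigma ** 2))
without       = np.exp(-a) / np.exp(-b)

print(with_variance)   # exp((b - a) / (2 * sigma^2)) ~= 2.117
print(without)         # exp(b - a)                   ~= 4.482
```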

dongwookim-ml commented 8 years ago

Thanks for correcting me! You are right: line 79 should be changed that way. And yes, line 82 should also be changed to np.dot(z_bar, self.eta). I think I ignored the variance because I usually set it to 1.0, but as you point out, it should be kept as well. Could you change the code and make a pull request?
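For reference, my understanding of the agreed fixes, as a rough sketch rather than a verbatim diff of slda_gibbs.py (y stands in for the document's response value):

```python
# lines 79 and 92: normalize each row of z_bar, i.e. per candidate assignment
z_bar /= z_bar.sum(1)[:, np.newaxis]

# lines 82 and 93: use z_bar rather than z_bar.T, so each row's topic
# distribution multiplies eta, and keep the Gaussian's 2 * sigma^2 term
weight = np.exp(-(y - np.dot(z_bar, self.eta)) ** 2 / (2 * self.sigma ** 2))
```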

Thank you!