ethen8181 / machine-learning

:earth_americas: machine learning tutorials (mainly in Python3)

Fix the probability formula for some topic given some words #9

Closed JiaxiangBU closed 4 years ago

JiaxiangBU commented 4 years ago

Hi @ethen8181, I read your nice introduction to Gibbs sampling and found a typo in one of the LaTeX formulas.

https://github.com/ethen8181/machine-learning/blob/1f71423da54bfde24de7528a3ef0f5c9e694f4b7/clustering_old/topic_model/LDA.Rmd#L149-L152

https://github.com/ethen8181/machine-learning/blob/1f71423da54bfde24de7528a3ef0f5c9e694f4b7/clustering_old/topic_model/LDA.Rmd#L198

Here is the issue: in the first iteration, your script randomly assigns a topic to the first word in the first document. At that point left * right holds two values, one for topic 1 and one for topic 2. Their sum is not equal to 1, so they are not probabilities; they only show the relative weight of each topic for this word. To be read as probabilities, they need to be normalized.

I double-checked the reference you list at the end of the article. The author does use \propto to show relative weights for topics.
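To make the distinction concrete, here is a small sketch of the relationship. The symbols $\text{left}_k$ and $\text{right}_k$ are shorthand (an assumption for illustration) for the two factors the script multiplies for topic $k$, and $K$ is the number of topics; the exact count-based terms are those in LDA.Rmd and are not repeated here.

```latex
% Unnormalized weight: correct with \propto, incorrect with =
P(z_i = k \mid \cdot) \;\propto\; \text{left}_k \cdot \text{right}_k

% Dividing by the sum over all topics turns the weights into probabilities
P(z_i = k \mid \cdot) \;=\;
  \frac{\text{left}_k \cdot \text{right}_k}
       {\sum_{j=1}^{K} \text{left}_j \cdot \text{right}_j}
```

With two topics, the second line simply divides each of the two left * right values by their sum.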

I have therefore opened a PR for further discussion.

ethen8181 commented 4 years ago

Thanks for the feedback.

  1. Regarding the LaTeX formula: sure, changing it from equal to proportional makes sense.
  2. For the sample weight, from an implementation standpoint, does it matter whether we normalize it? Doesn't R's sample accept probability weights that are not normalized?
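
The second point can be checked with a short experiment. The sketch below is a Python analogue, not the R code from LDA.Rmd: the weight values are hypothetical stand-ins for left * right, and `random.choices` plays the role of R's sample. It draws topics once from the raw weights and once from the normalized ones, using the same seed.

```python
import random

def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

topics = [1, 2]

# Hypothetical values of left * right for topic 1 and topic 2.
# They sum to 0.5, not 1, so they are only relative weights.
raw_weights = [0.125, 0.375]
probs = normalize(raw_weights)  # [0.25, 0.75]

# random.choices accepts relative weights and rescales them internally,
# so the same seed gives the same draws either way.
draws_raw = random.Random(0).choices(topics, weights=raw_weights, k=1000)
draws_norm = random.Random(0).choices(topics, weights=probs, k=1000)

print(probs)
print(draws_raw == draws_norm)
```

So normalization matters for how the formula is written and read, but not for the sampling step itself. The same holds in R: the documentation for sample says the prob weights need not sum to one.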
ethen8181 commented 4 years ago

nitpick:

  1. Can you add back the space between left and right, so it reads left * right instead of left*right?
  2. It would be great if you could also regenerate the .html doc from the R Markdown file.

I can merge this once you resolve these two comments. Thanks.

JiaxiangBU commented 4 years ago

I added notes and restored the space in the relative probability weight left * right, rendered LDA.Rmd, and switched to a relative path for the script LDA_functions.R for reproducibility. @ethen8181 see 2dc23e2

ethen8181 commented 4 years ago

Thanks. Merged.