angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP
Do What The F*ck You Want To Public License

Latent Dirichlet Allocation? #29

Closed. balint42 closed this issue 10 years ago.

balint42 commented 10 years ago

Please forgive me if this question seems stupid, but after playing with the LDA model I feel I understand its purpose less and less: I was under the impression it would create "topics" not limited to single words, but currently I don't see how to achieve that using it. I also want to mention that what I get as "topics" are simply the most common prepositions... exactly what you would filter out before looking for topics. I therefore assume I'm missing the point of how to use it? I would very much appreciate some advice, and thanks for the great work otherwise!

angeloskath commented 10 years ago

The LDA model assumes the following: every document is a collection of words generated by first sampling a topic from the document's distribution over topics, and then sampling a word from that topic's distribution over words. Given a set of documents, we can therefore try to infer the above-mentioned distributions.
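For concreteness, this is the standard generative story written out (the notation is the usual one from the LDA literature, not something specific to this library):

```latex
% Standard LDA generative process (Blei, Ng, Jordan, 2003).
% K topics, documents indexed by d, word positions in d indexed by n.
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta)  && \text{word distribution of topic } k \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic distribution of document } d \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the $n$-th word of } d \\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{the observed word itself}
\end{align*}
```

Inference (here via Gibbs sampling) runs this story backwards: given only the words w, it recovers plausible phi and theta.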

You can see an example usage in the corresponding test file. You can also run it using `phpunit NlpTools/Models/LdaTest.php` from within the tests folder.
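In rough outline the test does something like the following (a sketch from memory; the exact constructor arguments and the iteration count are assumptions, so check LdaTest.php for the authoritative version):

```php
<?php
// Sketch of training LDA with NlpTools; see LdaTest.php for the real code.
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Models\Lda;

// $tset is a NlpTools\Documents\TrainingSet filled with tokenized documents.
$lda = new Lda(
    new DataAsFeatures(), // turns each document's tokens into features
    5,                    // number of topics to infer
    1,                    // Dirichlet prior for the per-document topic distribution
    1                     // Dirichlet prior for the per-topic word distribution
);

$lda->train($tset, 50); // run the Gibbs sampler for 50 iterations
```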

The function you are looking for is getWordsPerTopicsProbabilities, or getPhi for short. It returns an array containing one array per topic, which in turn holds the distribution over words for that topic as computed by Gibbs sampling.
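So something like the following should print the ten most probable words per topic (assuming the per-topic arrays are keyed by word, with probabilities as values):

```php
<?php
// Print the 10 most probable words of each topic.
// Assumes getPhi() returns, per topic, an array of word => probability.
foreach ($lda->getPhi() as $topic => $wordProbs) {
    arsort($wordProbs);                           // highest probability first
    $top = array_slice($wordProbs, 0, 10, true);  // keep the word keys
    echo "Topic {$topic}: " . implode(' ', array_keys($top)) . "\n";
}
```

Note that stop words will dominate these lists unless they are filtered out beforehand, since words that are frequent everywhere get high probability in every topic.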

More specifically, from what you write above I assume you are passing 1 as the topic count parameter? Thank you for the kind words and for using the library. I hope I have been of at least some help.

balint42 commented 10 years ago

Thank you very much for your reply! I have indeed already read your tutorial, read the code, and had a quick look at the publication. I am indeed using phi and theta, i.e. getWordsPerTopicsProbabilities and getDocumentsPerTopicsProbabilities.

Now, more precisely, my question is (again, please forgive me if it is naive): how do I get the topics themselves, i.e. a description of them in the "space of words"? I understand that phi is the per-topic word probability, so let's say for topic 0 I could simply take the 10 most probable words as its description. In my tests that yields `de - la en ... a le : link an`. There are of course simple techniques for preparing texts by filtering out prepositions and certain characters (one generic approach is sketched below), but my question basically is: am I doing something wrong, or is there a way native to your library that produces more meaningful topic descriptions? I very much appreciate your work on this, thanks once again!

P.S.: I have tested different parameters for topic count and a priori assumptions.
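For illustration, the kind of text preparation I mean, in plain PHP (the stop word list here is only a toy example, and none of this is specific to the library):

```php
<?php
// Drop stop words and punctuation-only tokens before building documents.
// The stop word list is illustrative; use a complete list per language.
$stopWords = ['de', 'la', 'en', 'a', 'le', 'an', 'of', 'the'];

$tokens = ['de', 'natural', '-', 'la', 'language', ':', 'processing'];

$filtered = array_values(array_filter(
    $tokens,
    function ($t) use ($stopWords) {
        return !in_array(mb_strtolower($t), $stopWords, true)
            && preg_match('/\p{L}/u', $t); // keep only tokens containing a letter
    }
));

print_r($filtered); // natural, language, processing
```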

balint42 commented 10 years ago

Experimenting further with the LDA I ran into a more severe problem: the probabilities returned by the theta function were all equal, and therefore wrong. I think I have found the source of the error; see my pull request. According to eq. 3 in the publication, n_j^(d) is the count of how often topic j has been assigned to words of document d. Please review my changes and merge if you agree. Thanks!
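For context, whatever the exact numbering in the publication the implementation follows, the usual Gibbs sampling estimate of theta (as in Griffiths and Steyvers, 2004) is:

```latex
% Per-document topic probabilities estimated from Gibbs sampling counts.
% n_j^{(d)}: number of words in document d currently assigned to topic j,
% K: number of topics, \alpha: the Dirichlet prior on theta.
\hat{\theta}_j^{(d)} = \frac{n_j^{(d)} + \alpha}{\sum_{k=1}^{K} n_k^{(d)} + K\alpha}
```

If the counts n_j^(d) are never updated, the estimate collapses to 1/K for every topic, which matches the all-equal probabilities described above.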