Closed: hosseinfani closed this issue 2 years ago
@soroush-ziaeinejad I explained this task to @farinamhz . She is going to implement this for us. Please help her in this regard.
@hosseinfani
Awesome! Sure.
@farinamhz Welcome to the club!
Thank you so much @soroush-ziaeinejad I will start and share my progress on this task soon. @hosseinfani
Topic modeling is an unsupervised method for classifying documents. More specifically, it helps us better understand a collection of documents and also get better search results over them. For example, if we have a document related to health and insurance, that document will probably be among the results of a query about those topics, because it covers them. We have two principles here: first, every topic is a mixture of words; second, every document is a mixture of topics.
LDA stands for Latent Dirichlet Allocation, a statistical model for topic modeling. LDA assumes documents are produced from a mixture of topics, and that the topics generate words based on their probabilities. To be more specific, LDA works backwards from the observed documents to infer the topics that could have generated them. In fact, it learns the relationships between words, topics, and documents.
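The generative story that LDA assumes (topics are mixtures of words, documents are mixtures of topics) can be sketched in a few lines. This is only a toy illustration with NumPy; the vocabulary, topic count, and seed are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["health", "insurance", "claim", "python", "code", "compiler"]
n_topics = 2

# Every topic is a mixture (probability distribution) over words...
topic_word = rng.dirichlet(np.ones(len(vocab)), size=n_topics)
# ...and every document is a mixture over topics.
doc_topic = rng.dirichlet(np.ones(n_topics))

# Generate a 10-word document: pick a topic per word, then a word from that topic.
doc = []
for _ in range(10):
    z = rng.choice(n_topics, p=doc_topic)       # choose a topic
    w = rng.choice(len(vocab), p=topic_word[z])  # choose a word from it
    doc.append(vocab[w])
print(doc)
```

Inference in LDA is exactly this process run in reverse: given only the generated words, recover plausible `topic_word` and `doc_topic` distributions.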
There are 2 parts in LDA:
Result of LDA:
LDA converts this Document-Word matrix into two other matrices: Document-Topic matrix and Topic-Word matrix as shown below:
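That factorization of the Document-Word matrix into Document-Topic and Topic-Word matrices can be seen directly with scikit-learn's `LatentDirichletAllocation`. A sketch on a made-up four-document corpus with two topics:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "health insurance coverage claim",
    "insurance policy health premium",
    "python programming language code",
    "code compiler programming bug",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # Document-Word matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)              # Document-Topic matrix: (n_docs, 2)
topic_word = lda.components_                  # Topic-Word matrix: (2, n_words)

print(doc_topic.shape, topic_word.shape)
```

Each row of `doc_topic` is a document's mixture over the two topics (the rows sum to 1), and each row of `topic_word` scores how strongly a topic is associated with each word in the vocabulary.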
A good explanation from a blog:
LDA states that each document in a corpus is a combination of a fixed number of topics. A topic has a probability of generating various words, where the words are all the observed words in the corpus. These 'hidden' topics are then surfaced based on the likelihood of word co-occurrence. Formally, this is a Bayesian inference problem.
@farinamhz this is good.
@hosseinfani
Yes, we can do that and I have added this paper to my reading list because it seems that they have found each user’s interesting topics using LDA. Link to paper
I am still working on this topic.
Also, I was looking for an example of Twitter LDA and found an interesting explanation of LDA in a readme of a repository on GitHub. This is the repository link: https://github.com/t3abdulg/Twitter-Topic-Modelling I copied the explanation below for myself and also because I think it may help others in the future too:
Assume we had only 3 topics in the world, and that each topic is represented by a collection of probabilities of words that belong to them
Now if we had an article about programming: "Python is my favorite programming language. It is a dynamically typed language, whereas C++ is static and as a result you run into a lot more errors. However, pointers and references are life so I kinda like C++ a lot too"
As a Human, it is very obvious that this article is about Topic 1 (or programming), isn't it?
Well if we scrambled the text: "It errors. a lot programming However, and references a too static and typed kinda you is into pointers a more Python result run are is so I whereas like language. C++ life C++ a language, my dynamically as lot favorite is"
As a Human, we can still kind of tell what the text is about can't we?
LDA (Latent Dirichlet Allocation) works in a very similar way. We feed it a bag of words which it assumes to be related (syntax, order, punctuation all don't matter!), and it tries to infer what topics might have generated these bags of words.
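The "order doesn't matter" point above is easy to check: once text goes through a bag-of-words vectorizer, a sentence and its scrambled version become identical. A sketch with scikit-learn's `CountVectorizer` (the sentence and seed here are arbitrary):

```python
import random
from sklearn.feature_extraction.text import CountVectorizer

original = ("Python is my favorite programming language. It is a dynamically "
            "typed language, whereas C++ is static and you run into more errors.")

# Scramble the word order, as in the example above.
words = original.split()
random.seed(7)
random.shuffle(words)
scrambled = " ".join(words)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([original, scrambled])

# The two bag-of-words rows are identical: order and punctuation are gone.
assert (X.toarray()[0] == X.toarray()[1]).all()
```

Since LDA only ever sees these count vectors, it receives exactly the same input for both versions of the text.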
@hosseinfani @soroush-ziaeinejad My task is done. However, I did not know how to link the commits to this issue, so I pushed them, and only then realized that I should have added this issue's # to their messages before committing. Is there any way to change the messages of all of them? (There are 7 commits, so I can't just amend the last one.) Or do I need to reset the head and commit them again from the beginning?
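For future reference, older commit messages can be rewritten with an interactive rebase. A sketch (assumptions: bash, git, and GNU sed available; `(#NNN)` is a placeholder for the real issue number). In practice you would run `git rebase -i HEAD~7` and change each `pick` to `reword` by hand; here the editors are scripted so the demo runs end to end on a throwaway repo:

```shell
# WARNING: rebasing rewrites history; already-pushed branches need
# `git push --force-with-lease` afterwards.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

echo one  > file.txt; git add file.txt; git commit -qm "first change"
echo two >> file.txt; git commit -qam "second change"

# Mark every commit as "reword", then append the issue number to each message.
GIT_SEQUENCE_EDITOR='sed -i "s/^pick/reword/"' \
GIT_EDITOR='sed -i "1s/$/ (#NNN)/"' \
git rebase -i --root

git log --format=%s
```

Note that the issue number is appended rather than prepended: a commit-message line that starts with `#` is treated as a comment and stripped by git.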
@farinamhz thank you. I will review your code and let you know my comments. don't worry about the commit. next time attach the issue#.
@farinamhz I reviewed the code. It looks readable. good job. hopefully, this was a good learning experience about topic modeling.
please do the following:
let me know when you're done so I can test the new feature by the colab script.
@farinamhz @soroush-ziaeinejad I made some code changes to the TopicModeling.py statically and left some questions inside the code. Have a look, fix, and test after.
@farinamhz @soroush-ziaeinejad I believe we can close this issue. If any bugs are raised in the future, either create a new bug issue or attach this issue to the push.
We need to add a topic modeling method that is specifically designed for short texts, like TwitterLDA.