fani-lab / SEERa

A framework to predict the future user communities in a text streaming social network based on the users’ topics of interest.

Adding TwitterLDA or any topic modeling for short texts #40

Closed hosseinfani closed 1 year ago

hosseinfani commented 1 year ago

We need to add a topic modeling method that is specifically designed for short texts, like TwitterLDA.

hosseinfani commented 1 year ago

@soroush-ziaeinejad I explained this task to @farinamhz . She is going to implement this for us. Please help her in this regard.

soroush-ziaeinejad commented 1 year ago

@hosseinfani

Awesome! Sure.

@farinamhz Welcome to the club!

farinamhz commented 1 year ago

Thank you so much, @soroush-ziaeinejad! I will start and share my progress on this task soon. @hosseinfani

farinamhz commented 1 year ago

TO BE CONTINUED (It is just a draft)

Topic Modeling

Topic modeling is an unsupervised method for classifying documents. More specifically, it helps us better understand a collection of documents and get better search results over them. For example, if we have a document about health and insurance, a query on those topics will probably return that document, because it covers them. Two principles apply here: first, every topic is a mixture of words; second, every document is a mixture of topics.

What is LDA?

LDA stands for Latent Dirichlet Allocation, a statistical model for topic modeling. LDA assumes documents are produced from a mixture of topics, and each topic generates words according to its word probabilities. To be more specific, LDA works backwards from the observed documents to infer the topics that could have generated them; in doing so, it learns the relationships between words, topics, and documents.
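To make that generative story concrete, here is a toy sketch of it in Python (illustrative only, not SEERa code; the topics, words, and probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up topics: each topic is a probability distribution over words.
topics = {
    "programming": {"python": 0.5, "c++": 0.3, "code": 0.2},
    "animals":     {"panda": 0.6, "bear": 0.4},
}

def generate_document(topic_mixture, n_words=8):
    """LDA's assumed story: for each word, pick a topic from the document's
    topic mixture, then pick a word from that topic's word distribution."""
    names = list(topic_mixture)
    doc = []
    for _ in range(n_words):
        t = rng.choice(names, p=[topic_mixture[n] for n in names])
        words = list(topics[t])
        doc.append(rng.choice(words, p=[topics[t][w] for w in words]))
    return doc

# A document that is 70% programming and 30% animals:
print(generate_document({"programming": 0.7, "animals": 0.3}))
```

This forward process is exactly what LDA inverts during inference: given only the documents, it estimates the topic mixtures and word distributions that could have produced them.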

How does LDA work?

There are two parts in LDA:

  1. The words that belong to a document (which we already know)
  2. The words that belong to a topic, i.e., the probability of each word belonging to a topic (which we need to calculate)

Results of LDA:

  1. Which words are more likely to be connected to specific topics? (Topic by Word Matrix)
  2. Which topics are more likely to be connected to specific documents? (Document by Topic Matrix)

LDA takes the Document-Word matrix (word counts per document) and converts it into two other matrices, the Document-Topic matrix and the Topic-Word matrix, as shown below:

[Figure: the Document-Word matrix factored into a Document-Topic matrix and a Topic-Word matrix]
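As a quick sketch of how these two matrices look in code, gensim's LdaModel exposes both (the toy corpus and num_topics here are made up; this is not how SEERa wires it up):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "python c++ programming language code".split(),
    "panda bear zoo animal wildlife".split(),
    "dota2 league video game player".split(),
]
dictionary = Dictionary(docs)                    # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # Document-Word counts as bags of words

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=42)

print(lda.get_topics().shape)                    # Topic-Word matrix: (num_topics, vocab_size)
for bow in corpus:
    print(lda.get_document_topics(bow))          # Document-Topic mixture for each document
```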

A good explanation from a blog:

LDA states that each document in a corpus is a combination of a fixed number of topics. A topic has a probability of generating various words, where the words are all the observed words in the corpus. These ‘hidden’ topics are then surfaced based on the likelihood of word co-occurrence. Formally, this is a Bayesian inference problem.
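For reference, the Bayesian inference problem (in the notation of Blei et al.'s LDA paper) is to compute the posterior over the hidden topic structure given the observed words:

$$
p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}
$$

where $\theta$ is the document's topic mixture, $z$ are the per-word topic assignments, $w$ are the observed words, and $\alpha$, $\beta$ are the model parameters. The denominator is intractable, which is why LDA is trained with approximate inference such as variational methods or Gibbs sampling.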

hosseinfani commented 1 year ago

@farinamhz this is good.

farinamhz commented 1 year ago

@hosseinfani

  1. Yes, we can do that, and I have added this paper to my reading list because it seems they found each user's topics of interest using LDA. Link to paper

  2. I am still working on this topic.

farinamhz commented 1 year ago

Also, I was looking for an example of Twitter LDA and found an interesting explanation of LDA in the README of a repository on GitHub (https://github.com/t3abdulg/Twitter-Topic-Modelling). I copied the explanation below for myself, and also because I think it may help others in the future:

Assume we had only 3 topics in the world, and that each topic is represented by a collection of probabilities of words that belong to it:

  1. Programming (0.10*Python + 0.10*C++ + 0.10*C + ... = 1)
  2. Animals (0.10*Panda + 0.10*Bear + ... = 1)
  3. Video Games (0.10*Dota2 + 0.10*LeagueofLegends + ... = 1)

Now if we had an article about programming: "Python is my favorite programming language. It is a dynamically typed language, whereas C++ is static and as a result you run into a lot more errors. However, pointers and references are life so I kinda like C++ a lot too"

As a human, it is very obvious that this article is about Topic 1 (Programming), isn't it?

Well if we scrambled the text: "It errors. a lot programming However, and references a too static and typed kinda you is into pointers a more Python result run are is so I whereas like language. C++ life C++ a language, my dynamically as lot favorite is"

As humans, we can still kind of tell what the text is about, can't we?

LDA (Latent Dirichlet Allocation) works in a very similar way. We feed it a bag of words, which it assumes to be related (syntax, order, and punctuation all don't matter!), and it tries to infer what the topics might have been that generated this bag of words.
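A tiny check of that point (my own sketch, not from the quoted repository): once order and punctuation are thrown away, the original and scrambled texts are the same bag of words, i.e., the same input to LDA:

```python
import re
from collections import Counter

original  = "Python is my favorite programming language"
scrambled = "language favorite my is Python programming"

def bag_of_words(text):
    # lowercase and split into word tokens; order and punctuation are discarded
    return Counter(re.findall(r"[a-z+#]+", text.lower()))

print(bag_of_words(original) == bag_of_words(scrambled))  # True: identical input to LDA
```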

farinamhz commented 1 year ago

@hosseinfani @soroush-ziaeinejad My task is done. However, I did not know how to link the commits to this issue, so I pushed them first and only then realized I should have added this issue's # to their messages before committing. Now, is there any way to change the messages of all of them? (There are 7 commits, and I cannot just change the last one.) Or do I need to reset the head and commit them again from the beginning?

hosseinfani commented 1 year ago

@farinamhz thank you. I will review your code and let you know my comments. Don't worry about the commits; next time, attach the issue # to the commit message.

hosseinfani commented 1 year ago

@farinamhz I reviewed the code. It looks readable. Good job. Hopefully, this was a good learning experience with topic modeling.

please do the following:

Let me know when you're done so I can test the new feature with the Colab script.

hosseinfani commented 1 year ago

@farinamhz @soroush-ziaeinejad I made some changes to TopicModeling.py statically (without running the code) and left some questions inside. Have a look, fix them, and test afterward.

hosseinfani commented 1 year ago

@farinamhz @soroush-ziaeinejad I believe we can close this issue. If any bugs are raised in the future, either create a new bug issue or attach this issue to the push.