Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Discovering Higher-Level Patterns - Blei 2012 #31

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

Blei, David. 2012. “Probabilistic Topic Models.” Communications of the ACM 55(4):77-84.

katykoenig commented 4 years ago

When describing the LDA algorithm, Blei notes that it corresponds to p(w_{d,n} | z_{d,n}, β_{1:K}), where z_{d,n} is the topic assignment for the nth word in document d and w_{d,n} is the nth word in document d. Does this assume that all words (minus stop words) must belong to one topic?
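
In the notation of the paper, every token position gets its own topic assignment. A minimal sketch of the generative story with a made-up two-topic model (the vocabulary, topics, and proportions are all hypothetical, not from the paper):

```python
import random

random.seed(0)

# Hypothetical two-topic model over a tiny vocabulary (illustration only).
topics = {
    0: {"gene": 0.5, "dna": 0.4, "life": 0.1},    # a "genetics" topic
    1: {"data": 0.5, "model": 0.4, "life": 0.1},  # an "analysis" topic
}
theta_d = [0.7, 0.3]  # document d's topic proportions

def draw(dist):
    """Draw one item from a {item: probability} distribution."""
    items, probs = zip(*dist.items())
    return random.choices(items, weights=probs, k=1)[0]

# Generate a 6-word document: each position n first gets a topic
# assignment z_{d,n}, then the word w_{d,n} is drawn from that topic.
z = [random.choices([0, 1], weights=theta_d, k=1)[0] for _ in range(6)]
w = [draw(topics[z_n]) for z_n in z]
print(list(zip(z, w)))  # every modeled token is paired with a topic
```

So within the model, yes: every token that is modeled at all carries exactly one topic assignment; stop words are typically just removed before fitting.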

laurenjli commented 4 years ago

I'm confused about the practical implementation of topic modeling. What does the actual output of the generative model look like if it's defining a joint probability? How do we transform it into an actual topic structure?
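
To make the output concrete: a fitted topic model boils down to two sets of distributions, per-document topic proportions and per-topic word distributions, which posterior inference recovers from the words alone. A self-contained toy sketch using collapsed Gibbs sampling (toy corpus and code are my own illustration, not from the paper):

```python
import random
from collections import defaultdict

random.seed(1)

# Toy corpus (hypothetical): two themes, "biology" and "computing".
docs = [
    "gene dna gene cell dna".split(),
    "data model data compute model".split(),
    "gene cell dna dna cell".split(),
    "model data compute compute data".split(),
]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Initialize each token's topic assignment at random, tracking counts.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]               # doc-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # topic totals
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]
        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

# Collapsed Gibbs sampling: resample each token's topic given all others.
for _ in range(200):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights, k=1)[0]
            z[d][n] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

# The "actual output": per-document topic proportions (theta) and
# per-topic word distributions (phi), read off from the final counts.
theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
phi = [{w: (nkw[k][w] + beta) / (nk[k] + V * beta) for w in vocab} for k in range(K)]
for k in range(K):
    top = sorted(phi[k], key=phi[k].get, reverse=True)[:3]
    print(f"topic {k}: {top}")
```

The joint probability defines the model; sampling (or variational inference) is the step that turns it into the topic structure you actually read.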

di-Tong commented 4 years ago

In terms of incorporating metadata, both dynamic topic modeling and structural topic modeling (with temporal information) can be used to track temporal changes in topics and topic structures. What are the differences between these methods, and which one is likely to perform better?

Besides, there are quite a few popular evaluation metrics for topic modeling, such as perplexity score, topic coherence, etc. How can we choose between different evaluation metrics and methods?
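
For what it's worth, some of these metrics are simple enough to compute by hand. A sketch of UMass coherence, which scores a topic's top words by how often they co-occur in documents (the toy corpus and word lists are made up):

```python
import math

# Toy corpus: each document as a set of its words (illustration only).
docs = [
    {"gene", "dna", "cell"},
    {"gene", "dna"},
    {"data", "model", "gene"},
    {"data", "model"},
]

def umass_coherence(top_words, docs):
    """UMass coherence: sum over ordered pairs of top words of the log
    of the smoothed co-document frequency over the document frequency."""
    def D(*ws):
        return sum(1 for d in docs if all(w in d for w in ws))
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((D(top_words[i], top_words[j]) + 1) / D(top_words[j]))
    return score

coherent = umass_coherence(["gene", "dna", "cell"], docs)    # words co-occur
incoherent = umass_coherence(["gene", "model", "cell"], docs)
print(coherent, incoherent)  # the coherent word set scores higher
```

Higher (less negative) is better; perplexity, by contrast, needs held-out likelihood under the fitted model, which is why the two metrics can disagree.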

lkcao commented 4 years ago

Can we say that two words belonging to the same topic means they often appear together in one 'word bag' and rarely appear alone? What is the difference between two words that are in different topic distributions and two words that are in the same topic distribution but have relatively low co-occurrence? How does the computer distinguish between these two situations?

tzkli commented 4 years ago

Hi @katykoenig , I guess the answer is no. The probabilistic model asks, given the presence of these words, what's the probability of this document belonging to this particular topic (distribution)? It's a reversal of this question: Given this topic structure, what's the probability of a word showing up in this particular document?
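
The "reversal" here is just Bayes' rule applied at the level of a single token; with made-up numbers:

```python
# Hypothetical: document d has topic proportions theta, and each topic
# has some probability of emitting the word "gene" (all values invented).
theta = {"genetics": 0.7, "computing": 0.3}                 # p(topic | doc)
p_word_given_topic = {"genetics": 0.4, "computing": 0.01}   # p("gene" | topic)

# Generative direction: probability that the next word in d is "gene".
p_word = sum(theta[k] * p_word_given_topic[k] for k in theta)

# Inferential direction (the "reversal"): given that we observed "gene",
# the posterior probability over which topic produced it.
posterior = {k: theta[k] * p_word_given_topic[k] / p_word for k in theta}
print(round(p_word, 3), {k: round(v, 3) for k, v in posterior.items()})
```

Real LDA inference does this jointly over all tokens and documents at once, which is why it needs approximate methods, but the direction of the computation is the same.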

I'm wondering how the dynamic topic model works - it seems very useful if we're interested in the temporal variation in the topic structure of our corpus. The author merely touches upon it in the paper.

deblnia commented 4 years ago

LDA seems like a really lovely model for inferring topics from unstructured text, as are the many other dynamic models Blei and Lafferty worked on. More elegance doesn't necessarily equate to more usefulness, though: what is the benefit of using a dynamic/probabilistic model like LDA as opposed to a simple word-count scheme like tf-idf? Could we also elaborate on the research contexts in which other dynamic models are particularly useful?

bjcliang-uchi commented 4 years ago

Such a helpful paper! My question is about loosening the LDA "bag of words" assumption: after identifying topics, is it possible to detect each article's attitude toward its topic? For example, once we know an article is about abortion, is it possible to know whether it is pro-choice or pro-life?

rachel-ker commented 4 years ago

Since the topic distribution seems to be assigned based on each word, how does this model handle words with multiple meanings? Does it read them in the context of the other words in the document? What about words with little substantive meaning for topical reference; are they ignored? Also, in LDA, the number of topics is assumed known and fixed across documents. I was wondering whether there are any best practices for choosing the number of topics, e.g., in proportion to document size. Or should we do this iteratively to check robustness?

sunying2018 commented 4 years ago

I have a similar question about model evaluation and checking as @di-Tong. Since there is a disconnect between how topic models are evaluated and why we expect them to be useful, and since there are different evaluation measures, how can we conduct model selection across measures? And if we use multiple measures, how do we choose a model when their results conflict?

skanthan95 commented 4 years ago

When working on the week 5 notebook, I saw groups of nodes naturally cluster by topic (in my dataset, participants were sorted into one of three different conditions, where they discussed a different topic in each one: GMOs, college admissions, or legal drinking age). What's the relationship between network analysis and LDA in this context, and how can they be used together to make inferences about the relationships between words and documents in a corpus?

I was also wondering how topic modeling takes word context into account (sarcasm, etc.). In a different class, we ran LDA on a music dataset, and some of the topic clusters that emerged were very unintuitive (Sean Paul and Michael Jackson songs showing up in the same cluster). Given that LDA is unsupervised, how do we interpret "confusing" clusters like this (i.e., understand how/why it sorted that way?)

luxin-tian commented 4 years ago

I am also curious about the question raised by @rachel-ker: how does the algorithm address the potentially different meanings of words? The article claims that "topic modeling can be adapted to many kinds of data. Among other applications, they have been used to find patterns in genetic data, images, and social networks". However, in the fundamental application of discovering the themes that run through the words, problems may occur if some words are borrowed by one topic from another. Even though common words are excluded, to what extent can the model capture such a problem in practice?

wunicoleshuhui commented 4 years ago

I'm also quite confused about the problem of similar words appearing in different topics' word distributions. If the top words of different topics modeled with the "bag of words" approach are largely similar, how do we know what specifically distinguishes one topic from another?

gracefulghost31 commented 4 years ago

One of the main goals of topic modeling is "discovering and exploiting the hidden thematic structure in large archives of text." If the umbrella topic of the topics to be discovered is selected before modeling, how do we mitigate selection bias?

YanjieZhou commented 4 years ago

I am personally very curious about the details of the algorithm, which, as the paper suggests, delves into the complex patterns of the text. But when I deal with tons of texts, I find it particularly hard to apply content analysis for deeper meanings beyond simple emotional classifications. I am wondering how this could be implemented and what the key factor is in investigating processed tokens rather than whole sentences.

rkcatipon commented 4 years ago

When considering this reading on Topic Modeling and the previous two chapters on flat and HAC clustering, can anyone expound on what the main differences are between Topic Modeling and K-means clustering? I know they are both unsupervised machine learning models, but where do they diverge and when is it more appropriate to use one over another?
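
One concrete difference: K-means assigns each document to exactly one cluster, while a topic model gives each document mixed membership across topics. A toy illustration of the two output shapes (all values invented):

```python
# Hypothetical outputs for the same three documents.

# K-means: hard assignment -- one cluster label per document.
kmeans_labels = {"doc1": 0, "doc2": 1, "doc3": 0}

# Topic model: mixed membership -- a distribution over topics per document.
lda_theta = {
    "doc1": [0.9, 0.1],   # almost entirely topic 0
    "doc2": [0.2, 0.8],   # mostly topic 1
    "doc3": [0.5, 0.5],   # genuinely mixed: K-means must pick a side
}

# A hard clustering can be recovered from LDA by taking the argmax,
# but the mixture information (e.g. doc3's ambivalence) is lost.
hard_from_lda = {d: max(range(2), key=lambda k: theta[k])
                 for d, theta in lda_theta.items()}
print(hard_from_lda)
```

So topic modeling is the more natural choice when documents plausibly blend several themes, and K-means when one label per document is all you need.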

arun-131293 commented 4 years ago

In topic modelling, the number of topics is chosen beforehand. Although we can give it a range of numbers and run topic modelling for each, what does it mean to find the ideal number of topics? After all, deciding how many topics slice the text into internally coherent but disparate strands is an inherently subjective process. Determining the "name" of the bag of words that constitutes one topic ("data analysis", "genetics") is similarly subjective and might not be obvious. Do we rely heavily on subjective interpretation in topic modelling?

acmelamed commented 4 years ago

The information in this article on topic modeling helped a lot to expand my understanding of the method, especially regarding interpretation. I am curious, though, about one of the assumptions of LDA listed in the article -- the "bag of words" assumption -- and why the author insists that it is unrealistic given the progress of Griffiths et al. and Wallach toward relaxing it.

alakira commented 4 years ago

It was very exciting to read this article! As @bjcliang-uchi stated above, I have a similar question about its restricted focus on counting words. Is there any way to combine LDA with other methods such as POS analysis?

HaoxuanXu commented 4 years ago

This is potentially a very important method for allowing more effective querying of relevant articles and materials. I'd love to know whether LDA can discern similar topic words as belonging to the same theme or to different themes.

kdaej commented 4 years ago

The article mentions many possibilities for topic modeling in different fields such as history, sociology, linguistics, political science, legal studies, comparative literature, and many others. Thinking about the potential of this method for these disciplines, I came to think about the implications of using topic modeling for categorizing interdisciplinary discourses. Recently, there has been a lot of effort to combine different fields. For instance, behavioral economics brings together two different discourses, psychology and economics. If some articles on behavioral economics are fitted with a topic model, should they show about fifty percent of a psychology topic and fifty percent of an economics topic? More broadly, can humans agree with the topics generated by this model?

ziwnchen commented 4 years ago

I'm interested in learning more about the "zoom-in/out" example the author describes at the beginning of the article. One problem of topic modeling is that it can be hard to name the topics discovered by the unsupervised method. But if we cannot name them, how could we link topics generated from different corpora, especially if those topics are generated by different models? The author also briefly mentions that Bayesian topic models can be extended to build a hierarchy of topics. Such a hierarchy feels like a pyramid, a discrete version of the "zoom in" story the author tells. But I'm wondering whether different sub-topics in such a hierarchical topic tree are really comparable. For example, if Chinese foreign policy is an immediate child of foreign policy, is it comparable to football (an immediate child of sports)?

sanittawan commented 4 years ago

I am new to topic models in general, but I am curious whether there are any specific situations or data where topic models won't work, i.e., break down. What if you work on data where the literature tells you there should be roughly 10 topics, but after you run a topic model with 10 topics, you find a lot of overlapping topics, which may suggest that the actual number of topics is smaller?

By the way, does it matter how we process the data, e.g., stemming or lemmatizing words?
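
It can matter: stemming or lemmatizing shrinks the vocabulary so that counts pool across inflected forms before the model ever sees them. A crude suffix-stripping sketch (illustration only; a real pipeline would use a proper stemmer or lemmatizer such as NLTK's):

```python
def crude_stem(word):
    """Naive suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = ["model", "models", "modeling", "modeled", "topic", "topics"]
raw_vocab = set(tokens)
stemmed_vocab = {crude_stem(t) for t in tokens}
print(sorted(stemmed_vocab))  # all "model*" forms now share one count
```

A six-word vocabulary collapses to two types, so the topic-word counts become denser; the trade-off is that distinctions like "modeling" vs. "model" are erased.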

chun-hu commented 4 years ago

In my experience with LDA, the model does not work well on documents that have short sentences and that do not share a similar theme. The topics generated are sometimes not intuitive, and there is no clear pattern within each topic. I'm wondering whether there is a hyperparameter tuning process for the algorithm, or whether we should just turn to other methods.

cytwill commented 4 years ago

This paper mentions some basic ideas about LDA and topic modeling. According to the author's explanation, the results of LDA are posterior distributions of words within topics. So if some words have high probabilities of being associated with more than one topic, how should we visualize this situation? Also, sometimes we have pre-defined topics of interest to explore, but some of these are not in the corpora (though they intuitively have corresponding words in the documents); are there any solutions for getting a reliable prior distribution of the topics in a document? Lastly, I hope to see examples where the final topic probabilities have been used as features in other machine learning tasks, and their advantages/disadvantages as features in different tasks.

Lizfeng commented 4 years ago

Topic modeling is a powerful tool, as it does not require any prior annotation or labeling of the documents; the topics emerge from the analysis of the original text. The challenge is to develop an efficient method to approximate the posterior distribution. After reading this article, I would like to know more about the circumstances under which we should prefer sampling-based algorithms, and those under which we should prefer variational algorithms.

VivianQian19 commented 4 years ago

After reading the article, I wonder whether topic modeling is mostly used when the corpus is too large for human labeling to be feasible. And how do we choose between topic modeling and classification models?