UChicago-CCA-2021 / Readings-Responses


Discovering higher-level Patterns - Fundamentals #27

Open HyunkuKwon opened 3 years ago

HyunkuKwon commented 3 years ago

Post questions here for one or more of our fundamentals readings:

Manning, Christopher, Prabhakar Raghavan and Hinrich Schütze. 2008. “Flat Clustering” and “Hierarchical Clustering.” Chapters 16 and 17 from Introduction to Information Retrieval.

Blei, David. 2012. “Probabilistic Topic Models”. Communications of the ACM 55(4):77-84.

RobertoBarrosoLuque commented 3 years ago

I have a two-pronged question. First, what is the state of the research on automating labels for topic modeling? That is, is there any way to create labels for topics without requiring a human to generate them? Second, are there topic modeling algorithms that don't require the number of topics a priori, but rather infer the number of topics from the dataset itself?

hesongrun commented 3 years ago

The idea of applying the EM algorithm to extract latent topics from a large corpus is really clever! It taps into the fact that many texts share common themes, which can be extracted from the collocation of words within a document. I have three questions:

  1. The EM algorithm is not guaranteed to reach a global optimum, and different runs with different initializations may yield different results. How should we report this in academic writing? Is there a systematic way to tune the model so that the resulting topics are transparent and highly interpretable?
  2. My second question is about determining the optimal number of topics. From this week's lecture, I learned several ways to choose K: for example, relying on an information criterion like BIC, or using cross-validation to evaluate the likelihood of the model on held-out data. What if these measures are not consistent with each other? Which should we trust the most?
  3. My third question is about labeling the topics. How do we justify our labeling in general? Using the top words may be reasonable, but I was thinking about some kind of tf-idf approach: the words most diagnostic of a topic may be those that do not occur commonly across documents.

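
The tf-idf intuition in the third question is close to the "relevance" re-ranking popularized by the LDAvis tool: blend a word's probability under the topic with its lift over its corpus-wide probability. A minimal numpy sketch; the topic-word matrix, vocabulary, and λ values below are made-up illustrative numbers, not output from a fitted model:

```python
import numpy as np

# Hypothetical topic-word probability matrix: 2 topics over a 4-word vocabulary.
# Rows sum to 1; these numbers are illustrative only.
topic_word = np.array([
    [0.50, 0.30, 0.15, 0.05],   # topic 0
    [0.45, 0.05, 0.10, 0.40],   # topic 1
])
vocab = ["the", "farm", "soil", "neuron"]

# Marginal word probability across topics (assuming equal topic weights here).
p_word = topic_word.mean(axis=0)

def top_words(topic, lam=0.6, n=2):
    # LDAvis-style relevance: lam * log p(w|t) + (1-lam) * log(p(w|t) / p(w)).
    # lam = 1 ranks by raw probability; lam < 1 penalizes words common everywhere.
    relevance = lam * np.log(topic_word[topic]) \
        + (1 - lam) * np.log(topic_word[topic] / p_word)
    return [vocab[i] for i in np.argsort(relevance)[::-1][:n]]

print(top_words(0, lam=1.0))  # ranks purely by p(w|t): frequent words first
print(top_words(0, lam=0.3))  # down-weights words that are common in every topic
```

With a low λ, the globally common word "the" drops in the ranking even though it has the highest raw probability in topic 0.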
chiayunc commented 3 years ago

I am curious about LDA's performance on a corpus with a relatively small vocabulary. In the example, the corpus considered covers very diverse topics. If we are looking at a highly concentrated field, where the vocabulary is quite small but is articulated/mediated in different ways (hence with different foci/topics), would LDA be an ideal way to perform topic modeling? Or would topic modeling not be an ideal way to find the nuances between documents in this case?

jinfei1125 commented 3 years ago

Hi, though this has been introduced in the lecture and this week's fundamentals reading, I still don't quite understand the following generative process for topic modeling:

Step 1: Randomly choose a distribution over topics.
Step 2: For each word in the document:
  a. Randomly choose a topic from the distribution over topics in step 1.
  b. Randomly choose a word from the corresponding distribution over the vocabulary.

Can you give some further explanation? How do we "choose a distribution over topics," how do we "choose a topic from the distribution over topics," and how do we "choose a word from the corresponding distribution over the vocabulary"?
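
For concreteness, the generative process above can be sketched with numpy. The vocabulary, topic-word matrix, and Dirichlet parameter below are made-up illustrative values; "choosing a distribution over topics" means drawing it from a Dirichlet prior, and the other two choices are draws from categorical distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["farm", "tractor", "pasta", "sauce"]
# Hypothetical word distributions for 2 topics (each row sums to 1).
topic_word = np.array([
    [0.60, 0.35, 0.03, 0.02],   # a "farming" topic
    [0.02, 0.03, 0.55, 0.40],   # a "cooking" topic
])

# Step 1: draw this document's distribution over topics from a Dirichlet prior.
alpha = np.array([0.5, 0.5])       # hyperparameter controlling sparsity
doc_topics = rng.dirichlet(alpha)  # a point on the 2-topic simplex

doc = []
for _ in range(10):                     # Step 2: for each word position...
    z = rng.choice(2, p=doc_topics)     # 2a: choose a topic for this word
    w = rng.choice(4, p=topic_word[z])  # 2b: choose a word from that topic
    doc.append(vocab[w])

print(doc)
```

Running this repeatedly produces documents that mix the two topics in different proportions, which is exactly the "bag of words from a mixture of topics" picture Blei describes; inference (e.g. via EM or Gibbs sampling) reverses this process to recover `topic_word` and `doc_topics` from observed documents.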

k-partha commented 3 years ago

1. Are there any metrics to define the optimal number of topics for LDA?
2. How does LDA hold up in a world where a bag-of-words model is considered unrealistic compared to more recent NLP approaches that go beyond even n-grams? Are there any input transformations that are considered to improve LDA analyses (including low-level n-grams in the mix, converting the words to vector-space representations, etc.)?
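
One common heuristic for the first question is to compare held-out perplexity across candidate values of K (topic-coherence scores are another option). A minimal scikit-learn sketch on a tiny made-up corpus; a real application would use a much larger corpus and ideally multiple random restarts:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Tiny illustrative corpus with two obvious themes; real corpora are far larger.
docs = [
    "tractor plow harvest soil", "harvest soil tractor farm",
    "pasta sauce tomato basil", "basil tomato pasta recipe",
    "tractor farm soil plow", "sauce recipe basil pasta",
] * 5

X = CountVectorizer().fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Lower held-out perplexity suggests a better K, though perplexity is known
# to sometimes disagree with human judgments of topic interpretability.
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(k, lda.perplexity(X_test))
```

Because the EM-style optimization is non-convex, it is common practice to repeat this over several random seeds and report the range.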

toecn commented 3 years ago

How should we think about integrating metadata with text data when constructing topic models? What can metadata help us do, validate, or expand in terms of analysis?

Raychanan commented 3 years ago

The EM algorithm is closely connected to the Naive Bayes method. Does this mean it can no longer be considered a good algorithm? I ask because Naive Bayes is no longer considered an advanced algorithm in applied work.

I have another question: the EM algorithm seems very similar to classification based on topic modeling, so I am curious what the difference between them is. When should the EM algorithm be used instead of other classification methods?

romanticmonkey commented 3 years ago

Are there studies that discovered new population segmentations with text data through clustering methods in the existing literature?

jcvotava commented 3 years ago

I have a (perhaps embarrassingly) simple question about LDA and topic modeling: what is the relationship between the number of clusters/topics formed (according to the various unsupervised metrics mentioned in lecture, like the silhouette formula) and the number of documents? For example, imagine that Journal X ran 80 very similar papers on farm equipment, 10 papers on pasta recipes, 5 papers on neurobiology, and 5 papers on analytic philosophy. Despite organically having 4 clear topics, would the construction of any of these algorithms push artificially for more or fewer topics, or is the number of documents already de-weighted in the formula? What kind of approach would be appropriate in this instance, or in an even more extreme case where very, very distinct topics had very few associated documents?
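
The imbalance question can be probed empirically: the silhouette score averages over points, so the dominant cluster's points carry most of the weight, but well-separated small clusters still register. A scikit-learn sketch on synthetic 2-D data whose cluster sizes mirror the hypothetical journal example (the centers and spreads are arbitrary choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated clusters with very unequal sizes: 80 / 10 / 5 / 5,
# mirroring the hypothetical journal's document counts.
X, _ = make_blobs(
    n_samples=[80, 10, 5, 5],
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=1.0, random_state=0,
)

# Compare average silhouette scores across candidate cluster counts.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```

When the small clusters are genuinely distinct (far from the big one), the silhouette score still rewards separating them; the harder failure mode is small clusters that are both rare and close to a large one.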

jacyanthis commented 3 years ago

What do you think of seeded LDA? When is it useful, and when is it not?

MOTOKU666 commented 3 years ago

Can you introduce more ways to incorporate metadata into topic models? The paper briefly mentions models of linguistic structure, models that account for distances between corpora, and models of named entities, and notes that general-purpose methods for incorporating metadata include Dirichlet-multinomial regression models and supervised topic models. For example, how is the distance between corpora accounted for?

zshibing1 commented 3 years ago

Is it possible to use unsupervised methods on corpora that have rigid structures (e.g., policy documents) but contain relatively fewer words (less than a million)?

ming-cui commented 3 years ago

The authors indicate that LDA is the simplest kind of topic model, yet I have read sociology papers using LDA published in leading journals, so LDA seems powerful and capable in its own right. Are there any other topic modeling techniques that we should take a look at?

Rui-echo-Pan commented 3 years ago

LDA is useful for comparing the topics of different texts; what should we use to analyze how topics change over a long time span? Could we build such an analysis on top of LDA?

william-wei-zhu commented 3 years ago

How do we identify the optimal values for the tuning parameters "document - topic sparsity" and "topic-word sparsity"?
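
In LDA these sparsity knobs correspond to the Dirichlet hyperparameters (often written alpha for document-topic and beta or eta for topic-word); smaller values yield sparser distributions, and they are typically tuned via held-out likelihood or set to common defaults like 1/K. A small numpy illustration of the effect (the thresholds and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_effective_topics(alpha, n_topics=10, n_docs=2000):
    """Draw document-topic distributions from a symmetric Dirichlet(alpha)
    and report the average number of topics holding >5% of a document's mass."""
    theta = rng.dirichlet([alpha] * n_topics, size=n_docs)
    return (theta > 0.05).sum(axis=1).mean()

# Smaller alpha -> sparser documents (fewer active topics per document).
for alpha in (0.01, 0.1, 1.0):
    print(alpha, avg_effective_topics(alpha))
```

The same logic applies to the topic-word prior: a small eta concentrates each topic on few words, which tends to make topics more interpretable but can fragment themes.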

Bin-ary-Li commented 3 years ago

Is there any benchmark that compares different non-parametric clustering methods? I wonder if there is any consensus/common practice on "when to use/what to use/use on what" in applying clustering methods to data.

xxicheng commented 3 years ago

Could you please explain more about metadata? How does it affect topic modeling results?

sabinahartnett commented 3 years ago

It seems like topic models require a LOT of back and forth between researcher and machine to optimize. Is there an automated way to do this, or a good out-of-the-box number of topics to start from?

theoevans1 commented 3 years ago

What kinds of considerations should be taken into account when deciding between classification and clustering for a research question? Is there ever a reason to use both methods together?

egemenpamukcu commented 3 years ago

I am also interested in best practices for determining the number of topics for LDA. I would also like to hear more about approaches for comparing the clusters generated by an unsupervised learning method against predetermined classes (classified by an expert, or having 'natural' categories). What would be some interesting applications of such a mixed approach?
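
For comparing unsupervised clusters against predetermined classes, agreement metrics such as the adjusted Rand index or normalized mutual information are standard; a minimal scikit-learn sketch (the two labelings below are made up):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical expert classes vs. cluster assignments for 8 documents.
expert  = [0, 0, 0, 1, 1, 1, 2, 2]
cluster = [1, 1, 1, 0, 0, 2, 2, 2]  # cluster IDs permuted; one document disagrees

# Both metrics are invariant to label permutation; 1.0 means perfect agreement.
print(adjusted_rand_score(expert, cluster))
print(normalized_mutual_info_score(expert, cluster))
```

Because both metrics ignore the arbitrary numbering of clusters, they answer "do these partitions group the same documents together?" rather than "do the labels match?", which is the right question when clusters have no inherent names.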

lilygrier commented 3 years ago

Similarly to @zshibing1, I'm wondering about the feasibility of performing topic modeling on documents with rigid structures. Specifically, I'm thinking about executive orders or legislative bills, which tend to focus heavily on procedures and logistics and look similar regardless of policy context. Is there something analogous to TF-IDF that occurs in topic modeling to distinguish between very similar documents?

mingtao-gao commented 3 years ago

One of the main goals of topic modeling is "discovering and exploiting the hidden thematic structure in large archives of text." However, the scope of the corpus and the number of topics are selected before modeling, so how do we mitigate selection bias in this case?

dtanoglidis commented 3 years ago

About LDA and its assumptions: there is a discussion about relaxing the assumptions made by topic modeling algorithms, and I was wondering the following: are all topics equally distinct? To rephrase: imagine we have two texts, each composed of two topics; say Airbnb reviews contain discussions of both the location and the listing itself. In the first text the discussion is more polarized, without any overlap between the topics, while in the second the two are intertwined. Is there a way LDA can distinguish between the two?