Open HyunkuKwon opened 3 years ago
This method's obvious gain is that it efficiently and accurately estimates the category proportions of large corpora. Yet it only works if researchers know, a priori, which categories are present in the corpora. Are there any similar methods that allow a content analyst to gather insights into the number of categories and their relative proportions when such knowledge is unavailable?
One prerequisite of this method is properly defining the categories in our research. Do you have any suggestions? What standards, experiences, or methods would you draw on when defining categories in your own research?
Second, I think it is sometimes really hard to say which of two categories a document should fall into. For instance, suppose there are two categories: women and rights. An essay titled "Women's Rights" may spend 40.2% of its words on women and 39.7% on rights. Technically, it's easy to place the essay in the "women" category, but it is also reasonable to say it belongs to "rights." So how should we deal with this? I think this kind of situation is less common, but I'm just curious about it. Thanks!
This paper is insightful because it really brings the objectives of social science into a mathematical framework, instead of just using math tools as black boxes to produce dubious social science research. Still, I have a question about this paper's relevance in 2021. Specifically, the reason this method is important is the 75-85% individual classification accuracy. If instead we had 95-99% accuracy on big datasets, it seems the aggregate misclassification problem would be trivial. Therefore, the two questions I have are:
This was really interesting! In this automated nonparametric method, the authors introduce & explain the expression:
P(S) = P(S | D) * P(D)
where P(S) is the probability of a word stem profile occurring, P(S|D) is the probability of the word stem profile occurring within the documents in category D, and P(D) is the probability that a document falls in category D (the quantity of interest here). Of course, with this example, P(S|D) needs to be assumed to be the same in the population as in the hand-coded sample....
Does this equation look similar to other models for predicting classification/category probability? And if not, does it make sense to apply it beyond word stem profiles (imagining using a vector of metadata rather than word stem profiles (S))?
And additionally, does the heavy reliance on and trust in P(S|D) (based on hand coders), and the strength of this model against SVM, polynomial, and sigmoid kernel models, signal that human-informed models still commonly outperform fully automated or unsupervised models?
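The estimation step behind this equation can be sketched in a few lines of numpy. This is a hypothetical illustration with made-up numbers, not the authors' ReadMe implementation (which additionally constrains P(D) to the simplex and averages over random subsets of word stems):

```python
import numpy as np

# Hypothetical setup: 4 word stem profiles (rows), 2 categories (columns).
# P_S_given_D[s, d] = estimated P(profile s | category d) from hand-coded docs.
P_S_given_D = np.array([
    [0.5, 0.1],
    [0.3, 0.2],
    [0.1, 0.3],
    [0.1, 0.4],
])

# True (unknown) category proportions, used here only to simulate the population.
true_P_D = np.array([0.7, 0.3])

# Observed profile distribution in the unlabeled population.
P_S = P_S_given_D @ true_P_D

# Solve P(S) = P(S|D) P(D) for P(D) by (unconstrained) least squares.
P_D_hat, *_ = np.linalg.lstsq(P_S_given_D, P_S, rcond=None)
print(np.round(P_D_hat, 3))  # recovers [0.7, 0.3]
```

The point of the sketch is that P(D) is recovered from aggregate profile frequencies alone, without classifying any individual document, provided P(S|D) transfers from the hand-coded sample to the population.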
The approach the authors suggest assumes that the population P(S|D) is the same as the P(S|D) of the hand-coded portion of the sample. I wanted to learn why this assumption is more acceptable, and harder to violate, than the assumptions made by the direct sampling approach, which relies only on random sampling. Wouldn't direct sampling be the more straightforward and easier approach, especially in the context of content analysis, where completely random selection from the entire population can be achieved in many cases?
Two questions:
Similarly to @Willy624, I'm wondering when I would choose this method out of all the classification methods we have available. When would you use it? And in general, is there a decision tree anywhere that summarizes which classification methods are best for which research projects?
This paper suggests hand coding around 500 documents, but also mentions that coding as few as 100 documents could be sufficient for some purposes (such as national surveys with a 4 percent margin of error). What considerations should be taken into account in deciding how many documents to code, or in determining an appropriate margin of error?
Since the result that 500 documents are sufficient (figure 5) is based on the running example, is there a general way to calculate how many documents are needed to reach a satisfactory level of confidence when using this method?
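One rough, back-of-the-envelope consideration is the standard margin-of-error formula for a proportion, treating each hand-coded document as an independent Bernoulli draw. This is a simplification (the method's real error also depends on how well P(S|D) transfers and how distinct the categories' language is), but it shows the shape of the trade-off:

```python
import math

def moe(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n documents."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 200, 500, 1000):
    print(n, round(moe(n), 3))
# 100 -> 0.098, 200 -> 0.069, 500 -> 0.044, 1000 -> 0.031
```

Under this crude approximation, gains flatten quickly past a few hundred documents, which is at least consistent with the diminishing returns the paper reports in figure 5.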
The idea proposed by this paper is indeed fascinating. I wonder whether the proposed method is more suitable for texts that are closer to natural language, and whether texts rich in social science content also need to be rich in vocabulary. If we apply the method to, say, a country's constitution or a body of law, where the language is less natural and the vocabulary is much smaller than in a batch of articles or blogs, what might the implications and outlook be?
I have two questions:
First, the authors mention that their method requires the original documents to contain the needed information. I imagine that under some circumstances the blogs we collect seem related to our topic but actually are not. For example, a post may contain the keyword "presidential election" while actually talking about the weather. I am curious about the preprocessing procedures used to deal with irrelevant texts like these.
Secondly, regarding the number of documents that need to be hand-coded, figure 5 shows minimal improvement as the number increases from 200 to 1000. Can this result be generalized? And does it mean that the number of coded documents in fact does not matter much if we only consider the error rate? Thank you.
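On the first question, one option is simply to make "irrelevant" one of the hand-coded categories and let the method estimate its proportion. A cruder preprocessing alternative is a keyword screen; here is a minimal hypothetical sketch (the term lists and threshold are invented for illustration):

```python
# Hypothetical relevance screen: keep a post only if, beyond the search
# phrase itself, it mentions at least k other on-topic terms.
TOPIC_TERMS = {"vote", "candidate", "ballot", "campaign", "senate"}

def is_relevant(text, k=2):
    tokens = set(text.lower().split())
    return len(tokens & TOPIC_TERMS) >= k

docs = [
    "the presidential election campaign and the final vote count",
    "lovely weather on presidential election day",
]
print([is_relevant(d) for d in docs])  # [True, False]
```

A real pipeline would likely use a trained relevance classifier rather than fixed lists, but the filtering step would occupy the same place: before any hand coding or proportion estimation.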
It appears to me that one of the primary motivations behind some of the estimation techniques used in the paper, such as the calculation over random subsets of words and subsequent averaging, was that directly computing P(D) by regression was computationally infeasible on most machines in 2010.
How has the relevance of this paper changed with the incredible advances in computational power available nowadays on home systems and the availability of cloud-based HPC resources?
This article is interesting and useful. I just noticed that it was published almost ten years ago. Although it has been cited ~850 times, it seems that many of the citing articles did not use the package directly. Is this method/package still popular? Have any new relevant methods/packages/programs come out?
Like many comments posted before, I am also interested in the impact and popularity of this article/method.
Sentiment analyses are, of course, important. Yet, there seems to be much more information in texts that we can use for research. For instance, what is the corpus showing on associations between a specific politician and relevant dimensions like national or internal security, poverty, environmental issues, ideas about minorities, immigration, and others? This path forward entails not thinking of content classification exclusively over a one-dimensional space (e.g., good-bad or positive-negative) but thinking of classifications in a multi-dimensional space. What limitations should we keep in mind when applying Hopkins and King's (2010) method for multi-dimensional classifications?
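One concrete limitation worth flagging for the multi-dimensional case: if each combination of dimensions is treated as its own category, the category set grows multiplicatively, and the method needs enough hand-coded documents in every category. A toy illustration (the dimensions are hypothetical):

```python
from itertools import product

# Hypothetical coding dimensions for posts about a politician.
sentiment = ["positive", "neutral", "negative"]
topic = ["security", "poverty", "environment", "immigration"]

# Crossing the dimensions yields 3 * 4 = 12 categories, each of which
# would need its own hand-coded examples.
crossed = list(product(sentiment, topic))
print(len(crossed))  # 12
```

With more dimensions the product grows quickly, so the hand-coding burden, not the estimation itself, may become the binding constraint.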
The technique described in this paper assumes that the categories of analysis are static over time, or at least that the language which would place a text in a category is the same over time. (To take the paper's example, it assumes 1) that "very positive/positive/neutral/negative/very negative" sentiments are universally valid categories, and 2) the language that would qualify a text to be included in one of the categories can be modeled statically.)
I'm curious what happens if the way language is used changes over the life of the model. To take Twitter discourse as an example, it seems plausible that new ways of talking about current events and politicians may arise, including complex forms of sarcasm and satire, that a static model is bound to struggle with. On the other hand, random sampling with human coding seems like it would be totally adequate for studying such a corpus. Or is there a way that modeling can still be a more appealing alternative?
I am just a little curious: because this nonparametric method requires "a reasonable number of documents in each category of D to be hand-coded," will the majority of researchers find this feasible without external help such as RAs or MTurk? Is it practical?
What role did this paper play overall in the development of NLP/content analysis?
I'm just curious: sentiment analysis seems to rely largely on hand-coding, and I heard that social scientists have reservations about treating sentiment analysis as a computational method. I'm wondering whether that is true, and what the essential difference is between sentiment analysis and other analysis methods.
Following @ming-cui's idea, I'm also curious about how people use this model in practice, since it "also requires far less work than projects based entirely on hand-coding, and much less work than most computer science methods of individual classification; it is both fast and can be used in real-time."
This paper does a very good job of modifying the objective of off-the-shelf machine learning algorithms for social science purposes. In essence, I think social science cares about unbiased estimates while machine learning focuses on prediction. This paper modifies the objective to remove bias in the population distributional estimates. On the other hand, with the development of modern algorithms and the availability of big data, ML sometimes does not face such a bias-prediction trade-off. More often we see that it can reduce bias and improve prediction performance at the same time.
Given this, I am wondering what you think are appropriate scenarios for modifying machine learning objectives. Thanks!
I appreciated this application of machine learning techniques to a social science problem. Because this method requires a "reasonable" number of texts to be hand-coded at the start, how do researchers use it in conjunction with the coder reliability methods we learned about last week? Also, because this method can be applied to large bodies of text, is it common for it to be scaled up to cloud and other distributed computing centers? Or do the gains from large-scale computing mean that less efficient but more accurate methods are favored today?
This paper provides a good design for constructing a general and robust text classifier. As mentioned by @egemenpamukcu, my question is: if P(S|D) is not the same in the population and labeled datasets, how can we adjust for better reliability?
This paper provides an interesting alternative approach to classification, focusing not on individual classes but on proportions. My question is the following: how is this procedure related to the problem of class imbalance? My understanding is that problems similar to those presented in this paper arise from highly imbalanced training sets, which can result in highly accurate classification but low precision/recall. However, there are ways to mitigate these problems, for example by upsampling/downsampling the smaller/larger class. I'm curious whether such an approach would work here.
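For concreteness, the rebalancing idea mentioned above might look like the following numpy sketch (the labels are hypothetical; whether rebalancing helps here is exactly the open question, since Hopkins and King's estimator targets the proportions P(D) directly rather than per-document labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced hand-coded labels: 90 in class 0, 10 in class 1.
labels = np.array([0] * 90 + [1] * 10)

# Upsample the minority class with replacement to match the majority.
minority = np.flatnonzero(labels == 1)
extra = rng.choice(minority, size=90 - 10, replace=True)
balanced = np.concatenate([np.flatnonzero(labels == 0), minority, extra])

counts = np.bincount(labels[balanced])
print(counts)  # [90 90]
```

Note that upsampling changes the apparent class proportions in the training set, which is harmless for a per-document classifier but would directly distort a proportion estimator unless corrected for.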
Post questions about the following orienting reading:
Hopkins, Daniel J., and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science 54(1): 229-247.