Open HyunkuKwon opened 3 years ago
This method's obvious gain is that it efficiently and accurately estimates the category proportions of large corpora. Yet it only works if researchers know, a priori, which categories are present in the corpora. Are there any similar methods that allow a content analyst to gather insights into the number of categories and their relative proportions when such knowledge is unavailable?
One prerequisite of this method is properly defining the categories in our research. Do you have any suggestions? What standards, experiences, or methods would you draw on when defining categories in your own research?
Second, I think it is sometimes really hard to say which of two categories a document should fall into. For instance, suppose there are two categories: women and rights. An essay titled "Women's Rights" may spend 40.2% of its words on women and 39.7% on rights. Technically, it's easy to place the essay in the "women" category, but it is also reasonable to say it belongs to "rights." So how should we deal with this? I think this kind of situation is less common, but I'm just curious about it. Thanks!
This paper is insightful because it really brings the objectives of social science into a mathematical framework, instead of just using math tools as black boxes to produce dubious social science research. Still, I have a question about this paper's relevance in 2021. Specifically, the reason this method is important is the 75-85% individual classification accuracy. If instead we had 95-99% accuracy on big datasets, it seems the aggregate misclassification problem would be trivial. Therefore, the two questions I have are:
This was really interesting! In this automated nonparametric method, the authors introduce & explain the expression:
P(S) = P(S | D) * P(D)
where P(S) is the probability of a word stem profile occurring, P(S|D) is the probability of the word stem profile occurring within the documents in category D, and P(D) is the probability that a document falls in category D (the quantity of interest here). Of course, with this example, P(S|D) needs to be assumed to be the same in the population as in the hand-coded sample....
Does this equation look similar to other models for predicting classification/category probability? And if not, does it make sense to apply it beyond word stem profiles (imagining using a vector of metadata rather than word stem profiles (S))?
And additionally, does the heavy reliance on and trust in P(S|D) (based on hand coders), and the strength of this model against SVM, polynomial, and sigmoid kernel models, signal that human-informed models still commonly outperform fully automated or unsupervised models?
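The estimation step behind this equation can be sketched in a few lines of numpy. This is a hypothetical illustration with made-up numbers, not the authors' ReadMe implementation (which additionally constrains P(D) to the simplex and averages over random subsets of word stems):

```python
import numpy as np

# Hypothetical setup: 4 word stem profiles (rows), 2 categories (columns).
# P_S_given_D[s, d] = estimated P(profile s | category d) from hand-coded docs.
P_S_given_D = np.array([
    [0.5, 0.1],
    [0.3, 0.2],
    [0.1, 0.3],
    [0.1, 0.4],
])

# True (unknown) category proportions, used here only to simulate the population.
true_P_D = np.array([0.7, 0.3])

# Observed profile distribution in the unlabeled population.
P_S = P_S_given_D @ true_P_D

# Solve P(S) = P(S|D) P(D) for P(D) by (unconstrained) least squares.
P_D_hat, *_ = np.linalg.lstsq(P_S_given_D, P_S, rcond=None)
print(np.round(P_D_hat, 3))  # recovers [0.7, 0.3]
```

The point of the sketch is that P(D) is recovered from aggregate profile frequencies alone, without classifying any individual document, provided P(S|D) transfers from the hand-coded sample to the population.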
The approach the authors suggest assumes that the population P(S|D) is the same as the P(S|D) of the hand-coded portion of the sample. I wanted to learn why this assumption is more acceptable, and harder to violate, than the assumptions made by the direct sampling approach, which relies only on random sampling. Wouldn't direct sampling be the more straightforward and easier approach, especially in the context of content analysis, where completely random selection from the entire population can be achieved in many cases?
Two questions:
Similarly to @Willy624, I'm wondering when I would choose this method out of all the classification methods we have available. When would you use it? And in general, is there a decision tree anywhere that summarizes which classification methods are best for which research projects?
This paper suggests hand coding around 500 documents, but also mentions that coding as few as 100 documents could be sufficient for some purposes (such as national surveys with a 4 percent margin of error). What considerations should be taken into account in deciding how many documents to code, or in determining an appropriate margin of error?
Since the result that 500 documents are sufficient (figure 5) is based on the running example, is there a general way to calculate how many documents are needed to reach a satisfactory level of confidence when using this method?
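One rough, back-of-the-envelope consideration is the standard margin-of-error formula for a proportion, treating each hand-coded document as an independent Bernoulli draw. This is a simplification (the method's real error also depends on how well P(S|D) transfers and how distinct the categories' language is), but it shows the shape of the trade-off:

```python
import math

def moe(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n documents."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 200, 500, 1000):
    print(n, round(moe(n), 3))
# 100 -> 0.098, 200 -> 0.069, 500 -> 0.044, 1000 -> 0.031
```

Under this crude approximation, gains flatten quickly past a few hundred documents, which is at least consistent with the diminishing returns the paper reports in figure 5.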
The idea proposed by this paper is indeed fascinating. I wonder whether the proposed method is more suitable for texts that are closer to natural language, and whether texts rich in social science content also need to be rich in vocabulary. If we apply the method to, say, a country's constitution or a body of law, where the language is less natural and the vocabulary is much smaller than in a batch of articles or blogs, what might the implications and outlook be?
I have two questions:
First, the authors mention that their method requires the original documents to contain the needed information. I imagine that under some circumstances the blogs we collect seem related to our topic but actually are not. For example, a post may contain the keyword "presidential election" while actually talking about the weather. I am curious about the preprocessing procedures used to deal with irrelevant texts like these.
Secondly, regarding the number of documents that need to be hand-coded, figure 5 shows minimal improvement as the number increases from 200 to 1000. Can this result be generalized? And does it mean that the number of coded documents in fact does not matter much if we only consider the error rate? Thank you.
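On the first question, one option is simply to make "irrelevant" one of the hand-coded categories and let the method estimate its proportion. A cruder preprocessing alternative is a keyword screen; here is a minimal hypothetical sketch (the term lists and threshold are invented for illustration):

```python
# Hypothetical relevance screen: keep a post only if, beyond the search
# phrase itself, it mentions at least k other on-topic terms.
TOPIC_TERMS = {"vote", "candidate", "ballot", "campaign", "senate"}

def is_relevant(text, k=2):
    tokens = set(text.lower().split())
    return len(tokens & TOPIC_TERMS) >= k

docs = [
    "the presidential election campaign and the final vote count",
    "lovely weather on presidential election day",
]
print([is_relevant(d) for d in docs])  # [True, False]
```

A real pipeline would likely use a trained relevance classifier rather than fixed lists, but the filtering step would occupy the same place: before any hand coding or proportion estimation.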
It appears to me that one of the primary motivations behind some of the estimation techniques used in the paper, such as the calculation over random subsets of words and subsequent averaging, was that directly computing P(D) by regression was computationally infeasible on most machines in 2010.
How has the relevance of this paper changed with the incredible advances in computational power available nowadays on home systems and the availability of cloud-based HPC resources?
This article is interesting and useful. I just noticed that it was published almost ten years ago. Although it has been cited ~850 times, it seems that many of the citing articles did not use the package directly. Is this method/package still popular? Have any new relevant methods/packages/programs come out?
Like many comments posted before, I am also interested in the impact and popularity of this article/method.
Sentiment analyses are, of course, important. Yet, there seems to be much more information in texts that we can use for research. For instance, what is the corpus showing on associations between a specific politician and relevant dimensions like national or internal security, poverty, environmental issues, ideas about minorities, immigration, and others? This path forward entails not thinking of content classification exclusively over a one-dimensional space (e.g., good-bad or positive-negative) but thinking of classifications in a multi-dimensional space. What limitations should we keep in mind when applying Hopkins and King's (2010) method for multi-dimensional classifications?
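One concrete limitation worth flagging for the multi-dimensional case: if each combination of dimensions is treated as its own category, the category set grows multiplicatively, and the method needs enough hand-coded documents in every category. A toy illustration (the dimensions are hypothetical):

```python
from itertools import product

# Hypothetical coding dimensions for posts about a politician.
sentiment = ["positive", "neutral", "negative"]
topic = ["security", "poverty", "environment", "immigration"]

# Crossing the dimensions yields 3 * 4 = 12 categories, each of which
# would need its own hand-coded examples.
crossed = list(product(sentiment, topic))
print(len(crossed))  # 12
```

With more dimensions the product grows quickly, so the hand-coding burden, not the estimation itself, may become the binding constraint.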
The technique described in this paper assumes that the categories of analysis are static over time, or at least that the language which would place a text in a category is the same over time. (To take the paper's example, it assumes 1) that "very positive/positive/neutral/negative/very negative" sentiments are universally valid categories, and 2) the language that would qualify a text to be included in one of the categories can be modeled statically.)
I'm curious what happens if the way language is used changes over the life of the model. To take Twitter discourse as an example, it seems plausible that new ways of talking about current events and politicians may arise, including complex forms of sarcasm and satire, that a static model is bound to struggle with. On the other hand, random sampling with human coding seems like it would be totally adequate for studying such a corpus. Or is there a way that modeling can still be a more appealing alternative?
I am just a little curious: because this nonparametric method requires "a reasonable number of documents in each category of D to be hand-coded," will the majority of researchers find this feasible without external help such as RAs or MTurk? Is it practical?
What role did this paper play overall in the development of NLP/content analysis?
I'm just curious: sentiment analysis seems to rely largely on hand-coding, and I heard that social scientists have reservations about treating sentiment analysis as a computational method. I'm wondering whether that is true, and what the essential difference is between sentiment analysis and other analysis methods.
Following @ming-cui's idea, I'm also curious about how people use this model in practice, since it "also requires far less work than projects based entirely on hand-coding, and much less work than most computer science methods of individual classification; it is both fast and can be used in real-time."
This paper does a very good job of modifying the objective of off-the-shelf machine learning algorithms for social science purposes. In essence, I think social science cares about unbiased estimates while machine learning focuses on prediction. This paper modifies the objective to remove bias in the population distributional estimates. On the other hand, with the development of modern algorithms and the availability of big data, ML sometimes does not face such a bias-prediction trade-off. More often we see that it can reduce bias and improve prediction performance at the same time.
Given this, I am wondering what you think are appropriate scenarios for modifying machine learning objectives. Thanks!
I appreciated this application of machine learning techniques to a social science problem. Because this method requires a "reasonable" number of texts to be hand-coded at the start, how do researchers use it in conjunction with the coder reliability methods we learned about last week? Also, because this method can be applied to large bodies of text, is it common for it to be scaled up to cloud and other distributed computing centers? Or do the gains from large-scale computing mean that less efficient but more accurate methods are favored today?
This paper provides a good design for constructing a general and robust text classifier. As mentioned by @egemenpamukcu, my question is: if P(S|D) is not the same in the population and labeled datasets, how can we adjust for better reliability?
This paper provides an interesting alternative approach to classification, focusing not on individual classes but on proportions. My question is the following: how is this procedure related to the problem of class imbalance? My understanding is that problems similar to those presented in this paper arise from highly imbalanced training sets, which can result in highly accurate classification but low precision/recall. However, there are ways to mitigate these problems, for example by upsampling/downsampling the smaller/larger class. I'm curious whether such an approach would work here.
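For concreteness, the rebalancing idea mentioned above might look like the following numpy sketch (the labels are hypothetical; whether rebalancing helps here is exactly the open question, since Hopkins and King's estimator targets the proportions P(D) directly rather than per-document labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced hand-coded labels: 90 in class 0, 10 in class 1.
labels = np.array([0] * 90 + [1] * 10)

# Upsample the minority class with replacement to match the majority.
minority = np.flatnonzero(labels == 1)
extra = rng.choice(minority, size=90 - 10, replace=True)
balanced = np.concatenate([np.flatnonzero(labels == 0), minority, extra])

counts = np.bincount(labels[balanced])
print(counts)  # [90 90]
```

Note that upsampling changes the apparent class proportions in the training set, which is harmless for a per-document classifier but would directly distort a proportion estimator unless corrected for.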
Post questions about the following orienting reading:
Hopkins, Daniel J., and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science 54(1): 229-247.