Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings, and posting responses.

Classifying Meanings & Documents - Hopkins & King 2010 #7

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

Share your questions here about:

Hopkins, Daniel J. and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science". American Journal of Political Science 54(1): 229-247.

katykoenig commented 4 years ago

On pg. 237, Hopkins and King note (1) that the number of possible word stem profiles is large (2^K) and (2) that the number of word profiles actually observed is significantly smaller than the number of possible word stem profiles, as reasons for randomly choosing subsets of words instead of using the entirety of a sample. Is it possible to use principal component analysis for dimensionality reduction instead, and if so, would their method of characterizing sets of documents still work?
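As a sketch of what I have in mind (my own illustration with scikit-learn, not anything from the paper; TruncatedSVD is the usual PCA-style choice for sparse document-term matrices):

```python
# Hypothetical illustration: PCA-style dimensionality reduction on a binary
# document-term matrix, as an alternative to sampling random word subsets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the senator is an idiot",
    "great speech by the senator",
    "spam spam buy now",
]

# Presence/absence of each word, loosely analogous to word-stem profiles
X = CountVectorizer(binary=True).fit_transform(docs)

# Collapse the large feature space into a handful of components
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```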

Also on pg. 243, they suggest that if a researcher using their method wants to reduce uncertainty, they need only hand code more documents and rerun the algorithm. At what point should we be concerned about overfitting our data? While there is a tradeoff between overfitting and uncertainty, are there guidelines regarding their optimal levels?

lkcao commented 4 years ago

I may be wrong, but this method gives me the impression that it is just a multinomial logistic regression based on the word distributions of different text categories. It claims that it does not produce individual-level predictions, which is not quite true, because it trains the model on small batches of words (5-25 randomly chosen words), and this in effect treats each randomly generated 5-25 word text as an individual unit. If the model were built on the whole corpus (instead of these small batches), then P(S) would be a vector with 1 in every position and estimation would become impossible. In the authors' own example, every text piece has 10 words; in that case, treating texts of similar word length as a unit, or treating the original text piece as a unit, will not lead to bias. But in the real world, when some people write really long blogs while others write relatively short ones (for example, in China, supporters of the government leave really long comments while opponents do not dare to say anything, or only say something very short), the distortion of patterns would be systematic. Does anyone have similar thoughts? (Or did I miss some points in this paper?)
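To check my own understanding of the aggregate step, here is a toy sketch (my own, not the authors' code, and it collapses the repeated random word subsets into a single small system):

```python
# Toy sketch of the aggregate estimator P(S) = P(S|D) * P(D), solved for P(D).
# S indexes word-stem profiles, D indexes document categories.
import numpy as np
from scipy.optimize import nnls

# P(S|D): rows are word-stem profiles, columns are categories,
# estimated from the hand-coded (labeled) set.
P_S_given_D = np.array([[0.6, 0.1],
                        [0.3, 0.2],
                        [0.1, 0.7]])

# P(S): profile frequencies observed in the unlabeled population set.
P_S = np.array([0.35, 0.25, 0.40])

# Solve the linear system with a non-negativity constraint, then renormalize
# so the estimated category proportions sum to one.
P_D, _ = nnls(P_S_given_D, P_S)
P_D = P_D / P_D.sum()
print(P_D)  # estimated category proportions, here [0.5, 0.5]
```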

ziwnchen commented 4 years ago

This paper reminds me of the dilemma between precision/recall or bias/variance, and thus between individual classification and group category proportion... Is this impression correct? If so, is there any potential way to find a balance between the two needs? That is to say, to balance between computer science and social science lol.

laurenjli commented 4 years ago

The authors claim that the method proposed is more robust with respect to changes in P(S) and/or P(D) on page 237: "In contrast, changes in either P(D) or P(S) between the labeled and population sets would be sufficient to doom existing classification-based approaches. For example, so long as “idiot” remains an insult, our method can make appropriate use of that information, even if the word becomes less common (a change in P(S)) or if there are fewer people who think politicians deserve it (a change in P(D))."

However, under the current approach, if there is a sufficient change in language such that "idiot" is no longer used in the same way, isn't it possible that this may change the labeling and therefore the number of words in the random subsets? In the same way that a classification-based approach would need to be re-trained, wouldn't this approach also need to be re-calibrated to account for the shift in language meaning?

wunicoleshuhui commented 4 years ago

On pages 242-243, Hopkins and King talk about the need to choose the right number of word stems to minimize bias in P(S|D) and to use training and test sets to determine the appropriate word stems. While sparseness bias is addressed in the article, what are the potential solutions to the bias that results from the opposite situation, i.e., from word stems that are too common but are of considerable relevance to a research project?
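One mitigation I can think of (my own suggestion, not from the article) is simply to cap document frequency when building the term matrix, for example:

```python
# Hypothetical illustration: drop word stems that appear in a very large share
# of documents (max_df) so ubiquitous stems do not dominate every profile.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the economy is the main issue",
    "the economy and the war",
    "the war is the main story",
]
# Keep stems appearing in at most 90% of documents; "the" is filtered out here.
vectorizer = CountVectorizer(binary=True, max_df=0.9)
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))
```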

tzkli commented 4 years ago

Hopkins and King state that one of the assumptions for Equation (7) is that P(S|D) is the same in the hand-coded set and the population set. Partially addressing the concern mentioned by @laurenjli, they suggest that researchers hand-code some additional documents when the corpus spans a long time period. I'm wondering what the rules of thumb would be for determining how large the time interval should be and how much of the data set should be hand-coded for each time period?

ccsuehara commented 4 years ago

Hi, I was wondering why the social sciences care more about generalization over the population of documents (e.g. proportions, or, let's say, broad classification) than about individual classification.

The authors make a very important point by offering quicker but still meaningful estimation tools (nonparametric ones) that are suited to the needs of the research. I was wondering if there are any other examples in which social science methods needed to be tailored, or re-suited, as in this example, from computer science ones.

ckoerner648 commented 4 years ago

Hopkins and King 2010 present a method to categorize different types of documents through text analysis. They sort every text into only one category. However, there may be documents that cover two topics equally. I'd be curious whether it is possible to write code that can independently decide between those cases and sort documents that fit one category into one and documents that fit two into two?

heathercchen commented 4 years ago

Hopkins and King present a novel and simple method for estimating the proportions of different categories in a target population. The mathematical intuition behind this problem is simply the Law of Total Probability. I have one detailed question about the arguments presented in this article and one about the future applications of this article.

  1. As the authors mention on p. 242, "coding more than about 500 documents to estimate a specific quantity of interest is probably not necessary". Is the sufficient size of the hand-coded sample related to the size of the whole target population? For example, if we are aiming at categorizing a population of a few thousand text materials, maybe 200-400 hand-coded samples are enough to obtain accurate estimations. But what if we are going to generalize to a population of hundreds of thousands of texts?
  2. This easily accessible and amenable method is well suited for analyzing text materials, as we can exhaustively divide a paragraph or a sentence into finite word segments. But that is not the case with images, which are hard to "divide". So my second question might sound quite unrealistic: is there any possibility that the method proposed by Hopkins and King could be applied to image categorization?
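Also, to spell out the Law of Total Probability intuition above in the paper's notation (where $S$ is a word-stem profile and $D$ is a document category):

$$P(S = s) = \sum_{d} P(S = s \mid D = d)\, P(D = d), \quad \text{i.e.} \quad P(S) = P(S \mid D)\, P(D),$$

so the quantity of interest is recovered as $\hat{P}(D) = [\hat{P}(S \mid D)]^{-1}\, \hat{P}(S)$, with $\hat{P}(S \mid D)$ estimated from the hand-coded set, $\hat{P}(S)$ from the population set, and the inverse taken in a least-squares sense since there are more word-stem profiles than categories.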

sunying2018 commented 4 years ago

On page 241, this article discusses the problem of how many documents need to be hand coded as a training set. It comes to the conclusion that "coding more than about 500 documents to estimate a specific quantity of interest is probably not necessary, unless one is interested in much more narrow confidence intervals than is common or in specific categories that happen to be rare." In this article, this conclusion is based on the plot of RMSE vs. the number of hand-coded documents, which shows that the drop in RMSE diminishes after 500. But I am confused about how this conclusion can be generalized to other corpora, since the RMSE plot is only for a certain corpus, and I think the nature or characteristics of the documents may influence the required number of hand-coded documents as well.

rachel-ker commented 4 years ago

In the article, the authors also acknowledge the importance of the labelled data. It seems crucial that the human labelling at the beginning be accurate, but I was wondering how we then evaluate the automated labelling given the data that we have. It seems crucial to test whether this method would be useful in our applications. Would we have to extrapolate this from the cross-validation tests done on the labelled set? Are there potentially other sanity checks we could do when testing on the larger unlabelled set?

In addition, I wonder whether this method would require scaling up the size of the labelled data linearly with the number of categories we decide to code for, and whether biases would differ for small or large numbers of categories?

di-Tong commented 4 years ago

I have the same question as @ckoerner648: how does this model deal with the multi-membership problem? Social science research usually deals with nuanced texts and categories that create situations in which a document could fall into more than one category (for instance, for a task that aims to classify the type of inequality a document discusses, an article that equally discusses gender inequality and racial inequality should be classified into two categories). While there are unsupervised machine learning methods, such as topic modeling, that assume multi-membership for each document, what does a supervised model for multi-membership documents look like?
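For concreteness, one standard supervised setup is multi-label classification, where each document gets a vector of 0/1 labels instead of a single category. A minimal scikit-learn sketch (my own illustration, not from the paper):

```python
# Hypothetical multi-label setup: each document may belong to several categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "gender pay gap widens",
    "racial wealth gap persists",
    "gender and racial gaps in hiring",
]
labels = [{"gender"}, {"race"}, {"gender", "race"}]  # multi-membership labels

Y = MultiLabelBinarizer().fit_transform(labels)  # documents x categories (0/1)
X = TfidfVectorizer().fit_transform(docs)

# One binary classifier per category; a document can be positive for several.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X))
```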

bjcliang-uchi commented 4 years ago

Dominiquo commented 4 years ago

I agree with the comments regarding the multi-class issue. Something like this can obviously be useful for gaining insight into a massive set of documents, but if the goal is to convince the reader about the true state of a set of documents, statements like "since the words chosen are by definition a function of the document category" make me hesitant about the level of insight into language. So with the "what can go wrong" section in mind, and now, a decade later, where might a method like this fit into a more "substantial" social science project?

deblnia commented 4 years ago

My thoughts are similar to @ccsuehara's and @ckoerner648's -- I'm interested in why the social sciences tend to use aggregate measures while the rest of the classification literature is "focused instead on increasing the accuracy of classification into individual document categories." This, along with Hopkins & King's (perhaps problematic) assertion of mutually exclusive and exhaustive categories, which assumes a priori knowledge, makes me wonder about SVMs for unsupervised clustering.

I might just be a little confused about SVMs in general -- I understand that they're a non-parametric and specifically supervised method in this context, but do they necessarily have to be?
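For my own notes: SVMs as usually presented are supervised, but there are label-free variants, e.g. the one-class SVM, which is fit on unlabeled data for novelty/outlier detection rather than clustering. A minimal scikit-learn sketch (my own illustration):

```python
# Minimal illustration: a one-class SVM is fit without any category labels.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # unlabeled "normal" points
new_points = np.array([[0.1, 0.0],   # near the training cloud
                       [4.0, 4.0]])  # far from it

model = OneClassSVM(nu=0.05).fit(X)
print(model.predict(new_points))     # +1 = inlier, -1 = flagged as novel
```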

HaoxuanXu commented 4 years ago

The thing that always seems fuzzy to me is the actual utility of the categorization of the entire document. What if a document presents multiple topics?

alakira commented 4 years ago

Since I'm not familiar with text-based classification tasks, the way the authors reduce the dimensionality to a feasible level seems too hasty to me. The authors mention that "the optimal number of words to use per subset is application-specific", but is the method still effective if we want to know the proportions of groups that have only a subtle difference? Nonetheless, I was excited about the way they skip the classification task and reach what they want with fewer restrictions.

yirouf commented 4 years ago

The article describes a faster method that will be beneficial for social science's emphasis on generalizability. Hopkins and King discuss a method for classifying different types of documents using text analysis. But I do think content analysis can be tricky sometimes. In my field, especially in developmental psychology, a lot of the literature does comparisons between two parenting or teaching styles. It strikes me that those texts would have two or three topics that are equally important. I'm curious about how this is handled in Hopkins and King's method, or did they anticipate this problem to be diminished by large samples?

sanittawan commented 4 years ago

On page 242, Hopkins and King talk about five potential problems when implementing the proposed method. The second issue is about making sure that the documents in the hand-coded set contain good examples of the kind of words and language used in the population set. In practice, I wonder what the best strategy for sampling documents to be hand-coded would be. If the body of text that we are working on is very large and covers a long period of time (and is thus susceptible to language drift), should we take a random sample from each period (e.g. year) and combine them for hand-coding?
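Something like a year-stratified random sample is what I have in mind; a small sketch with pandas (the column names and sizes are made up):

```python
# Hypothetical sketch: draw the hand-coding sample evenly across years so the
# labeled set reflects the language of every period.
import pandas as pd

df = pd.DataFrame({
    "text": [f"document {i}" for i in range(1000)],
    "year": [2000 + (i % 10) for i in range(1000)],
})

# 50 documents per year, combined into one hand-coding set
to_code = df.groupby("year").sample(n=50, random_state=0)
print(to_code["year"].value_counts())
```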

skanthan95 commented 4 years ago

I’m interested in how this approach picks up on implicit meanings within speech. How might this algorithm detect and sort these statements correctly? In this vein: given that for many years, computer scientists and social scientists were not collaborating/sharing insights, to what extent can pure-CS approaches to content analysis (some outlined on p 232) be applied to answering social science questions?

cindychu commented 4 years ago

In this article, the authors state that the calculation method (without individual classification) they put forward is suitable even when the distributions of P(S) and P(D) differ between the labeled sample and the population.

However, I was wondering, for machine learning methods, will this difference in P(S) or P(D) between sample and population bias the result? If it does, to what extent might it impact the result in machine learning? To my knowledge, the impact is minimal... Especially considering the computational difficulty and complexity of the method the authors put forward, its actual efficiency compared with machine learning methods remains unclear to me.

jsmono commented 4 years ago

The authors recommend using the bag-of-words approach to represent the data and address concerns related to that choice. Their explanation is persuasive, but under what circumstances is bag of words no longer applicable? Are there any other methods that can supplement the shortcomings of "bag of words"?
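For reference, a minimal sketch of the bag-of-words representation and one cheap supplement, n-grams, which retain some local word order (my own illustration with scikit-learn, not from the paper):

```python
# Bag of words discards word order; adding n-grams keeps some local order
# information (e.g. distinguishing "not good" from "good").
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the speech was not good", "the speech was good"]

bow = CountVectorizer()                        # unigrams only
bigrams = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams

print(sorted(bow.fit(docs).vocabulary_))
print(sorted(bigrams.fit(docs).vocabulary_))   # now includes "not good"
```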

cytwill commented 4 years ago

I think this paper generally shows a good framework for categorical exploration of text, and the authors also say a lot about the concerns and applications of their method. My questions are the following:

  1. Is the point where the marginal gain starts to diminish always the right place to stop hand-coding?
  2. How can we make those randomly selected words (unigrams) more consistent with those of the population, other than by increasing the volume of selected documents?

luxin-tian commented 4 years ago

This paper introduces a clear method and provides a practical toolkit for estimating social aggregates. One detail I am curious about regarding this method, as well as others for representing text as statistical variables, is that, since many preprocessing treatments are applied before statistical inference, can the same tricks directly apply to languages other than English? For example, the authors suggest "...empirically, most text sources make the same point in enough different ways that representing the needed information abstractly is usually sufficient..." (p. 232). Since little background or citation is given, I wonder what the necessary steps are before applying these methods to multilingual content analysis?
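As a small illustration of why the preprocessing steps are language-specific (my own sketch with NLTK; it only covers languages that already have stemmers and stop-word lists):

```python
# Stemming and stop-word removal rely on language-specific resources, so an
# English preprocessing pipeline does not transfer directly to other languages.
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

print(SnowballStemmer.languages)                     # languages with a stemmer
print(SnowballStemmer("spanish").stem("corriendo"))  # Spanish-specific stemming
print(stopwords.words("german")[:5])                 # German stop words
```

For languages without whitespace-delimited words (e.g. Chinese), even tokenization requires a different tool, so the pipeline diverges earlier still.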

chun-hu commented 4 years ago

Like @ziwnchen, I'm wondering how we should balance the dilemma between individual classification and group category proportion. What if our issue requires both methods?

iamlaurenbeard commented 4 years ago

Similar to the questions noted above, I also noticed that Hopkins and King's approach classifies documents on the basis of a single sub-category. Is there a way to capture the possible crossover between multiple categories in a feasible and efficient manner? What are some ways that it may be useful to allow categories to overlap versus restricting each article to one?

rkcatipon commented 4 years ago

@heathercchen Thank you for giving the math intuition, so helpful to have it simplified!

I'm interested in the relationship between the depth of differences in real-world opinion and the power of the sorting algorithm. Specifically, I wonder how nuanced the categories can be while text analysis software can still sort them with the necessary accuracy. Hopkins and King divide between spam, liberals, and conservatives. However, how good would our sorting be if we looked at subcategories of conservatives only?

arun-131293 commented 4 years ago

As mentioned by @di-Tong and others, the categorical classification is a shortcoming. I was wondering if topic modeling can be modified to classify documents, using the criterion that a document belongs to a topic if the majority of the words in it were most likely generated by that topic. That way, you can create for each document a ranking of the different topics, based on how many words are most likely generated by each topic within that document.
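Roughly what I'm imagining, as a sketch with scikit-learn's LDA (my own illustration; it ranks topics by their share of each document rather than literally counting per-word assignments):

```python
# Sketch: rank topics within each document by the share of the document they
# account for, instead of forcing a single category per document.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["tax budget economy", "war troops budget", "economy war tax troops"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topic = lda.transform(X)               # per-document topic shares

for shares in doc_topic:
    ranking = np.argsort(shares)[::-1]     # topics ranked by share
    print(ranking, shares[ranking])
```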

vahuja92 commented 4 years ago

My question is similar to @ccsuehara's. I'm unfamiliar with social science approaches to research, and am curious about why social scientists are more concerned with broad categorizations of whole sets of documents than with individual documents. Computer scientists seek to classify individual documents to predict a certain outcome (e.g., predicting support for a terrorist organization, as the authors mention). Do social scientists generally not try to predict certain outcomes with their corpora?

meowtiann commented 4 years ago

> Hopkins and King 2010 present a method to categorize different types of documents through text analysis. They sort every text into only one category. However, there may be documents that cover two topics equally. I'd be curious whether it is possible to write code that can independently decide between those cases and sort documents that fit one category into one and documents that fit two into two?

Then it would just be finding all the mentioned categories in one blog, which is less efficient than identifying only one. But I guess it is viable to find the top two categories in one blog.

I am curious about the secret weapon against spam blogs.

kdaej commented 4 years ago

The method used in this article classifies a large set of text data into subsets of "exclusive and exhaustive categories." However, in real situations, it might be common to have more ambiguous and overlapping categories. Although human coders are known to be good at noticing the key differences between texts and subdividing them, there remains a question of agreement among coders. If there is large variation between coders, that might suggest that the categories are not inherently exclusive but partially overlapping. I wonder if this ambiguity would matter for their method.

YanjieZhou commented 4 years ago

When computing the proportion of words, I think we are supposed to take the length of sentences into consideration, because it is hard to argue that a longer sentence will definitely express more intense meaning than a concise one. I wonder if this problem will impact the accuracy of the research results.
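What I have in mind is something like normalizing raw counts by document length before comparing documents (my own sketch, not anything from the paper):

```python
# Sketch: convert raw word counts into within-document proportions so long and
# short documents contribute on a comparable scale.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "short angry rant",
    "a much longer and far more measured angry essay about policy",
]
X = CountVectorizer().fit_transform(docs).toarray().astype(float)

proportions = X / X.sum(axis=1, keepdims=True)  # each row now sums to 1
print(proportions.sum(axis=1))
```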

Lizfeng commented 4 years ago

This article focuses on improving the classification of textual data by providing an unbiased estimate of the proportion of documents in given categories. This method requires a careful effort to properly define categories and to hand code a small sample of texts. Two of the problems with existing machine learning approaches to classification are that the labeled set may not be a random sample and that modeling P(D|S) is problematic in real-world scenarios.

VivianQian19 commented 4 years ago

Hopkins and King give an overview of using supervised methods to do content analysis in social science research, as well as ways to reduce bias. I wonder which SVM method has the best performance in classification tasks? In the Appendix, they discuss SIMEX, i.e. the simulation-extrapolation approach, to address misclassification. I wonder how commonly this approach is used?

yaoxishi commented 4 years ago

This paper provides a good framework for classification algorithms in text analysis. While the models try to provide unbiased estimates of the results, I am wondering how different extents of text preprocessing would influence the results of the statistical analysis of the text.