JunsolKim opened this issue 2 years ago
There are certain problems in social science where there is interest in individually classifying documents (for spam/hate speech/defamation/misinformation, etc.) and using those classifications downstream, but also in understanding the proportion of documents that fall under each category. These problems can also be volatile and sensitive to time. Accounting for population drift, how can this non-parametric method be applied meaningfully to such problems?
I'm confused as to why taking a random sample of blogs and then applying individual classification to those blogs (which is unbiased) can't be used to get unbiased estimates of document category proportions. Couldn't you just total the individual classifications, normalize, and use that as an estimate of the population proportions? This makes me believe that the biased results from my example above happen under strange conditions that aren't likely in applied settings.
I am pretty confused in general about what this paper is doing and why we care ... would really appreciate a short, clear exposition of the method!
The authors lay out two issues with existing approaches to estimating P(D): first, that the supposedly random samples used for labeling often aren't truly random, and second, the assumption that S (the word-stem profile?) can predict D, when in reality the opposite is true. I'm wondering if someone could elaborate on why random sampling is flawed as an approach, since I found the authors' explanation a bit hard to follow. Secondly, I don't fully understand the issue with using S to predict D, or why this is done in the first place.
In the section critiquing existing methods, the authors write that "when the labeled set is not a random sample from the population, both methods fail" (pg. 234). Why would the proposed alternative method work with a non-random sample and not produce skewed results?
Like some of my classmates above, I'm also confused about why estimating population proportions this way is biased, and how the authors resolve this problem.
I think it's pretty cool that this method handles the work of having to count and sort our documents of interest into categories and then manually come up with category proportions. That said, it seems the method is restricted to corpora where all the documents are more or less in the same domain: “the prevalence of particular word profiles in the labeled set should be the same in expectation as in the population set”. Broader cultural research often draws on documents across domains (policy speeches, tweets, Reddit threads, newspapers, etc.), so I'm wondering if there is a way to MacGyver the method to account for this.
Like other classmates, I find the paper a bit difficult to understand, especially the "Issues with Existing Approaches" section. I'm wondering 1) how the aggregation of individual document classifications to estimate P(D) is flawed; 2) why estimates of population proportions can still be biased even if classification succeeds with high accuracy; and 3) why the approach works even with a biased classifier.
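To make 1) and 2) concrete for myself, I tried a toy simulation (everything below, including the error rates and the function name `classify_and_count`, is my own made-up illustration, not the paper's code or data). If I set it up right, even a fairly accurate classifier gives a skewed aggregate once the true category proportions drift away from the one mix its error rates happen to balance at, which I think is the point the authors are making, but I'd appreciate a sanity check:

```python
import random

random.seed(0)

# Binary toy task: D = 1 ("on-topic") vs. D = 0 ("off-topic").
# Pretend the classifier has sensitivity 0.80 = P(pred 1 | truly 1)
# and specificity 0.90 = P(pred 0 | truly 0), regardless of the corpus.
SENS, SPEC = 0.80, 0.90

def classify_and_count(true_prop_1, n=100_000):
    """Simulate 'classify every document, then normalize the counts'."""
    n1 = int(n * true_prop_1)
    labels = [1] * n1 + [0] * (n - n1)
    preds = [
        (1 if random.random() < SENS else 0) if d == 1
        else (1 if random.random() < 1 - SPEC else 0)
        for d in labels
    ]
    return sum(preds) / n  # estimated proportion of category 1

for true_p in (0.50, 0.20, 0.05):
    est = classify_and_count(true_p)
    print(f"true P(D=1) = {true_p:.2f}   classify-and-count estimate = {est:.3f}")

# Accuracy stays in the 80-90% range throughout, but the expected estimate is
# SENS*p + (1-SPEC)*(1-p), which equals p only at one particular value
# (here p = 1/3), so the aggregate proportion is biased everywhere else.
```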
After reading the paper, I think the authors are doing a cool job of trying to overcome the ecological fallacy. However, in the "What Can Go Wrong" section I think I found something tricky: "Third, each category of D should be defined so as to be mutually exclusive, exhaustive, and relatively homogeneous." Does this point to a limitation of the approach? We can't really analyze something that has heterogeneous attributes. And if we want our categories to be mutually exclusive and exhaustive, it seems we can only do classifications similar to the one the authors propose in the paper: "extremely negative (−2), negative (−1), neutral (0), positive (1), extremely positive (2), no opinion (NA), and not a blog (NB)." If that is the case, why not use individual-level automated classification tools and correct the aggregated results at the group level?
I have two questions on this paper: (1) Regarding the critical assumption in Equation 7: why should we think the documents in the hand-coded set contain enough good examples of the language used for each document category in the population, and why is this assumption more practical than those required by other possible methods? (2) I'm curious which kinds of political science corpora fit this method best. There are many different classification methods; how do we know which one is better in a given research context?
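On (1), here is my attempt to restate what I think the Equation 7 assumption amounts to, in my own (possibly slightly off) notation, so someone can correct me if I'm misreading it:

```latex
% Core accounting identity, as I read it (K word stems, J categories):
P(S) = P(S \mid D)\, P(D)
% P(S):        2^K x 1 vector of word-stem-profile proportions in the population
% P(S \mid D): 2^K x J matrix of profile proportions within each category
% P(D):        J x 1 vector of category proportions (the quantity of interest)

% The hand-coded set is used only to estimate P(S \mid D), so the critical
% assumption seems to be that, in expectation,
P^{\mathrm{labeled}}(S \mid D) = P^{\mathrm{population}}(S \mid D)
% i.e., within each category, the labeled documents use language the same way
% the population documents do. Nothing requires the category proportions P(D)
% themselves to match between the labeled set and the population, which is why
% the labeled set apparently need not be a random sample.
```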
I think the paper is doing a very brave and excellent job here! That said, the authors seem to be trying to establish a general procedure for performing text classification tasks, and I wonder how general it can be. Perhaps their method works well with blogs, but what about news, speeches, books, and all the other kinds of material, including material in other languages? Also, what about tasks other than sentiment classification?
This is an interesting one, and it's telling that it's boggling so many of us! I think it is useful to distinguish the goal of classifying individual documents (and then aggregating) from that of estimating topic proportions directly. I wonder, though, would it be possible to apply this method to get topic proportions not only at the highest level but also within different time slices?
I also got confused by the article, particularly where the authors claim that "the quantity of interest in most of the supervised learning literature is the set of individual classifications for all documents in the population... the quantity of interest for most content analyses in social science is the aggregate proportion of all (or a subset of all) of these population documents that fall into each category."
I think this article makes a good point that there should be a difference between computer science's models and social science's models. Since a lot of data is discussed in the article, I wonder, in the "how many documents need to be hand coded" section, whether the 100-document rule applies to other data sets as well, or whether it should be read as a proportion of the corpus. Also, since all the examples in the paper are fairly small (a few thousand instances), I wonder whether the bias would naturally shrink as the sample size increased, even without the algorithm.
Is this method language-specific? I would like to see this method's performance on other languages, especially languages from other language families (e.g., Sino-Tibetan).
I agree with other classmates that the paper is a bit obscure, and I would like to know if anyone cares to elaborate on the aggregation of individual classifications, specifically the methods for reversing the misclassification of unlabeled documents.
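For what it's worth, here is my rough understanding of the "reverse the misclassification" bookkeeping for aggregated individual classifications. This is only a sketch: the matrix entries and proportions are made up, and the paper's own estimator works with word-stem profiles rather than a classifier's predicted labels, though I believe the algebra is analogous.

```python
import numpy as np

# Suppose J = 3 categories. From the labeled set (e.g., via cross-validation)
# estimate the column-stochastic misclassification matrix
#   M[i, j] = P(classified as i | truly j).   (made-up numbers)
M = np.array([
    [0.80, 0.10, 0.05],
    [0.15, 0.85, 0.15],
    [0.05, 0.05, 0.80],
])

# Classify every unlabeled document and tabulate the raw predicted proportions.
raw_pred_props = np.array([0.30, 0.55, 0.15])   # also made up

# Since P(pred = i) = sum_j M[i, j] * P(true = j), the raw proportions can be
# "unwarped" by solving the linear system for the true proportions.
true_props_est = np.linalg.solve(M, raw_pred_props)
print(true_props_est, true_props_est.sum())   # entries sum to 1

# In practice I assume one would use constrained least squares so the estimate
# stays on the simplex, but this is the basic correction as I understand it.
```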
The paper proposes a statistical method, distinct from conventional computer science classification, for estimating document category proportions, especially in social science contexts. Though the authors emphasize that their approach is simple and does not rely on strong assumptions (i.e., random selection of the labeled set), I was wondering: isn't assuming the same misclassification probabilities for labeled and unlabeled data also a strong assumption, one that would also require proper randomization?
A couple of what-ifs after reading the paper: 1) From the textbook (Text as Data) we see how important word counts are for finding discriminating words and for grouping documents by words and topics. Does considering only the presence or absence of a word affect the accuracy of the trained classification model? 2) Since the hand coding was done for a limited number of documents in a time frame of very particular interest (November 2006), is there a possibility this could bias predictions? Would the results be reproducible had the hand-coders used data from a different time frame?
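On 1): the way I read it, the method never uses counts at all; each document is reduced to a binary vector of presence/absence over K chosen word stems, and S is that profile. Here is a toy sketch of that representation (the stems, the example documents, and the crude `startswith` stand-in for real stemming are all my own, not the paper's):

```python
STEMS = ["bush", "iraq", "tax"]   # K = 3 made-up word stems

def word_stem_profile(text, stems=STEMS):
    """Reduce a document to a binary presence/absence profile over the stems."""
    tokens = text.lower().split()
    return tuple(int(any(tok.startswith(s) for tok in tokens)) for s in stems)

docs = [
    "Bush announced new taxes today",
    "The Iraq debate continued in the Senate",
    "Taxes and Iraq dominated the speech",
]

for d in docs:
    print(word_stem_profile(d), "<-", d)

# With K stems there are 2**K possible profiles; as I understand it, the method
# tabulates how often each profile occurs in the hand-coded set (within each
# category) and in the population, discarding word counts by construction.
print("possible profiles:", 2 ** len(STEMS))
```

So my question 1) is really whether throwing away the counts like this costs accuracy relative to count-based representations.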
Firstly, I was wondering what the advantage is of estimating the proportion of documents in given categories over making broad characterizations about the whole set of documents in a social science context. When does individual-level classification become unimportant?
Secondly, I am confused about the process of this supervised learning. Does each text have a pre-determined category of interest?
Thanks!
I wonder how the method in this paper would really outperform alternatives in other cases, and how it would justify itself. For me, the paper did point out that in practice we often want to know the proportion of each category of documents, but it does not persuade me to embrace the new method.
As some classmates asked above, I was wondering how estimated document proportions are leveraged in social science settings. How is the bias of the method relevant to social science research?
Post questions here for this week's orienting reading: Hopkins, Daniel J. and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science 54(1): 229-247.