The biggest inspiration I take from this paper is its ideas on how to select unbiased samples. My question is about the Monte Carlo simulations. In the paper, the authors "begin with a simulated data set of 10 words and thus $2^{10} = 1,024$ possible word-stem profiles." However, I think automatically generated texts are quite different from real-world texts such as blog posts. In blogs, writing habits and vocabulary vary from person to person, and authors may use Internet buzzwords and colloquialisms that are not suitable or common in formal written documents. I wonder how this influences the accuracy of the estimates?
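To make the $2^{10} = 1,024$ figure concrete: a word-stem profile here is just the presence/absence pattern of each of the 10 simulated word stems in a document, so there are $2^{10}$ possible patterns. A tiny sketch (my own, not from the paper):

```python
from itertools import product

K = 10  # number of word stems in the simulated vocabulary

# Each word-stem profile is a binary vector: 1 if the stem appears in the
# document, 0 if it does not. Enumerating all of them gives 2^K profiles.
profiles = list(product([0, 1], repeat=K))
print(len(profiles))  # 1024
```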
I think stemming really helps cut down computing costs. However, I am worried that stemming could lead to a less sensitive classifier in some cases. For example, if the topics being classified are very similar, full words may allow better classification. Should we consider not stemming in some applications? Or, in those cases, should we use techniques different from this paper's?
Achilles heel of the method? This method is certainly intuitive and evidently powerful (a wonderful use of the laws of probability), and the authors do a great job of discussing five possible pitfalls and their remedies at length. Despite that, all three comments so far (mine, @Yilun0221's, and @wanitchayap's) express concerns in the neighborhood of stemming. I think this makes sense: the other four limitations are either verifiable with cross-validation (consistency of P(S|D) across the labeled and population sets), a traditional statistical problem not exclusive to this method (the size of the labeled set), or really a limitation of the social scientist that happens to transfer to the methodological end (choice of relevant S; defining D). Stemming, on the other hand, has the potential to twist the meaning of texts, which is worrisome if such words are frequent (though this could just be my lack of experience with stemming). For instance: badly (positive & negative) -> bad (negative); bigly (??) -> big (positive); and the class of negating and pejorative prefixes and suffixes, e.g., unconstitutional (negative) -> constitutional (positive), Trumpster (negative) -> Trump (positive).
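One quick sanity check is to run a few of these words through a standard stemmer. A minimal sketch using NLTK's Porter and Snowball stemmers (my choice of library, not the authors'); note that suffix-stripping stemmers leave prefixes like "un-" intact, so some of the collapses above may not actually occur:

```python
# pip install nltk
from nltk.stem import PorterStemmer, SnowballStemmer

words = ["bad", "badly", "big", "bigly",
         "constitutional", "unconstitutional", "Trump", "Trumpster"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Print each word next to its stems so we can see which distinctions survive.
# These stemmers only strip suffixes, so negating prefixes such as "un-" are
# preserved; whether "badly" collapses to the same stem as "bad" depends on
# the particular stemmer's suffix rules.
for w in words:
    print(f"{w:18s} porter={porter.stem(w):15s} snowball={snowball.stem(w)}")
```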
Convenience sampling. Since the most powerful feature of this method is that the labeled set can be unrepresentative of the population set as long as the P(S|D) condition is satisfied, I wonder if the following procedure is justified: gather immediately available texts (convenience) -> make that the labeled set -> "inconveniently" gather a sample of texts from the population set -> use some tests (which ones?) to determine whether P(S|D) is approximately the same as in the convenience labeled set -> ...
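One simple check I can imagine (my own sketch, not something the paper prescribes) is to hand-code the small random sample from the population as well and then, within each category D, compare word-stem presence rates between the two labeled sets, for example with a per-stem two-proportion chi-square test:

```python
# A rough check of whether P(S|D) looks similar in the convenience labeled set
# and in a small hand-coded random sample from the population.
# X_conv / X_rand: binary document-term matrices (docs x word stems),
# y_conv / y_rand: hand-coded category labels. All names are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

def compare_PS_given_D(X_conv, y_conv, X_rand, y_rand, category):
    A = X_conv[y_conv == category]
    B = X_rand[y_rand == category]
    flagged = []
    for j in range(A.shape[1]):
        table = np.array([[A[:, j].sum(), len(A) - A[:, j].sum()],
                          [B[:, j].sum(), len(B) - B[:, j].sum()]])
        if table.min() == 0:          # skip stems too rare to test
            continue
        _, p, _, _ = chi2_contingency(table)
        if p < 0.01:                  # stem usage differs noticeably within this category
            flagged.append(j)
    return flagged
```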
This paper provides a quite inspiring approach to text analysis for social science topics. Here are my two questions:
My first question concerns the classification. I wonder about the situation where one document could be assigned more than one label. It is quite common in the social sciences for texts to be hard to classify into a single category with much certainty, so I wonder whether the method introduced here can avoid this issue. One approach is to use more classifiers, but that may cause overfitting; another, probably more common, approach is to give multiple labels to one document, but that may be hard to do in a supervised method. How could this be tackled?
My second concern is about the conclusion that "coding more than about 500 documents to estimate a specific quantity of interest is probably not necessary." Would it be better to express the optimal number of coded documents as a ratio rather than an absolute number, so as to take the whole target population into account? 500 documents may be sufficient for a target population of a certain size, but intuitively the number should vary with the size of the target population. Is this actually an issue? Such a conclusion may determine how many documents need to be hand coded when applying this method.
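For what it's worth, standard sampling theory suggests that the labeled-set size n, not the population size N, drives the estimation error whenever N is much larger than n. A toy simulation (made-up numbers, not from the paper) illustrates this:

```python
# Does the error of a proportion estimate from n = 500 hand-coded documents
# depend on the population size? Toy simulation with made-up numbers.
import numpy as np

rng = np.random.default_rng(0)
true_p, n, reps = 0.3, 500, 20_000

for N in [5_000, 50_000, 5_000_000]:
    K = int(true_p * N)                       # documents truly in the category
    # Sampling n documents without replacement and counting hits is a
    # hypergeometric draw, so we can simulate the estimator directly.
    estimates = rng.hypergeometric(K, N - K, n, size=reps) / n
    print(f"N={N:>9,}  std. error of estimate ~ {estimates.std():.4f}")
# The standard error barely changes with N, so the "about 500 documents"
# guideline need not scale with the population size.
```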
To address the computational difficulty of using $2^K$ word-stem variables (where $K$ is the total number of word stems across documents) and the sparseness problem, the authors
randomly chose subsets of between approximately 5 and 25 words. (p. 237)
Although this is based on Professor King's previous paper (King and Lu, 2008), I feel like the model could perform better if we used the word stems with the most predictive power instead of choosing words at random. Or, going further, maybe we could use dimension-reduction techniques (e.g., PLS or PCA) to decrease the computational expense while retaining as much explanatory power as possible. What is the virtue of using random words, instead of carefully selected words (or dimension-reduced, less interpretable variables), in this approach? Are there any drawbacks to using non-random words? That would be counterintuitive to me, since I believe at least 50% of "machine learning" is actually just feature engineering.
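For concreteness, here is a minimal sketch of the kind of random-subset averaging I understand the paper to describe (my own reconstruction in Python, not the authors' code; I believe their R package ReadMe is the reference implementation): for each random subset of word stems, tabulate P(S) in the unlabeled population and P(S|D) in the hand-coded set, solve P(S) = P(S|D) P(D) for P(D) by nonnegative least squares, and average across subsets.

```python
# Minimal sketch of random-word-subset estimation of category proportions.
# X_lab, X_pop: binary (0/1) document-term matrices; y_lab: hand-coded labels.
# This is an illustrative reconstruction, not the authors' implementation.
import numpy as np
from scipy.optimize import nnls

def profile_freqs(X, words):
    """Frequency of each of the 2^K word-stem profiles over the chosen words."""
    K = len(words)
    codes = X[:, words].dot(1 << np.arange(K))   # encode each profile as an int
    counts = np.bincount(codes, minlength=2 ** K).astype(float)
    return counts / counts.sum()

def estimate_PD(X_lab, y_lab, X_pop, n_subsets=100, subset_size=10, seed=0):
    rng = np.random.default_rng(seed)
    cats = np.unique(y_lab)
    estimates = []
    for _ in range(n_subsets):
        words = rng.choice(X_lab.shape[1], size=subset_size, replace=False)
        PS = profile_freqs(X_pop, words)                      # P(S) in the population
        PSD = np.column_stack([profile_freqs(X_lab[y_lab == c], words)
                               for c in cats])                # P(S|D) from hand-coded set
        beta, _ = nnls(PSD, PS)                               # solve P(S) = P(S|D) P(D)
        if beta.sum() > 0:
            estimates.append(beta / beta.sum())               # renormalize to proportions
    return cats, np.mean(estimates, axis=0)
```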
This paper employs a smart design to construct a more general and robust text classifier. As mentioned by @wanitchayap and @ihsiehchi, stemming may yield a less sensitive classifier or twist the original meaning of the texts. I am thinking of using "n-grams" to alleviate this problem. To be specific, if two n-grams contain the same stem yet both appear with similar frequency, that may suggest they reflect different meanings or attitudes. I wonder whether it might be useful to first examine the n-grams and then do the stemming (reversing the order of steps two and three on page 232). It is also mentioned that this approach is more computationally difficult given the large number of word stems in the documents. When we use stemming or PCA/PLS (mentioned by @nwrim) to decrease computational cost, I think we face a bias-efficiency tradeoff. Could we construct an objective function, combining a bias measure and an efficiency measure, and maximize/minimize it to determine the degree to which we reduce the dimensions?
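As a concrete version of the "n-grams before stemming" idea, here is a small sketch of my own (using scikit-learn and NLTK, which the paper does not use) that builds unigrams and bigrams on the raw tokens first and only then stems each token inside the n-gram, so that pairs like "not constitutional" survive as features:

```python
# pip install scikit-learn nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_after_ngrams(doc):
    """Tokenize, build unigrams and bigrams on the raw tokens, then stem
    each token inside the n-gram (i.e., n-grams before stemming)."""
    tokens = doc.lower().split()
    unigrams = [stemmer.stem(t) for t in tokens]
    bigrams = [f"{stemmer.stem(a)}_{stemmer.stem(b)}"
               for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams

docs = ["the ruling was not constitutional",
        "the ruling was clearly constitutional"]

vectorizer = CountVectorizer(analyzer=stem_after_ngrams, binary=True)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# Bigram features such as "not_constitut" keep context that the bare stem loses.
```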
My question adds to @timqzhang's first point, which concerns the multi-membership problem. The method proposed by Hopkins and King (2010) places every text into exactly one category; however, it is very likely that one document covers two or more topics. For example, the paper "Islamophobia and Media Portrayals of Muslim Women: A Computational Text Analysis of US News Coverage" from a previous MACSS workshop examined both racial and gender minorities. As far as I know, unsupervised machine learning algorithms such as topic models address this issue well, but I wonder how a supervised model can fit such text data. Maybe by creating more target categories: say, in the above example, instead of creating two labels, "gender" and "race", we could specify a single combined label, namely "gender and race"?
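To make the two options concrete, here is a generic supervised-learning sketch of my own (scikit-learn, outside the Hopkins and King framework): (a) cross the labels into a single "gender+race" category, or (b) treat it as multi-label classification with one binary indicator per category.

```python
# pip install scikit-learn
# Two ways to handle documents about both gender and race: (a) cross the
# labels into one category, or (b) multi-label classification.
# Generic illustration, not the Hopkins-King estimator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import make_pipeline

docs = ["coverage of muslim women in us media",           # gender and race
        "wage gap between men and women",                  # gender
        "racial disparities in police stops"]              # race
labels = [{"gender", "race"}, {"gender"}, {"race"}]

# (a) crossed single label: "gender+race" becomes its own category
crossed = ["+".join(sorted(s)) for s in labels]

# (b) multi-label: one binary indicator per category, one-vs-rest classifier
Y = MultiLabelBinarizer().fit_transform(labels)
clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(docs, Y)
print(crossed)
print(clf.predict(["new study of women and policing"]))
```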
I have a pretty basic question about the approach described here, which has more to do with their classification than the actual estimates obtained. In the running example, the authors say they classified blog posts as being about Bush, Clinton, Kerry, etc. (using a set of keywords and filters) and then proceeded to classify whether the sentiment was positive, negative, and so on. However, isn't it pretty likely that a blog post talks about both Bush AND Clinton, or Bush and Kerry, or whichever political figures are relevant? And I don't believe they did part-of-speech tagging to determine whether the negative sentiment was directed at one individual or another. How, then, did they identify the sentiment associated with a given individual?
I have a similar question to @linghui-wu's. It is pretty common for social science research papers to fall into two or more categories, yet it seems that Hopkins and King sort every text into only one category. I am wondering whether there is any supervised or unsupervised machine learning technique that can tackle this problem?
This article provides an explicit example of how to carry out content classification in practice, and it is a great article for learning the theory.
My question is related to the appendix, where the authors show how to test intercoder reliability. I understand the overall idea, mainly from the intuition part, but I am a little confused about the SIMEX analysis and its graphical presentation (Figure 6). First, what is the meaning of alpha being -1 and 0? I can only roughly follow the explanation in the article. Also, why is the proportion of observations negatively related to alpha?
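For what it's worth, in a generic SIMEX analysis the horizontal axis indexes how much extra measurement error is added: 0 corresponds to the data as observed (with its existing coding error), positive values add progressively more simulated error, and the fitted trend is extrapolated back to -1, the hypothetical error-free case. I assume the paper's alpha plays this role; here is a toy sketch of the extrapolation step:

```python
# Generic SIMEX-style extrapolation (toy numbers, not the paper's data):
# refit a quantity of interest after adding extra misclassification noise at
# levels lam >= 0, then extrapolate the trend back to lam = -1 (no error).
import numpy as np

lam = np.array([0.0, 0.5, 1.0, 1.5, 2.0])            # added-error multipliers
estimate = np.array([0.42, 0.39, 0.37, 0.35, 0.34])  # made-up estimates at each level

coefs = np.polyfit(lam, estimate, deg=2)              # quadratic extrapolant
error_free = np.polyval(coefs, -1.0)                  # extrapolate to lam = -1
print(round(error_free, 3))
```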
My question is related to the hand-coding part. Aside from the problem of intercoder reliability, I wonder what an effective way is to select a sufficiently representative sample for hand-coding. If it is a subjective process, would it add bias to the classification models?
My question is: if P(S|D) is not the same in the population and labeled sets, how would you adjust for better reliability?
I don't really have a question about the reading itself, but I wonder if there are other hacks out there, similar to this one, that utilize a known method for a different result? Would any of these be beneficial for our work?
I also had a few doubts about the precision of hand-coding. According to the paper, we are estimating the proportions of document types rather than categorizing each individual document. Does this mean we can use a smaller number of training documents and still get the same level of accuracy for research?
The adjusted method for estimating population category proportions relies heavily on the assumption that $P^{h}(S|D) = P(S|D)$. However, apart from the Monte Carlo simulations, we do not actually have ways to validate this assumption, and I personally think the condition is no less stringent than the previous ones. Are there works that use already-categorized corpora to validate this method?
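One way to run such a check ourselves (a sketch of the idea only; it reuses the hypothetical estimate_PD function and document-term matrices from the sketch a few comments above) is to take a corpus that is already fully labeled, hide the labels for most of it, and compare the estimated category proportions against the known truth:

```python
# Validate on an already-categorized corpus: hold out most labels, estimate
# P(D) from the small "hand-coded" part, and compare to the known proportions.
# Assumes the estimate_PD sketch above; X (binary doc-term matrix) and y
# (true labels) are hypothetical names for the fully labeled corpus.
import numpy as np

rng = np.random.default_rng(1)
n = X.shape[0]
labeled_idx = rng.choice(n, size=500, replace=False)
pop_idx = np.setdiff1d(np.arange(n), labeled_idx)

cats, est = estimate_PD(X[labeled_idx], y[labeled_idx], X[pop_idx])
truth = np.array([(y[pop_idx] == c).mean() for c in cats])
print(np.abs(est - truth))              # per-category estimation error
```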
This is an excellent paper that I think will prove very useful for my final project. One question I have is whether it allows the researcher to assign multiple labels to a particular document. For instance, imagine we want to understand the proportion of tweets in the COVID conversation that call it a conspiracy versus those that call for street protests. I can imagine several tweets containing both of these messages within the same post. What would be the best way to label such instances? Would we need to create a new category for such tweets?
This was an interesting demonstration of how social scientists' objectives for a supervised learning project might differ from those of other fields. I have two comments:
Discovering too few examples for one or more categories can be dealt with in several ways. Most commonly, one can alter the definition of the categories or can change the coding rules.
Could an unsupervised approach help in identifying the definitions of categories?
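One unsupervised option (a generic scikit-learn sketch of my own, not part of the Hopkins and King method) is to fit a topic model on the unlabeled corpus and use each topic's top words as raw material for drafting or revising the category definitions; the toy documents below are placeholders:

```python
# pip install scikit-learn
# Fit a small topic model and print the top words per topic as a starting
# point for defining or refining hand-coding categories.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["bush iraq war policy", "kerry campaign iraq debate",
        "clinton health care plan", "bush tax cuts economy",
        "kerry vietnam service", "clinton senate campaign"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {' '.join(top)}")
```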
My question is whether there exists a parametric method for conducting this kind of content analysis.
I am wondering whether this method is flexible enough to take text metadata into account.
My question is why Bayes' theorem cannot be used to obtain P(D|S) from P(S|D) on page 234. And is this method guiding us to focus on P(D) (the overall proportions) rather than P(D|S) (the conditional proportions)?
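As I understand it (my reading, not a quote from the paper), the Bayes inversion itself requires the unknown target quantity $P(D)$, so the paper instead starts from the accounting identity and solves it directly for the category proportions:

$$
P(D \mid S) \;=\; \frac{P(S \mid D)\,P(D)}{P(S)} \quad\text{requires } P(D),\ \text{the very quantity we want to estimate,}
$$

$$
\text{whereas}\qquad P(S) \;=\; \sum_{D} P(S \mid D)\,P(D)
\quad\Longrightarrow\quad
\widehat{P(D)} \;=\; \arg\min_{P(D)\ge 0,\ \sum_D P(D)=1}\ \bigl\lVert P(S) - P(S \mid D)\,P(D) \bigr\rVert^{2},
$$

with $P(S \mid D)$ estimated from the hand-coded set and $P(S)$ from the population. So, as I read it, the answer to the second question is yes: only the aggregate proportions $P(D)$ are targeted, and the individual-level $P(D \mid S)$ is never needed.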
Post questions about the following orienting reading:
Hopkins, Daniel J., and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." *American Journal of Political Science* 54(1): 229–247.