UChicago-CCA-2021 / Readings-Responses


Sampling, Crowd-Sourcing & Reliability - Orientation #8

Open HyunkuKwon opened 3 years ago

HyunkuKwon commented 3 years ago

Post questions about the orienting reading and how to sample content for your projects:

Krippendorff, Klaus. 2004. "Sampling." In Content Analysis: An Introduction to Its Methodology, 111-124. Thousand Oaks, CA: Sage.

Raychanan commented 3 years ago

According to the reading, when using convenience sampling, analysts "do not care to make an effort" to sample from that population. I'm very surprised by this. So compared with random sampling, is convenience sampling actually worse?

sabinahartnett commented 3 years ago

The described sampling methods rest on the assumption that the researcher can somehow know what the entire relevant corpus/census would look like and can therefore choose a method in an informed way. This seems counter to my experience with content analysis and sampling: often there is some awareness of what the full corpus would hold and of some of the possible biases therein, but much of the understanding of the corpus/population comes during exploration of the possible corpus.

Is there a standard starting method for understanding the available corpus so as to inform the choice of sampling method? (And do these theoretical methods line up with actual research practice, i.e., how often are researchers informed enough at the outset to choose the most strategic sampling method?)

jacyanthis commented 3 years ago

In statistics we usually assume our sample is "large enough" to invoke some form of the Central Limit Theorem and/or the Law of Large Numbers. The cutoff can be as low as n = 30 in basic statistics classes, and Section 6.3 of Krippendorff (2004) gives some general guidance for text sample size, but Table 6.1 gives a huge range of cutoffs, from 7 to 690,767.

So what are some ballpark figures for what's "large enough" for the sort of analysis we're doing in this class?
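For intuition on why such tables span so huge a range: if the criterion is that a unit with population probability p should appear in the sample at least once with 95% confidence, the required n grows roughly like 1/p, so rare units drive the cutoff up fast. A minimal sketch of that logic (my reading of the idea behind such cutoffs, not necessarily Krippendorff's exact procedure):

```python
import math

def min_sample_size(p, confidence=0.95):
    """Smallest n such that a unit occurring with probability p
    appears at least once with the given confidence:
    solves 1 - (1 - p)**n >= confidence for n."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

for p in [0.5, 0.1, 0.01, 0.001]:
    print(f"p = {p}: n >= {min_sample_size(p)}")  # 5, 29, 299, 2995
```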

Willy624 commented 3 years ago

My question is: how does content analysis deal with the iid assumption in standard statistical theory? Or is there some other branch of statistics devoted to quantitative investigation in content analysis?

For example, within these sampling techniques it seems the researchers should have a clear idea of what they are searching for and sample accordingly. In that case, we can be fairly sure the sampled texts are not iid. As another example, Table 6.1 derives the sample size needed for rare units to appear at a given significance level, using a binomial distribution. The 95% level was presumably computed via the binomial distribution's convergence to the normal, yet that holds only under the independence assumption, which I find hard to believe in any textual setting.
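A toy simulation of this concern (not from the chapter; the clustering structure is invented for illustration): if texts come in correlated clusters rather than independently, the standard error of an estimated proportion is much larger than the binomial formula assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, cluster_size, n_sims, p = 1000, 10, 2000, 0.3

iid_means, clustered_means = [], []
for _ in range(n_sims):
    iid_means.append((rng.random(n) < p).mean())       # independent texts
    block = rng.random(n // cluster_size) < p          # one draw per cluster
    clustered_means.append(np.repeat(block, cluster_size).mean())

print("binomial SE  :", (p * (1 - p) / n) ** 0.5)      # ~0.0145
print("iid SE       :", np.std(iid_means))             # close to binomial
print("clustered SE :", np.std(clustered_means))       # ~sqrt(10) larger
```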

jinfei1125 commented 3 years ago

This chapter introduces four main differences between traditional sampling theory and content analysis sampling techniques. The last one is that content analysis, unlike traditional research, has two populations: the population of answers to a research question and the population of texts that contain or lead to the answers to that question. I think this is important for studying associations or causal effects.

However, I wonder if the second population is sometimes hard to find. This week's CSS workshop's MeToo paper could be an example: the first population is tweets that disclose sexual violence, and the second population is the exposure to those disclosures. Could you give us some other examples of two-population studies? For example, how would we find the second population for last week's Movies or White House Releases corpora?

romanticmonkey commented 3 years ago

How do contemporary Twitter studies address the problem of uneven user populations? E.g., on a specific topic, users with certain attributes tend to speak more while other users are less represented. Some Twitter accounts can be grouped by geo-location and user profile, but others can't. How do we safely use these unidentified accounts without incurring much bias?

toecn commented 3 years ago

What does this corpus represent?! Or, perhaps better, who does it represent?! Take, for instance, a traditional sociological category: class. It is plausible that existing corpora tend to skew towards higher-income, higher-educated sectors of the population, in terms of both consumption and production. How should we think about these limitations for inference? If our project is interested in variation across classes, what are some ideas for using the existing text in a way that is representative of different classes and allows us to compare them? How should/could we think, for instance, about the purpose of the text (e.g., newspapers vs. contracts vs. books) or its audiences?

RobertoBarrosoLuque commented 3 years ago

Is it possible that when choosing a specific sampling technique (random, stratified, varying probability, etc.) a researcher will automatically create preconceptions about the results she/he expects? I understand that choosing a sampling technique is a critical step in research and should depend on the assumptions you make about the data you are studying, but I wonder if there is something like a "double-blind" experimental design in content analysis?

william-wei-zhu commented 3 years ago

What are some evaluation criteria for deciding which sampling method to adopt for a content analysis project? Can a research project use multiple sampling methods at the same time?

Bin-ary-Li commented 3 years ago

It seems that content analysis for the social sciences sits on the fine line between the inferential, hypothesis-testing statistical framework and prediction-based big-data analysis. I think this is advantageous but can also be a dangerous pitfall. Content analysts might take it as an excuse to be sloppy about following best practices in statistics (sampling, for example), and it may also let them off the hook for building non-predictive models.

egemenpamukcu commented 3 years ago

Are there cases when sampling would make sense even when we have access to the full corpus? Is it still necessary to systematically shrink our sample when we can train/test models using the population text?

theoevans1 commented 3 years ago

Krippendorff notes that content analysis is typically not interested in the textual universe itself, but rather in a research question to which certain texts are relevant. It seems to me that the gap between those two things is larger in some contexts than others; for instance, social media data could be used to answer a question about social media communities, or a question about offline social life. What differences in approach would be necessary between those two kinds of questions? What considerations should be taken into account when assessing a text's relevance to a question about a social world larger than the text being analyzed?

hesongrun commented 3 years ago

I think sampling methods can introduce subjective bias into a study. Different sampling methods have their own 'hyperparameters' that researchers can p-hack for their own study. For example, consider snowball sampling: different seeds or starting points may result in vastly different samples. How do we examine the stability of sampling? What do you think are parsimonious ways of applying sampling methods in content analysis? Thanks!
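One simple stability check, sketched below on an invented citation network: run the snowball from several seeds and compare the overlap of the resulting samples (e.g., Jaccard similarity). Low overlap would suggest the sample is seed-sensitive.

```python
import random

def snowball(graph, seed, waves=2):
    """Expand from `seed` for `waves` steps; return every node reached."""
    frontier, sample = {seed}, {seed}
    for _ in range(waves):
        frontier = {nb for node in frontier for nb in graph[node]} - sample
        sample |= frontier
    return sample

def jaccard(a, b):
    return len(a & b) / len(a | b)

random.seed(42)
# invented citation network: 100 papers, each citing 3 random others
graph = {i: random.sample(range(100), 3) for i in range(100)}

samples = {s: snowball(graph, s) for s in (0, 25, 50)}
for a, b in [(0, 25), (0, 50), (25, 50)]:
    print(f"seeds {a} vs {b}: Jaccard = {jaccard(samples[a], samples[b]):.2f}")
```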

xxicheng commented 3 years ago

I am interested in the topics of inequality and social mobility. Could you please give more examples of survey sampling in this field?

k-partha commented 3 years ago

A fairly general but important question: how do we balance open-ended exploratory analysis against the risk of finding "false positive" signals that emerge simply by chance? This seems an especially important problem for computational content analysts, who sift through enormous swathes of content and perform extensive sampling.

MOTOKU666 commented 3 years ago

Sampling is really important in statistical studies. I'm especially interested in stratified sampling. How do we evaluate whether a stratum is a proper one? I know that interaction effects or confounding may arise across strata, but we test for them after the research, not before.
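For reference, a minimal sketch of proportional stratified sampling (the outlet strata and the corpus are invented for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(docs, stratum_of, n, seed=0):
    """Proportional allocation: each stratum contributes draws in
    proportion to its share of the corpus."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[stratum_of(doc)].append(doc)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(docs))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# invented corpus: (outlet, text) pairs, stratified by outlet
docs = [("national", f"story {i}") for i in range(800)] + \
       [("local", f"story {i}") for i in range(200)]
sample = stratified_sample(docs, stratum_of=lambda d: d[0], n=100)
print(len(sample))  # 100 draws: 80 national + 20 local
```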

zshibing1 commented 3 years ago

In demographic research, for example, sampling and weighting usually go hand in hand to make the sample representative of the whole population. How, then, do we construct weights for each of the content sampling methods introduced in the reading, if we happen to know what bias needs to be corrected?
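When the population shares of the strata are known, one standard correction is post-stratification weighting; a minimal sketch with invented shares:

```python
def poststratification_weights(sample_shares, population_shares):
    """Weight each stratum by (population share / sample share) so that
    weighted sample proportions match the known population."""
    return {s: population_shares[s] / sample_shares[s]
            for s in population_shares}

# invented shares: the sample over-represents national outlets
weights = poststratification_weights(
    sample_shares={"national": 0.7, "local": 0.3},
    population_shares={"national": 0.4, "local": 0.6},
)
print(weights)  # {'national': 0.571..., 'local': 2.0}
```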

mingtao-gao commented 3 years ago

My question is about the snowball sampling method discussed in the chapter. According to the author, snowball sampling should end when it reaches natural boundaries; the example given is the complete literature on a subject. In reality, though, can we ever reach a "complete" collection of the literature? Even when we search online, we cannot find all papers, due to copyright issues. Would such a case cause bias in the sampling?

ming-cui commented 3 years ago

I have a practical question. Do editors and reviewers care about the representativeness of text samples? As an example, Michel et al. (2011) selected a subset of over 5 million books out of 15 million for analysis. This looks like a convenience sample to me. Would this be a concern for academic publishing?

chuqingzhao commented 3 years ago

I am interested in snowball sampling. I am wondering how to make inferences about the population if we have no idea how large the network is. Another concern: snowball sampling works in certain favorable situations, like scientific citation, but in sensitive situations like drug use, we can hardly get from one node to another. How do we deal with latent nodes? How can we use computational methods to estimate the relevant but latent nodes?

jcvotava commented 3 years ago

My question concerns the concept of "convenience" sampling. With the exception of either very niche or very, very complete corpora, can any corpus ever really be said to be the "entire population" of text underlying a certain culture or language game? I.e., if, as in the example the text gives, a diary can be considered a "convenience" sample that risks hiding information from the analyst, couldn't the same be said of nearly any text?

Rui-echo-Pan commented 3 years ago

I have a practical question related to sampling. When we conduct research on online data/platforms, a general problem is that many people don't have access to a given platform, so we can't get data from that part of the population. How could we compensate for, or at least justify, this drawback?

dtmlinh commented 3 years ago

Are there any practical tools/analyses that content analysts can use to do sanity checks on whether or not the sample is representative and relatively unbiased?
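One simple check, assuming the population distribution of some category is known (all numbers below are invented): compare the sample's category counts with the expected counts via a chi-square goodness-of-fit test.

```python
from scipy.stats import chisquare

# invented numbers: genre counts in the sample vs. known population shares
sample_counts = [120, 45, 35]                # news, opinion, features
population_shares = [0.55, 0.25, 0.20]
expected = [sum(sample_counts) * s for s in population_shares]

stat, pval = chisquare(sample_counts, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {pval:.3f}")  # a small p flags a mismatch
```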

lilygrier commented 3 years ago

While sample size and sampling technique are crucial to ensuring a study is generalizable, how often do researchers actually take care to experiment with different sampling techniques to find an optimal choice? In most studies I read, the authors describe characteristics of the sample and often attempt to justify that it is representative for the purposes of the study. It does seem crucial to have a representative sample and to verify this with methods such as the split-half technique, but is this common practice in textual analysis?
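The split-half idea is cheap to try; a minimal sketch (the corpus and the keyword analysis are invented): split the sample randomly in two, run the same analysis on each half, and see whether the conclusions agree.

```python
import random

def split_half(docs, analyze, seed=0):
    """Run the same analysis on two random halves; if the two results
    agree, the sample is at least internally stable."""
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return analyze(shuffled[:mid]), analyze(shuffled[mid:])

# invented corpus and analysis: share of documents mentioning "budget"
docs = ["budget cuts", "election news", "budget vote", "sports recap"] * 50
share = lambda ds: sum("budget" in d for d in ds) / len(ds)
print(split_half(docs, share))  # the two halves should be close
```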

joshuabsilver commented 3 years ago

This chapter was written in the early 2000s, and many new techniques for constructing ever-larger corpora have emerged since. Are there common sampling problems today that are not covered by this chapter, or have sampling issues remained the same in principle?