UChicago-Computational-Content-Analysis / Readings-Responses-2023


1. Measuring Meaning & Sampling - fundamental #53

Open JunsolKim opened 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's fundamental readings: Grimmer, Justin, Molly Roberts, and Brandon Stewart. 2022. Text as Data. Princeton University Press. Chapters 1 (pp. 1-7), 2, 3, and 4: “Introduction” through p. 7, “Social Science Research and Text Analysis”, “Principles of Selection and Representation”, and “Selecting Documents”.

Thiyaghessan commented 2 years ago

Hi all,

In chapter 2, the authors propose splitting data into training and testing sets to support the iterative approach to social scientific discovery that they advocate. They then address detractors who argue that such an approach will not work for rare events by saying that analogous interventions will always exist (they provide the NFL example). They do, however, acknowledge that for certain rare events that happen only once (the use of nuclear weapons, for example, off the top of my head), you can only split the data once. However, they do not provide further details on what the researcher should do in those situations. How can we maximise the data we have to iterate over for rare events without compromising our ability to test our hypotheses? Appreciate any and all help.

isaduan commented 2 years ago

Could someone give some more concrete examples of 'validation' and specific methods? Can't get my head around different steps & concepts.

pranathiiyer commented 2 years ago

Hey everyone, adding to Thiya's and Isabella's points, I had two questions.

  1. The book also mentions that a reduction in statistical power is an argument against splitting the data, and goes on to elaborate that this decrease does not outweigh the risks of not splitting the sample. I was wondering whether a decrease in power is guaranteed if the sample is split, whether it would be a function of how much data is actually held out, and whether there are any methods that adjust for the reduction (see the sketch after this list).
  2. Chapter 4 talks about the different kinds of biases that could emerge in text analysis. Given that sample selection bias tends to be a problem especially for social media data such as tweets, won't there always be questions of generalisability with these research questions? Moreover, how could one account for populations not represented on social media platforms, especially when it is not easy to identify the demographics or other characteristics of the users themselves?
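
On the first question, here is a minimal sketch (my own illustration, not from the book) of why holding data out generally costs power: for a fixed effect size and significance level, halving the per-group sample size lowers the probability of detecting the effect. The effect size and sample sizes below are invented for illustration.

```python
# Hypothetical power comparison using statsmodels' power solver for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3   # hypothetical standardized effect (Cohen's d)
alpha = 0.05        # conventional significance level

# Power with the full sample vs. power after holding out half for later testing.
full_power = analysis.power(effect_size=effect_size, nobs1=500, alpha=alpha)
split_power = analysis.power(effect_size=effect_size, nobs1=250, alpha=alpha)
print(f"power with 500 observations per group: {full_power:.2f}")
print(f"power with 250 observations per group: {split_power:.2f}")
```
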
linhui1020 commented 2 years ago

Hi everyone,

The book discusses how scholars can appropriately design research using available digital contexts and the potential sources of bias in the data. My interest is in the incentive bias part. As the authors indicate, incentive bias arises when a specific treatment (an initiative, a policy) changes people's behavior, so researchers have to consider whether the text reflects the truth. Grimmer et al. (2022) raise this concern but do not offer concrete suggestions. For example, how do we identify such bias at different stages of research? How do we ensure the completeness of the texts that we obtain? They mention that Cheryl Schonhardt-Bailey's research adds an interview component that tries to reveal the authenticity of the texts. I really appreciate this design, which gives us more valuable information and pushes us to step further and dig deeper. But since interviews are time-consuming and might require researchers' personal connections, is there an alternative way for us to validate authors' incentives?

ValAlvernUChic commented 2 years ago

Hi all!

Jumping off Pranathi's second question, I was wondering how social research (at least computationally) can be effectively and reliably conducted for groups in the absence of substantial textual contributions from them. It seems like the research questions we can ask and answer will ultimately be limited to the data that is available or that we can reasonably collect, but this means we would effectively be leaving certain populations (I'm thinking of vulnerable, underprivileged populations) out of academic inquiry.

Thanks all!

GabeNicholson commented 2 years ago

@isaduan

> Could someone give some more concrete examples of 'validation' and specific methods? Can't get my head around different steps & concepts.

The author briefly mentions it, but when using the data to create a theory (an inductive approach), what is happening is something called "multiple comparisons" in statistics. This means that any hypothesis created after looking at the data must have its p-value adjusted, because we have essentially "cheated" by looking at the data first. So the validation MUST come from a different but similar data source (think of any study that tries to replicate a previous finding). Imagine the original study estimated political sentiment from tweets on Twitter. Then our "validation" would be testing this political sentiment model on a completely different set of tweets, so that the theory can be refuted or supported. In machine learning, a common approach is k-fold cross-validation, together with the simpler idea of a held-out test set: when you initially collect the data, you hide a certain percentage of it to be used only for future testing.
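
Here is a minimal sketch of both ideas, assuming scikit-learn; the tiny tweet corpus and sentiment labels are invented placeholders, not anyone's actual data or code.

```python
# (1) Hold out a test set you never touch while developing the model.
# (2) Run k-fold cross-validation on the remaining training portion.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

tweets = ["great debate tonight", "terrible policy rollout",
          "love this candidate", "worst speech ever",
          "such an inspiring rally", "awful town hall",
          "proud of this senator", "disappointing press conference"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical positive/negative sentiment codes

# Hide 25% of the data; it is used exactly once, for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.25, random_state=42, stratify=labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# k-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=3)
print("cross-validation accuracy:", cv_scores.mean())

# Fit on all training data, then evaluate once on the untouched test set.
model.fit(X_train, y_train)
print("held-out test accuracy:", model.score(X_test, y_test))
```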

Also, in the case of validating supervised label creation with hand-coded labels, the author mentions keeping "humans in the loop." An example of this can be seen in the book Bit by Bit, with the Galaxy Zoo project, where people classified galaxies: user-submitted labels were only kept if a majority of people classified the galaxy the same way, and experts double-checked the more ambiguous cases.

mikepackard415 commented 2 years ago

My question has to do with content analysis research being iterative and cumulative, and whether it is presented as such in most papers in this field. My sense is that while most scientists may practice this iterative technique, they tend to present their work as purely deductive, flowing from question to concepts to data to results. It looks cleaner and probably reads a bit easier that way, but the downside is we don't get to learn all the lessons they picked up in that meandering process. I guess I'm wondering whether you (Prof. Evans and classmates) think this is changing at all, especially in this field of content analysis.

facundosuenzo commented 2 years ago

So far, I have found the book very interesting (mainly because of the lack of excessive technicalities, which makes it more appealing to less specialized audiences). Regarding chapters 3 and 4, I was wondering about the limits and complexities implied by a communicative act that travels across different types of texts and/or documents. I don't think I have a particular question, but I was imagining, for instance, something that "starts" on social media, is then transformed or incorporated into a specific news outlet article, and "finishes" on Wikipedia. Is it possible/recommended to trace a research question through a multiplicity of documents/texts that eventually differ in nature (text/image/video)?

sizhenf commented 2 years ago

This book offers a comprehensive introduction to the design and techniques of applying text-as-data methods in social science research. My question regards the process of selecting documents in particular. Since many texts come from the giant body of text data available through the Internet, researchers often have to come up with methods to coarsely filter their target documents before analysis. One way is to use keywords, as mentioned in chapter 4, but that entails subjective biases depending on the set of keywords chosen to select the texts. The book refers to a few papers that introduce computer-assisted methods to help prevent this bias (e.g., King, Lam, and Roberts, 2017), but it seems to me that these methods still involve a large number of human decisions (a rough sketch of the general idea is below). Would there be a "better/more rigorous/more automated" way to select our target documents?
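
As a reference point, here is a rough sketch of computer-assisted keyword expansion; it is my own toy illustration, not King, Lam, and Roberts' actual algorithm, and the corpus, seed words, and stopword list are invented.

```python
# Start from seed keywords, then surface candidate terms that are over-represented
# in the documents the seeds retrieve, as suggestions for new search keywords.
import re
from collections import Counter

corpus = [
    "the danish ghost haunted the old manor",
    "a ghost story from copenhagen about a restless spirit",
    "mao's secret famine and its political history",
    "parliamentary debates on monetary policy",
    "folk tales of spirits and haunted farmhouses in denmark",
]
seeds = {"ghost", "haunted"}
stopwords = {"the", "a", "and", "of", "on", "in", "its", "from", "about", "old"}

def tokens(doc):
    return [t for t in re.findall(r"[a-z']+", doc.lower()) if t not in stopwords]

matched = [d for d in corpus if seeds & set(tokens(d))]
unmatched = [d for d in corpus if not seeds & set(tokens(d))]

matched_counts = Counter(t for d in matched for t in tokens(d))
unmatched_counts = Counter(t for d in unmatched for t in tokens(d))

# Rank candidate keywords by how much more often they appear in matched documents.
candidates = sorted(
    (t for t in matched_counts if t not in seeds),
    key=lambda t: matched_counts[t] - unmatched_counts[t],
    reverse=True,
)
print(candidates[:5])  # terms like "spirit" or "spirits" suggest new keywords to review
```

The human decisions do not disappear, of course: someone still has to pick the seeds and judge which suggested keywords actually belong to the concept.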

LuZhang0128 commented 2 years ago

One question is: how do we avoid p-hacking in the iterative model proposed by the authors? It makes sense to me that distant reading/text analysis using computer algorithms or statistical models can reveal general trends that were unseen before. However, social science does not have a pre-registration process like that of clinical trials. Meanwhile, although the authors mention that we can split our data into training and validation sets, I don't think splitting data in social science is an easy task. For instance, some data may have internal relationships (e.g. text data with network properties). I wonder if this iterative process will lead to p-hacking or overfitting of the data to some extent? In other words, are researchers reporting trends that are in fact not real but come out significant through multiple testing or pure luck? Also, if splitting the data causes other forms of bias, when should we choose to split the data and when should we not? (A small sketch of correcting for multiple testing is below.)
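
On the multiple-testing worry specifically, a minimal sketch (my own, not from the book) of one standard safeguard: adjust the p-values from many exploratory comparisons before declaring anything significant. The p-values below are invented placeholders.

```python
# Holm correction for a batch of exploratory tests, via statsmodels.
from statsmodels.stats.multitest import multipletests

raw_pvalues = [0.003, 0.021, 0.048, 0.30, 0.62]  # hypothetical exploratory tests
reject, adjusted, _, _ = multipletests(raw_pvalues, alpha=0.05, method="holm")

for p, p_adj, r in zip(raw_pvalues, adjusted, reject):
    print(f"raw p = {p:.3f}, Holm-adjusted p = {p_adj:.3f}, significant: {r}")
```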

Sirius2713 commented 2 years ago

Adding on to linhui1020's question about incentive bias, my question is how to protect privacy during textual analysis. People change or curtail their behaviors because they don't want others to know their genuine thoughts and private information, especially in government agencies. How can researchers protect participants' privacy while selecting appropriate corpora?

hazelchc commented 2 years ago

The authors mention four types of sample selection bias. For incentive bias, I agree that "individuals as well as organizations have incentives to fail to record or to hide or destroy evidence that could cast them in a negative light." Individuals can perform and behave differently online and offline. While powerful tools can help us accurately identify and quantify people's emotions and states as expressed in the text, we may not be able to uncover their genuine states and emotions. How should we eliminate the bias?

Emily-fyeh commented 2 years ago

Last quarter I read the research on Chinese speech censorship by King, Pan, and Roberts (2013), a pioneering study that is re-emphasized at the end of Chapter Two. Their series of work demonstrates discovery, measurement, and causal-inference validation on a single large-scale phenomenon of online speech censorship. I think the reason this topic is important is the lack of transparency of the Chinese government. The online forum data serves as a subject of observation for the purpose of validating the hypothesis of collective-action censorship. If (counterfactually) we could access the internal documents on online censorship from the Chinese authorities, then the research would not need to adopt the content analysis methodology. (Also, the Chinese censorship guidelines can change frequently and drastically, so researchers may need to collect data over a long period to track the transformation of censorship policy.) Personally, I would be more interested in further analysis of how this censorship affects the online ecology in China, such as how people invent new net slang or self-censor, or the different attitudes of different local censorship authorities. Capturing a vague concept within collective expressions makes more sense to me when it comes to utilizing content analysis. If anyone has other instances in mind, I would like to know more cases that use content analysis to answer a question that could in principle be answered directly by some missing piece of evidence.

melody1126 commented 2 years ago

One example related to retrieval bias would be algorithmic confounding (chapter 4.2.4). In the example of looking for Danish ghost stories, if we went to Google Books and searched for stories we thought were Danish ghost stories, we would be at risk of algorithmic confounding in data retrieval and corpus selection. How would we mitigate algorithmic confounding?

konratp commented 2 years ago

The fourth chapter tackles the question of how to select appropriate sources for the population about which inferences are being made. Pointing to Twitter as a source of large amounts of text with which some analyze the general population’s views on politics, the authors warn of sample selection bias occurring. Yet, they also contend that if a corpus is not representative of the population, it can still lead to interesting insights about a given topic. I wonder, what are the standards by which we should select our questions, populations, and quantities of analysis, and do they differ between our class and "real life" academics (who might be more prone to taking "riskier" approaches)? Is there a risk that we shy away from questions that are important to ask because we’re afraid to fall victim to one of the four types of biases outlined in the chapter?

MengChenC commented 2 years ago

I am wondering what the commonalities between textual and image data are. In some scenarios and applications the same models, such as transformers, are interchangeable across both domains and, perhaps unsurprisingly, perform well for both data types. So I am wondering why that is the case. Thanks.

YileC928 commented 2 years ago

Regarding incentive bias mentioned by the authors in Chapter 4: if individuals tend to portray themselves differently from what they truly are, how could we reduce such bias and find out the actual state? For large-scale online text, in particular, it is nearly impossible to talk to each subject and validate their claims and emotions.

NaiyuJ commented 2 years ago

I really appreciate how they explain the differences between computer scientists and social scientists who use ML tools and techniques.

I'm curious about how these kinds of advanced techniques influence how social scientists do research. As the authors indicate, "In this linear view, researchers must somehow a priori know the concepts that structure their variables of interest; then, they use a strategy to measure the prevalence of those concepts; finally, they develop a set of hypotheses and a research design to test the observable implications from their stated theory (King, Keohane, and Verba, 1994)." Is it possible to use machine learning to derive hypotheses, and does anything change when researchers have machine learning tools in mind?

yujing-syj commented 2 years ago

This book offers a more practical, fundamental, and intuitive way to understand content analysis compared with the review we read this week. From this book, I have two questions: 1) Is there any way to deal with words that have multiple meanings? As the author mentions in chapter 4, Hungry Ghosts: Mao's Secret Famine is not a ghost story. This kind of situation happens a lot in real life, especially in oral language. I know this is a long-standing problem; is there any recent method that could alleviate it? 2) After understanding "incentive bias" in chapter 4, I doubt that all the government-related documents from the past are valid, since the governors of hundreds of years ago had the power to control political correctness when paper and books were the media. Even if we ignore the other biases, we still may not recover the truth of the past.

kelseywu99 commented 2 years ago

Chapter 4 goes over biases one may encounter when selecting samples that properly represent a population. In section 3, the authors note that more often than not, text corpora are "found data" released by agencies that dictate what may or may not be made available to researchers. While ethical considerations and measures may be taken in this case, I was curious whether this situation can be countered by any means. Is the research conducted by Gill and Spirling (2015) an example of how to analyze classified "found data" and infer what information the U.S. government classified? What measures need to be taken when data transparency is dictated by the other party, i.e. governmental institutions?

hshi420 commented 2 years ago

In many existing Twitter datasets, the creators only offer tweet IDs due to Twitter's privacy policy. Many of the tweets then can't be retrieved because they have been removed. I was wondering whether this case counts as medium bias or retrieval bias, since the problem occurs during retrieval, but it might not occur if the data had been collected from other kinds of media.

AllisonXiong commented 2 years ago

I found the four types of bias introduced in chapter 4 inspiring. (1) As for resource bias: the availability of text is influenced not only by the resources of certain populations, but also by their internet literacy. This can also be categorized as medium bias. People who rely more on communication that is not recorded (in-person conversation, uncensored phone calls, etc.) will inevitably be underrepresented. In other words, the presumed population of computational content analysis differs from the general population. How can we compensate for such bias? (2) Censorship is a great example of incentive bias, and an obstacle for computational content analysis. Both governments and online platforms actively remove content (Twitter, for instance, has been removing bot-like accounts and fake news articles). So in retrospective research that examines shifts in collective attention, or collective information flow, we may reach systematically biased conclusions due to the missing data. How can researchers mitigate this influence?

zixu12 commented 2 years ago

Online text data, such as the titles/reviews of goods, are important text resources in this digital world, but my concern is that they are not "good" for drawing inferences, in that the data can be inaccurate, misleading, and messy. Do most techniques discussed in the book have to start with "good" data? If so, is there any way we can deal with this problem? Thank you!

chentian418 commented 2 years ago

As the book mentions in chapter 2.7.4, the goal of text analysis methods in the social sciences is to develop a distillation that reduces this complexity to something simpler and more interpretable. Although reducing text dimensionality makes text more compatible with social science concepts, I was curious about how to decide the extent to which we should reduce the dimensionality of text. For example, how do we balance interpretability against the information loss due to dimension reduction? Thank you!

hsinkengling commented 2 years ago

This is more of a technical question: when the authors suggest that researchers split the data (p. 45 in the pdf) into smaller samples at every stage of the project, this seems to mean that you have to predetermine the number of times you get to have "fresh" data (which may limit the number of "iterative loops" (p. 21) you get to have before you form the hypothesis).

Would it be possible, instead of splitting the data into completely separate samples and only working on one at a time, to draw a random sample from the same complete dataset every time you want to test something? Although you'll still get some old data mixed with the new, this might allow many more tests than simply splitting it. (A quick sketch of the difference is below.)
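
A minimal sketch of the two strategies, just to make the contrast concrete; the "documents" are placeholder indices and the sample sizes are arbitrary.

```python
# Pre-committed disjoint splits vs. drawing a fresh random sample each time.
import numpy as np

rng = np.random.default_rng(0)
documents = np.arange(10_000)  # stand-ins for documents

# Strategy 1: fix the number of iterations up front and make non-overlapping splits.
splits = np.array_split(rng.permutation(documents), 4)

# Strategy 2: draw a new random sample every time you want to test an idea.
sample_a = rng.choice(documents, size=2_500, replace=False)
sample_b = rng.choice(documents, size=2_500, replace=False)

print("overlap between two repeated samples:",
      np.intersect1d(sample_a, sample_b).size, "documents")
print("overlap between two disjoint splits:",
      np.intersect1d(splits[0], splits[1]).size, "documents")
```

The trade-off the sketch makes visible is that repeated samples overlap, so later "tests" are no longer on genuinely fresh data, whereas the disjoint splits stay untouched until you open them.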

Hongkai040 commented 2 years ago

I feel like this is a handbook for people who want to use text for their research questions. It provides many valuable principles, though some of them are easy to understand but hard to follow, or don't offer solutions. In Section 3.2, the principle is "There is no values-free construction of a corpus. Selecting which documents to include has ethical ramifications." That is true, but how should we deal with it or minimize the ethical ramifications?

chuqingzhao commented 2 years ago

The book points out four types of sample selection bias. I found medium bias quite inspiring. The authors suggest that "Text outside of social media is similarly influenced by its medium [...] researchers should read and interact with texts in their original context" (p. 83). In cases like texts with emoji, I wonder how researchers should apply computational tools to integrate these additional information cues together? (One possible approach is sketched below.)
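
One small possibility, as my own sketch rather than anything from the book: convert emoji into text tokens so that standard preprocessing keeps them as signal instead of discarding them. This assumes the third-party `emoji` package (`pip install emoji`).

```python
# Turn emoji into readable tokens that a tokenizer will retain.
import emoji

post = "cannot believe the election results 🔥"
print(emoji.demojize(post))  # the flame emoji becomes a token like ":fire:"
```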

ttsujikawa commented 2 years ago

It is really interesting to see advanced approaches that bridge data and insights using text and images as data. The book explicitly and thoroughly explains how social scientists should deal with such data types to reach academic findings. My question relates to the bigger question of whether texts and images actually reflect real-world phenomena. Image data is often selective when we use it for research, and researchers may choose the data they want to see rather than a comprehensive reflection of the target phenomena. Also, textual data can be manipulated, since it depends entirely on individual behaviors.

In such situations, how could we minimize the risk that textual and image data fail to capture real-world phenomena?

weeyanghello commented 2 years ago

It seems to me that when we analyze text computationally, we are assuming a kind of stability in language use/practice, such that the usage of some word X will be the same across all its usages in the same time period. My question, then, is how we can analyze targets like sentiment around social movements across different time periods, such as LGBTQ movements, where the words "gay" and "queer" have drastically changed their meanings due to historical processes of linguistic reclamation?