
1. Sampling to Measure Meaning - fundamental #57

lkcao opened 9 months ago

lkcao commented 9 months ago

Post questions here for this week's fundamental readings:

Grimmer, Justin, Molly Roberts, and Brandon Stewart. 2022. Text as Data. Princeton University Press. Chapters 1 (pp. 1-7), 2, 3, and 4: "Introduction" (through p. 7), "Social Science Research and Text Analysis," "Principles of Selection and Representation," and "Selecting Documents."

sborislo commented 9 months ago

The authors cite the high cost of collecting certain text data as a potential obstacle. Given this high cost (and the minimal incentive to hold back any of the text over the course of one's investigation), how does replication of text analysis methods work in practice? That is, how does one know whether the researchers' conclusions would hold up with a different subset of data?

chanteriam commented 9 months ago

The authors discuss in Chapter 3 four guiding principles for selecting corpus data. In their discussion of the second principle, "No Values-Free Corpus Construction," they stress that "[f]ailure of the corpus to equitably reflect the population can lead to inaccurate conclusions and inaccuracies" (Grimmer et al. 2022:36). Given the breadth of textual data, particularly across the internet, how can we be sure that our data collection is equitable?

bucketteOfIvy commented 9 months ago

In a broader discussion of the ethics of corpus construction, Grimmer, Roberts, and Stewart (2022) note that some "authors use harmful language, such as racial stereotypes or hate speech" (p. 36). They go on to note that such texts can be useful, as in studies where the goal is finding better ways to filter hate speech, or actively damaging, as in a project where the goal is to develop a chatbot (p. 36). While both of these situations are fair and practical, other studies take a third route of actively aiming to understand the usage and dynamics of hate speech or other unsavory content online, as in studies of antisemitism on 4chan or of suicidality in tweets. How do we ethically engage with and communicate findings from unsavory corpora online?

Marugannwg commented 9 months ago

As I go over the concepts and examples, I have a strong feeling that strategies and methods in content analysis are highly situational. No best model, no true value; such ambiguity is both exciting and challenging. It often seems that researchers may not fully understand the potential of their data until deep into the exploration. Asking a curiosity-driven research question can be intimidating because you do NOT have an estimate of how much time and effort you will eventually need to build a suitable corpus to answer it. How do you think researchers, especially students with limited experience and data access, can effectively navigate these uncertainties and make informed decisions about research plans and approaches?

WengShihong commented 9 months ago

In the second chapter, the authors explain the importance of reducing the dimensionality of texts. One of the reasons for making texts low-dimensional is that most concepts in social theory are low-dimensional. For instance, one's position on the ideological spectrum can be conceptualized as liberal, centrist, or conservative. But what if the purpose of the study is to measure a more multi-dimensional concept? For instance, one can be liberal on social issues (LGBTQ rights, abortion, etc.) but conservative on economic policy (healthcare).
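
Nothing in the machinery forces a single dimension, though. Below is a minimal sketch, on a toy corpus of hypothetical documents, of projecting a bag-of-words representation onto two latent dimensions instead of one; interpreting the axes (e.g., as social vs. economic positions) is still the researcher's job:

```python
# Hedged sketch: a document-term matrix reduced to TWO latent dimensions,
# so social and economic positions need not collapse onto one axis.
# Documents and axis interpretations are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "expand marriage equality and abortion access",
    "cut taxes and privatize healthcare markets",
    "support abortion access but cut healthcare spending",
]

X = CountVectorizer().fit_transform(docs)          # bag-of-words counts
Z = TruncatedSVD(n_components=2).fit_transform(X)  # two latent dimensions
print(Z)  # each row: one document's position in the 2-D latent space
```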

yuzhouw313 commented 9 months ago

While Grimmer, Roberts, and Stewart (2022) emphasize the agnostic nature of content analysis techniques and advocate for an iterative approach, they also acknowledge the importance of starting with the simplest text representation as a baseline (p. 37). The subsequent discussion delves into the critical role of validation in supporting the chosen method. However, in this iterative process aimed at providing a comprehensive landscape of a social science issue, a crucial question arises: how do we strike a balance between transparently sharing the exploratory process, including the models that were tried and failed with low validation performance, and maintaining a level of conciseness for readability and professionalism?

Furthermore, could one argue that the most "agnostic" approach for exploratory, customized methods, tailored to the specific research question at hand, is rooted in unsupervised learning? In the context of chapters 2 and 4, can we develop an approach that relies solely on choosing an appropriate corpus and then refines the research question through iterative experimentation with methods, models, and techniques?
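
As one illustration of that unsupervised starting point, here is a minimal sketch (toy corpus, placeholder cluster count) pairing the simplest representation with an exploratory clustering pass before any hypothesis is fixed:

```python
# Hedged sketch of an "agnostic" exploratory baseline: the simplest
# TF-IDF representation plus unsupervised k-means clustering.
# The corpus and the number of clusters are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "senate debates healthcare reform bill",
    "new healthcare bill stalls in senate",
    "team wins championship after overtime",
    "fans celebrate the championship win",
]
X = TfidfVectorizer(stop_words="english").fit_transform(corpus)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # inspect clusters, read the documents, refine, repeat
```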

anzhichen1999 commented 9 months ago

Based on the principle of No Values-Free Corpus Construction, in the process of constructing a text corpus, how can researchers balance the need for a representative and comprehensive dataset with the ethical obligations to protect individual privacy, especially in light of the contextual integrity model proposed by Nissenbaum (2020)? Moreover, what methodologies can be implemented to mitigate the risk of perpetuating biases through the inclusion of harmful or stereotypical language in the training data?

HamsterradYC commented 9 months ago

Considering the inherent complexity of language and text, the goal of text analysis is to simplify this complexity (p. 28). So in blurry situations where emotions or sentiments, like depression, fatigue, and anxiety, are subtly expressed or ambiguous, how can hand coding, computer-assisted extraction, and topic models be used to accurately capture and represent these nuances? Is one approach more effective, or should they be combined? Additionally, how do these approaches compare in addressing the subjective biases that can arise when analyzing and interpreting results?
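
For concreteness, a minimal sketch of the topic-model option on a toy, hypothetical corpus (real work on subtle affect would need far more data, preprocessing, and validation against hand coding):

```python
# Hedged sketch: a small LDA topic model as a first, rough reading of
# themes in affect-laden text. Corpus and topic count are toy choices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "i feel so tired and drained lately",
    "cannot sleep, everything feels heavy and exhausting",
    "great run this morning, feeling energized",
    "excited and full of energy for the new project",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words per topic; a human still has to judge whether the
# topics capture the intended nuance (fatigue vs. energy here).
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```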

zhian21 commented 9 months ago

At the beginning of Chapter 2, the authors discuss the importance of the iterative model, which emphasizes iteratively examining data and refining theories before formulating hypotheses. While the model encourages researchers to discover novel patterns or findings, it does not seem to conflict with, or differ much from, the traditional deductive model. As the diagram (Figure 2.1) on page 15 illustrates, the only difference between the models is the first step: the deductive model is more theory-driven, while the iterative model is based on the interaction between existing data and theory. Testing newly formulated hypotheses will still require additional data. How should we better understand or apply the iterative model in social science?

ddlxdd commented 9 months ago

In chapter two, the book presents text data as a powerful resource whose analysis may even provide a novel way to conduct research. This approach reshapes traditional research methodology, moving away from the linear sequence of formulating a question, gathering data, and then conducting analysis. Instead, it enables researchers to uncover insights latent within the data, insights that may not have been the initial focus of the study. Is there any disadvantage to this way of conducting research? If the quality of the data is guaranteed, would this also be the cheapest way to conduct new research?

runlinw0525 commented 9 months ago

The second chapter of the book highlights the increasing significance of computational methods for testing theories in social science research. I wonder how large-scale text data and advanced computational tools are influencing the balance between inductive and deductive research methods in social science?

joylin0209 commented 9 months ago

In Chapter 2, the authors mention that social scientists are increasingly using computational methods to analyze large volumes of documents. Would this change the ratio of inductive to deductive approaches in research? What impact does this shift have on the development of social science theory? At the same time, I am also thinking about whether the iterative model is suitable for different types of social science research, or is it more suitable for specific fields or problems?

volt-1 commented 9 months ago

After reading the chapter "Selecting Documents," I became aware of various biases in the text selection process. Resource and incentive biases are particularly crucial when building a representative corpus, especially when the subject of study is not yet clearly defined. Moreover, the constant changes in digital policies and the need for comparisons across time and sources add complexity, making these biases moving targets.

Additionally, the uneven distribution of document collections over multiple decades is a common issue. Building on this understanding of biases in text selection, I want to dig deeper into a question related to the "garbage in, garbage out" principle in machine learning. Considering the near-monopoly of English-language corpora among high-quality text data, this linguistic centrism is prominent in the "arms race" to develop large language models (as seen among competitors like OpenAI, Google, and Baidu). If content trained in a particular language is widely accepted, could this lead to the formation of a universal set of values, close to a form of cultural colonization for the 21st century? How should this language-based bias be recognized and addressed when building representative corpora and developing global AI applications?

chenyt16 commented 9 months ago

In Chapter 2, the authors introduce the iterative model, which combines deductive and inductive approaches and enables researchers to refine their sampling and corpus based on preliminary analysis. After reading Chapter 4, I am questioning whether this re-collection process will introduce subjectivity and bias, and if so, how we should prevent it.

michplunkett commented 9 months ago

Throughout the assigned chapters, the authors repeatedly stress the importance of trying many models and of having a deep, thorough understanding of both the corpus and the environment that produced it. The justifications for these statements are intuitive and certainly understandable, as one shouldn't make claims about things one doesn't fully understand.

From a practical perspective, though, finding someone who has all of these talents feels incredibly unlikely. You could very well find someone with a general understanding of most of them, but beyond that it feels like an unrealistic ask. Are the teams that do this kind of research looking to accumulate generalists, or experts in particular disciplines? To what degree can you reasonably ask someone to be an expert on several types of machine learning algorithms AND have expansive knowledge of a specific niche of sociological research? If you're not looking for that proverbial unicorn researcher, what sort of knowledge sharing do you expect among members who each have only a particular domain of expertise?

YichenDai commented 9 months ago

In chapter 2 (p. 28), Grimmer et al. (2022) use Martin Luther King Jr.'s "I've Been to the Mountaintop" speech to illustrate how text can carry rich, multifaceted meanings that extend beyond its literal content, including historical, cultural, and symbolic contexts. So, how can text analysis methods effectively distill the complexity of historically and culturally significant texts, like King's speech, without losing essential contextual meaning? What are the specific challenges in applying text analysis methods to such high-dimensional texts?

Audacity88 commented 9 months ago

Grimmer, Roberts, and Stewart (2022) argue that computational methods allow social science fields that previously used primarily deductive research processes to adopt inductive paths. In 2.7.3, they imply that the main reason for this difference is that experimentation and data collection are much cheaper than they used to be. While I agree that this is a major change, I wonder if the situation is really that simple. Does altering a field's primary research path from deductive to inductive merely widen the range of possibilities, or does it also alter the fundamental character of the field in a way that previous generations of social scientists might find disturbing? In other words, can the social sciences switch from observation to experimentation without altering their fundamental character?

alejandrosarria0296 commented 9 months ago

In the first chapter of Text as Data, when discussing validation strategies for text-as-data research, the authors compare traditional validation strategies (like predictive performance) with strategies specific to text-as-data research in the social sciences (how well the models provide insight into relevant concepts). Assuming the two are not mutually exclusive, what are the strategies for finding a "goldilocks" equilibrium between them in an effective and efficient way?

ethanjkoz commented 9 months ago

In this week's reading (2.7.6), the authors argue that validation is a continuous process in content analysis. Validation is key to establishing the accuracy and validity of findings. Furthermore, the authors discuss placing "humans in the loop" during this process, noting that most techniques do so. This validation process heavily involves human coders and subject experts. Since the authors note that not all validation methods place humans in the loop, what do those processes look like, and how much can we trust their results? Additionally, how does the process of validation differ between supervised and unsupervised analyses?
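
On the supervised side, one check that keeps humans only at the labeling stage is ordinary held-out validation against a previously hand-coded sample; a minimal sketch with toy texts and hypothetical labels:

```python
# Hedged sketch: k-fold cross-validation of a supervised classifier
# against hand-coded labels. Texts and labels below are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["tax cuts now", "expand public healthcare", "lower corporate taxes",
         "universal childcare funding", "deregulate energy markets",
         "raise the minimum wage"]
labels = [1, 0, 1, 0, 1, 0]  # hypothetical hand-coded categories

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=3)
print(scores.mean())  # held-out accuracy: one partial validity check
```

Unsupervised analyses lack such a direct score, which is presumably part of why human reading and external facts matter more there.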

Carolineyx commented 9 months ago

Q1: Has the field figured out how to report findings beyond pre-registration, especially when new (or iterated) research questions are generated during data analysis? Q2: Is there a way to calculate "quantities of interest," and is there any official standard for determining whether they are sufficient to carry out the research, similar to power analysis in quantitative research?

Caojie2001 commented 9 months ago

In the second chapter, the authors propose an "agnostic approach" to text analysis, in contrast with the traditional structural approach. Compared with the latter, the agnostic approach puts more emphasis on methodological diversity, given the complex nature of text analysis. This point is explored further in the relationship between research question, corpus construction, and text representation. However, since different methods of corpus construction and text representation are all inevitably developed for a specific purpose, would it be possible to establish more structured, systematic procedures for certain stages of data processing and analysis, according to the nature of the research question and the texts?

YucanLei commented 9 months ago

The authors suggest that a social science project can look very different when it begins and when it ends. Does this mean that when we are reproducing projects, we could in fact come up with other projects of our own? I am quite struck by this perspective because it feels almost like artistic creation: while you are drafting, the inspiration comes to you.

Twilight233333 commented 9 months ago

In the Selecting Documents chapter, the authors identify four types of bias, but I believe there is another, potentially overlooked problem that appears in web texts in many areas, particularly in business-related topics: a company might purchase large numbers of fake bot reviews or articles attacking a competitor's product, company culture, etc., or fan groups may flood reviews with praise for products endorsed by their favorite celebrities. This may be similar to incentive bias, but it may also be a new kind of bias. How can you tell whether a text is a personal expression or deliberate filler generated by a script or bot?
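
Detecting this reliably is an open problem, but cheap screens exist; here is a hedged heuristic sketch that flags near-duplicate reviews via TF-IDF cosine similarity (threshold and texts are hypothetical, and real bot detection would need richer signals like timing, accounts, and networks):

```python
# Hedged heuristic sketch: flag suspiciously similar review pairs as
# candidate scripted/bot content. Threshold and reviews are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    "Best product ever, changed my life, five stars!!",
    "Best product ever. Changed my life. Five stars!",
    "Decent blender, a bit loud but works fine.",
]
sim = cosine_similarity(TfidfVectorizer().fit_transform(reviews))
for i in range(len(reviews)):
    for j in range(i + 1, len(reviews)):
        if sim[i, j] > 0.9:  # suspiciously similar pair
            print(f"possible scripted pair: {i} and {j} (sim={sim[i, j]:.2f})")
```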

erikaz1 commented 9 months ago

I was most interested in Grimmer, Roberts & Stewart's (GRS's) principle of validation. GRS write: "We will know that our representation is working in a measurement model if the measures that we have created using that representation align with validated, hand-coded data and facts that we know about the social world." Given that there are arguably "no laws in the social sciences," how can a researcher develop intuition (a sliding scale of degrees of certainty and uncertainty) about which social theories are "true"?
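
The "align with hand-coded data" half of that standard, at least, can be made concrete; a minimal sketch using Cohen's kappa (chance-corrected agreement) on hypothetical labels:

```python
# Hedged sketch: chance-corrected agreement between a model's output and
# human annotations. Both label vectors below are toy placeholders.
from sklearn.metrics import cohen_kappa_score

hand_coded   = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical human codes
model_output = [1, 0, 1, 0, 0, 0, 1, 1]  # hypothetical model measures

print(cohen_kappa_score(hand_coded, model_output))  # 1.0 = perfect agreement
```

The "facts we know about the social world" half is the harder, intuition-building part the question points at.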

cty20010831 commented 9 months ago

In Chapter 2 of Text as Data, the authors take an agnostic (rather than a structured) stance toward text analysis. Relating this to the discussion of measurement, where the authors argue that it is important to show that a method of measurement reflects the concept it is meant to measure, I wonder how construct validity holds up in computational social science projects dealing with massive amounts of textual data. Specifically, how can researchers ensure the same level of construct validity as traditional, structured, questionnaire-based measurement while maintaining the flexibility to generate their own measures? And what would be good standards for assessing it?

XiaotongCui commented 9 months ago

In Chapter 4 of "Text as Data," the authors discuss four types of bias in selecting documents. Regarding Section 4.2.2 on incentive bias, I would like to add a point: beyond the fact that individuals tend to post content favorable to themselves, online content also tends to polarize. People dislike moderate viewpoints, and no one enjoys consuming middling information. In such a situation, how can we minimize bias as much as possible?

hongste7 commented 8 months ago

The chapters discuss the use of text as a rich source of data for social science research, emphasizing an iterative and cumulative research process. How do researchers address the potential for confirmation bias in iterative research models, especially when they may inadvertently seek patterns in data that align with their initial hypotheses?

QIXIN-LIN commented 8 months ago

The explanation of the deductive and iterative models in the second chapter caught my attention. I've realized that I often take the deductive approach in my own work. The iterative model seems promising for generating innovative ideas, but are there any constraints associated with it? Are there specific situations or conditions where the deductive model might outperform the iterative model?

Brian-W00 commented 8 months ago

How can the principles of selection and representation in text analysis, as outlined in the book, be effectively applied to improve the validity and reliability of social science research that utilizes large-scale digital text corpora?

floriatea commented 7 months ago

Considering the rapid evolution of language, especially with the introduction of internet slang, emojis, and evolving linguistic norms, how can text analysis models be designed to continuously adapt to changing language use without significant manual intervention? How can they uncover insights specific to disciplines that have not traditionally relied on text as data? And where could text analysis be particularly impactful in the next decade, especially in the life sciences, where academic barriers are high?

muhua-h commented 7 months ago

Section 3.4 emphasizes the importance of validation. How much validation can be considered enough? And if we find contradictory information in this process, how do we reconcile the findings?

JessicaCaishanghai commented 6 months ago

How does the text-as-data method handle the nuances and context of language to ensure that the derived quantitative data accurately reflect the qualitative aspects of the original text sources?

icarlous commented 6 months ago

The authors imply that social science research can evolve significantly from its inception to its conclusion. Does this suggest that, in the process of reproducing research, we might inadvertently develop entirely new projects? And how can we deal with this possibility?