Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Measuring Meaning & Computational Reading #1

Open bhargavvader opened 4 years ago

bhargavvader commented 4 years ago

Hello friends,

Comment below with questions or thoughts about the reading for the 1st week of Computational Content Analysis 2020. For your reference, it is:

Evans, James and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory” Annual Review of Sociology 42:21-50. DOI: 10.1146/annurev-soc-081715-074206

Engaging with your peers is recommended! Reactions only count towards 'top comments' if they include a 'thumbs-up', but you can add other emojis on top of the thumbs-up as well.

laurenjli commented 4 years ago

I'm interested in how researchers generate labeled data for supervised learning approaches to text analysis. While text is often readily available, the task of labeling the data can be a very large and onerous part of the ML process. Are there advancements and/or techniques that are being used to label data (beyond manually tagging words/sentences)?
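One direction I have come across is programmatic or "weak" labeling, where simple heuristics stand in for manual annotation and a supervised model is later trained on the noisy labels. A toy sketch in plain Python (the keyword lists are made up purely for illustration):

```python
# Toy example of programmatic ("weak") labeling: instead of hand-tagging
# every document, simple heuristics assign noisy labels that a supervised
# model can later be trained on. The cue lists below are illustrative only.

POSITIVE_CUES = {"great", "excellent", "love"}
NEGATIVE_CUES = {"terrible", "awful", "hate"}

def weak_label(text):
    """Return 1 (positive), 0 (negative), or None (abstain)."""
    tokens = set(text.lower().split())
    if tokens & POSITIVE_CUES:
        return 1
    if tokens & NEGATIVE_CUES:
        return 0
    return None  # abstain; another heuristic or a human can cover this case

docs = [
    "I love this new policy proposal",
    "The rollout was awful and confusing",
    "The committee met on Tuesday",
]
print([(d, weak_label(d)) for d in docs])
```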

sunying2018 commented 4 years ago

I am interested in text analysis, especially extracting language features. As mentioned in this article, the most commonly used language features are derived from the lexicon, but the lexicon is an undifferentiated group of words. Since some words carry deeper meaning in certain language contexts, I am curious whether there are other vocabularies or techniques that refine this undifferentiated grouping to achieve more accurate text analysis.

katykoenig commented 4 years ago

After reading about social dynamics and text, specifically regarding the differences and changes in speech due to power dynamics, I am curious whether there has been analysis linking visual power dynamics and NLP (when a party's language reflects a position of power, does their body language also reflect this?). I am also interested in applications of NLP to coded or subversive texts that require cultural context to fully understand. For example, in Guamán Poma's "El primer nueva corónica y buen gobierno," we see linguistic praise of the Spanish crown (and critique of its colonialism), but the power dynamics of his drawings are more complicated.

adarshmathew commented 4 years ago

This reading was a great overview of the array of methods available in text analysis. I found the network-based and word-embedding approaches to be very interesting in how they preserve dependency and structure of language. I'd be curious about methods which integrate topological structure in a word-embedding framework, something that captures 'distance' between words/phrases while overlaying a network structure (to indicate dependence) over them.

On the flip side, I'd be curious to know whether there are specific language-related tasks vis-à-vis social inquiry to which NLP methods are not suited. What do you think are the limits of these methods?

rkcatipon commented 4 years ago

Hi all! Given the complexity of language and the brevity of text on social media, I wondered how LDA topic modeling and other dimensionality reduction techniques account for entities such as hashtags and emojis. Is there a recommended method for dealing with those text features, which connect to larger conversations and have broader meanings than those indicated in the given data set? I can see a vector approach working with hashtags to extract context clues, but would that work with emojis?

If king − man + woman ≈ queen, does the same hold for 🤴 − 👨 + 👩 ≈ 👸?
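As a rough sketch of what I mean (assuming a pretrained embedding loaded through gensim's downloader; the model name is just an example), the textual analogy can be checked like this, and the emoji version would only work if emoji appear in the training vocabulary:

```python
# Word-embedding analogy check with gensim's pretrained vectors.
# Assumes the vectors can be downloaded; any pretrained KeyedVectors works.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically returns 'queen' as the top hit

# The emoji analogue would raise a KeyError here, because emoji are not in
# this model's vocabulary; it could only work with embeddings trained on a
# corpus (e.g. tweets) where 🤴, 👨, 👩 occur as tokens.
```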

skanthan95 commented 4 years ago

Given that content analysis methodologies can be used to understand and predict human behavior via text, I'm interested to see how we can use NLP and ML techniques to understand the power structures within (and the evolution of) callout/cancel culture on Twitter. Can we predict which tweets are more likely to get a user cancelled?

heathercchen commented 4 years ago

This course and the week one lecture really broadened my view of sociology and social science as a grand topic. As a newcomer and former outsider to sociological theory, I am fascinated by Figure 2 in this article, which shows how text analysis tools are implemented in the process of knowledge production. But I still have a question regarding the details of this figure: how can we define a certain dataset as "full data"? For sure, we can now access substantially more data than ever before, but these data are still a "sample," a small partition of all the data that represent our social life.

lkcao commented 4 years ago

  1. For orientation: Evans and Aceves (2016) mention that there are currently two paradigms for computational social science, targeting confirmation and discovery, respectively (page 29). I have encountered several such studies, but would like to know more about the confirmatory ones and how they use models to generate new labels for data points. Are machine-generated labels currently accepted in social science as legitimate input for research? What is the commonly accepted threshold of accuracy for generated labels to be taken as reliable?

  2. For the fundamental technical readings: a question about Chapter 3, "Probability and Information Theory" (page 69). Can you explain a little more about measure theory and how it is related to machine learning and probability?

lkcao commented 4 years ago

I'm interested in how researchers generate labeled data for supervised learning approaches to text analysis. While text is often readily available, the task of labeling the data can be a very large and onerous part of the ML process. Are there advancements and/or techniques that are being used to label data (beyond manually tagging words/sentences)?

I have tried some basic programs on my machine before, like the Viterbi algorithm for tagging tasks, and MEMMs and other sequential models for named entity recognition, some with decent labelling outcomes. I am sure there are many others :-) I will share them if I try more.
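For instance, a quick off-the-shelf version of the tagging/NER pipeline can be put together with spaCy (assuming the small English model has been downloaded); this is just a sketch, not the models from the reading:

```python
# Off-the-shelf POS tagging and named entity recognition with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("James Evans teaches computational content analysis in Chicago.")

for token in doc:      # part-of-speech tags from a statistical tagger
    print(token.text, token.pos_)

for ent in doc.ents:   # named entities with their labels
    print(ent.text, ent.label_)
```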

rachel-ker commented 4 years ago

I'm interested in how text analysis deals with "messy" text, e.g. short forms, typos, or creole languages. I'm also curious how we can understand what the conditions are for a good "discovered" theory and when these methods may not be appropriate. Are there frameworks or prerequisites we can check for, e.g. interpretability, clear distinct groups, repeatability?

bjcliang-uchi commented 4 years ago

How are lexicon sets selected: purely subjectively, based on personal expertise, or are there also algorithms or standards? Also, how do NLP algorithms for different languages differ in logic?

ziwnchen commented 4 years ago

Thanks for sharing this great article! Regarding the "exploding supply and demand for text information," I want to know whether the supply and demand for all types of text information increase equally. For example, the attention-attracting strategies used by many content producers may lead to the growth of specific kinds of text or genres. If that is the case, will any systematic bias or inequality arise in large-scale text analysis? How could we potentially measure this underlying bias?

ckoerner648 commented 4 years ago

Hello, World! I’ve heard that two people stop communicating on Facebook when they become a couple. Are there similar cases in which digital data and real-world social connections diverge drastically? Can we computationally try to control for them?

cytwill commented 4 years ago

Thank you for sharing so many potential uses of ML and NLP methods for launching social science research in this paper, which is exactly what we as computational analysts hope to do. My two specific questions are below:

  1. Sometimes we have clear background theoretical assumptions before exploring the data, and sometimes not. Do you think theoretical assumptions act more as guidance or as limitations for data-driven research in social science? Also, some social science theories are conceptual or descriptive, and quantifying them into something we can computationally measure is somewhat subjective; do we have any criteria to validate the reasonableness of such quantifications?

  2. In NLP, especially when we need to use users' online comments, the language context can sometimes change the meaning of their words. Could you explain further how to distinguish these field-specific differences with NLP techniques, other than by constructing a specialized dictionary?

wunicoleshuhui commented 4 years ago

I'm quite interested in network and vector space approaches, and specifically how they are applied in social media settings. My question is, if the vector space model is applied to analyzing large collections of short texts such as tweets, will the accuracy of evaluating distance and similarities among words and phrases be affected, and if so, how?
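To make the worry concrete, here is a tiny illustration with scikit-learn's TfidfVectorizer (purely my own example, not a method from the reading): two tweets expressing the same idea can share almost no surface vocabulary, so their bag-of-words vectors end up nearly orthogonal.

```python
# Tiny illustration of why short texts are hard for vector-space comparison:
# with so few tokens, paraphrases may share no vocabulary at all, so their
# TF-IDF vectors have near-zero cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "traffic is terrible downtown today",
    "gridlock in the city center this morning",  # same idea, different words
]
X = TfidfVectorizer().fit_transform(tweets)
print(cosine_similarity(X[0], X[1]))  # near zero despite similar meaning
```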

acmelamed commented 4 years ago

My most pressing question regarding this article is a purely practical one. Among many of the example studies referenced, such as that conducted by Goldberg et al., the corpora analyzed consisted of (in that particular case) "millions of emails" solicited from a private firm. Even with the resources of an academic institution, this seems like a daunting precedent for those wishing to employ similar methodologies. How does a researcher go about acquiring such a massive amount of potentially sensitive data from a private company?

arun-131293 commented 4 years ago

It’s interesting to read the claim that computational text analysis can infer underlying attitudes and world views from fine differences in (externalized) language. I am skeptical of the scope of that statement, however, since modeling the intent of the speaker is more or less impossible. Still, I’m curious whether it would also be possible to identify such patterns when there is a kind of code that people use to cover their attitudes, e.g. when Americans speak of the “urban population” but in fact mean “the black poor,” possibly so as not to appear racist, or the usage of phrases like “moral clarity” when a tyrant hitherto supported by the US government comes to be seen as a threat. It would be interesting to uncover what such words actually mean from the context of their usage.

kdaej commented 4 years ago

I can imagine a situation where unsupervised machine learning is used to get some intuition about what is going on in a given set of texts and to generate hypotheses. Before machine learning techniques were available to researchers, this process was done solely by human researchers. However, since technology now enables us to check whether our intuition points in roughly the right direction, I wonder how much we can rely on this method to guide our research. In other words, how much can we rely on unsupervised machine learning techniques to check whether we are asking the "right" question?

luxin-tian commented 4 years ago

This article reviews contemporary research that employs content analysis methods in the social sciences and reveals the power of content analysis and NLP techniques for making inferences about social games in all possible fields. As mentioned in the article, recent work examines "shifts in content over history to identify changes in the social world underlying it," and phonology can be used to distinguish dialects and "reveal distinct social worlds underlying spoken interaction." This triggers my curiosity about the applicability of NLP techniques to parsing historical content and different branches of dialects. Are there constraints faced by sociologists that make it difficult or even infeasible to use current NLP techniques for historical sociological research or research on cultural diversity, given that dramatic variation, both chronological and geographical, prevails in the evolution of language?

YanjieZhou commented 4 years ago

Personally, I am very interested in how ideas are transmitted through the channels of social media, especially ideas that experience different degrees of change when read by different groups of people in different cultural contexts. I think those changes in ideas about the same topic, according to different standpoints and ways of understanding, are a main source of disputes and reflect people's deeper concepts. Thus, I am wondering whether there is any research that relates to this idea, or whether text analysis can precisely reflect the differences between ideas that stem from different groups.

Lizfeng commented 4 years ago

This article introduces a research framework for social research using content analysis methods. Personally, one of my most important takeaways from this article is the figure showing how researchers use supervised versus unsupervised methods to either confirm or discover theory. This framework addresses one of the critiques of content analysis, namely that it amounts to a single hypothesis test. If we can utilize content analysis to prove, disprove, or improve existing theory, without fully replacing classical statistical methods, we may produce more generalizable research.

yaoxishi commented 4 years ago

In this article, the lexicon is mentioned as the most commonly used language unit. I am wondering whether the lexicon is always the best feature to extract from text, since it is hard to relate to context. Should we sometimes use other levels of textual features, and how do we decide which is better?

sanittawan commented 4 years ago

After having read the paper, I am curious to learn about the potential drawbacks of the unsupervised methods in social sciences research. It is frustrating to work on unlabeled data since we do not have access to the "ground truth." What are the ways that researchers can make sure that the pattern they discovered using unsupervised methods is close to the truth? Will having more data help?

VivianQian19 commented 4 years ago

The article gives a systematic review of how computational content analysis has been used in social science research. The typology of three kinds of research, ranging from analysis that directly engages with content to analysis that explores deeper social states, is interesting, and it makes me think about how much (more) we can learn from computational content analysis. Since the field is growing and the article suggests a great many opportunities brought forth by computational content analysis, I wonder whether there are any limitations and cautions that researchers should be wary of when using this approach.