UChicago-Computational-Content-Analysis / Readings-Responses-2023


1. Measuring Meaning & Sampling - orienting #54

Open JunsolKim opened 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's orienting readings: Evans, James and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory”. Annual Review of Sociology 42:21-50. DOI: 10.1146/annurev-soc-081715-074206

Thiyaghessan commented 2 years ago

Hi all,

I appreciated the article's categorisation of computational text analysis articles into three categories: content, processes, and states. I am confident in computational text analysis's ability to achieve the first two purposes, but not the last, specifically the identification of sentiment and preferences. There are obvious word choices/sequences that make positive/negative/neutral affect easy to decipher from speech/text. However, more nuanced inferences of sarcasm/irony are more difficult, especially given the changing contexts in which these sentiments are expressed. What are some of the new advances in the field of sentiment analysis that address these limitations and can accurately detect nuanced sentiment at the individual level? I would appreciate suggestions of papers/articles etc. Thank you!
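
A concrete illustration of the gap (my own sketch, not the article's): NLTK's VADER, a lexicon-based scorer, handles overt polarity well, but it scores surface words, so sarcasm that inverts them slips through. The example sentences below are invented.

```python
# Lexicon-based sentiment handles overt polarity but misses sarcasm,
# which inverts surface-level sentiment. Requires: pip install nltk
import nltk
nltk.download("vader_lexicon", quiet=True)

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

overt = "This method is wonderfully clear and genuinely useful."
sarcastic = "Oh great, yet another flawless model that totally understands jokes."

for text in (overt, sarcastic):
    # polarity_scores returns neg/neu/pos proportions and a compound score
    # in [-1, 1]; the sarcastic line still scores positive because the
    # lexicon only sees "great" and "flawless".
    print(f"{sia.polarity_scores(text)['compound']:+.2f}  {text}")
```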

mikepackard415 commented 2 years ago

The article mentions that concerns about non-rigorous data mining should be ameliorated by the explosion of available high-quality social data, and that we may be entering a "Renaissance of discovery" in computational analysis of text data. Given that we are now 5.5 years on from the publication of this article, I wonder whether we think this optimism has been validated. If not, do we have a sense of the major unforeseen challenges the field has faced?

isaduan commented 2 years ago

Hi all,

The article mentioned that at the time there 'was' a lack of good models for higher-level linguistic discourse, e.g. how sentences relate to one another, aggregate into paragraphs, and form more or less effective arguments. Do we now have better models of higher-level linguistic discourse? Ones that, for example, identify areas of disagreement or agreement in congressional speeches/judicial records?

Adding to Thi's question, I wonder what the limiting factors for detecting nuanced irony are: is it the high cost of obtaining training data, that the theoretical construct of irony is not very clear (i.e., too many borderline cases) even for humans, or that understanding irony requires contextual understanding (which connects to 'higher-level linguistic discourse')?

pranathiiyer commented 2 years ago

Hey everyone! I had two questions. With respect to POS tagging, how does one tackle situations where some kind of pronoun ambiguity might occur? For instance, in a sentence such as "Bob and Ash went to the market, and he bought an apple", I was curious to know if and how situations like these are dealt with. Secondly, the article mentions that most resources for text analysis are available only for high-resource languages. I was wondering whether, since the time this paper was published, new libraries have been written for other languages, or perhaps new ways of analysing a broader set of languages have been developed?
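
To make the pronoun question concrete, a minimal sketch (mine, with spaCy; it assumes the `en_core_web_sm` model has been downloaded): a POS tagger only labels "he" as a pronoun and says nothing about its referent. Linking "he" back to "Bob" or "Ash" is the separate task of coreference resolution.

```python
# POS tags alone do not resolve pronoun reference.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bob and Ash went to the market, and he bought an apple.")

for token in doc:
    print(token.text, token.pos_)  # "he" -> PRON, referent left unresolved

# Deciding whether "he" is Bob or Ash is coreference resolution, handled by
# separate components (e.g. spacy-experimental's coref pipeline), not the tagger.
```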

GabeNicholson commented 2 years ago

There are some concerns in the humanities about the validity of using these digital text methods to uncover real epistemological "truths" about human interactions and cultures (especially when analyzing historical written text), essentially claiming that any pattern found is subject to our current interpretation and says little about how things really are. Has this skeptical opinion changed over the years as NLP techniques have gotten better and better (especially unsupervised methods, which by design encode fewer preconceived categories)?

Also, to add to Thiyaghessan's post, I'm curious about a possible ceiling on the validity of these approaches for extracting meaning from transcripts when the usage of irony or sarcasm relies heavily on non-verbal cues such as facial expressions and "delivery" (as comedians would say).

linhui1020 commented 2 years ago

Hi Prof. Evans and classmates,

Prof. Evans, thanks for sharing this article. It provides an exhaustive literature review of how previous social scientists have employed content analysis to identify interesting findings, and it also provides a framework for the audience to understand the usage of models in different contexts. One of my focuses, or questions, is about a concept or construct you mention in the paper: "collective attention". How does this concept differ from other definitions of attention? Ocasio (2011) categorizes three types of attention from neuroscience, namely selective attention, executive attention, and vigilance, and across different fields and research contexts the emergence of the attention concept is divergent and complex, even when similar technical tools are used for inference. How should we interpret the embedded logic that justifies using ML to investigate the relationship between human attention/consensus/awareness and an outcome?

Another question: since some textual resources such as journals and books are the contributions of a team or group of persons, how can we identify the consensus, bias, or attention of individuals within a collaborative project? Also, when some textual resources are the consequence of negotiation among different entities (such as laws, policies, and decisions), what kind of model could be used to infer those negotiations from the eventual textual resources, without actual observation of the processes or an experiment, perhaps through conference memos?

Thanks a lot

sizhenf commented 2 years ago

This paper offers a very thorough summary of text analysis methods and models, and I am very excited about the prospect of applying these techniques to a very wide range of social science problems. A potential concern/question I have is that many social scientists have been using text data from the internet (mostly social media like Twitter and Facebook) to study public communication. Naturally, when we choose a sample for empirical analysis, we hope that it is representative of the population of interest. But when we use social media data to study a question, say the US election, it does not seem to me that social media users (even if we constrain them to US voters) can be a representative sample of the US population. For instance, I would assume that social media users tend to be younger than the average US population. It is also not uncommon for characteristics of social media users (for example, political ideologies, partisanship preferences) to differ dramatically across social media platforms. My question is, has this issue been a concern of social scientists, and how do they usually adjust for this bias?
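
One standard partial fix, offered here as an illustration rather than anything the article prescribes, is post-stratification: reweight sampled users so the sample's demographic mix matches known population targets. A toy sketch in Python; every number below is invented.

```python
# Toy post-stratification: weight each stratum by population share / sample
# share, so weighted statistics reflect the target population mix.
sample_share = {"18-29": 0.45, "30-49": 0.35, "50+": 0.20}      # made-up Twitter sample
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}  # e.g. census targets

weights = {g: population_share[g] / sample_share[g] for g in sample_share}

for group, w in weights.items():
    # over-represented young users get weight < 1, older users get weight > 1
    print(f"{group}: weight = {w:.2f}")
```

This only corrects for observed covariates, of course; self-selection on unobservables remains.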

Sirius2713 commented 2 years ago

Hi all. This paper provides a comprehensive review of content analysis in social science. While I'm excited about the potential opportunities, I'm also concerned about how people can protect their privacy now that everything, from well-curated documents to likes on Facebook, can be researched. And how should researchers pick datasets for content analysis so as to avoid privacy violations?

And adding to sizhenf's question, how can we minimize bias in content analysis when the corpus we use may be skewed or incomplete? Text data is everywhere, and it's hard to get an exhaustive collection.

ValAlvernUChic commented 2 years ago

Hi everyone!

While this assigned paper was largely focused on text-based data, I couldn't help but think about memes as a communicative medium (in some of my circles it's the main mode of communication, unfortunately). Intuitively, negotiating textual cues with the often contextually-saturated images that they caption would seem extremely difficult. At the same time, in a digital age when these media increasingly dominate political and social discourse, they seem incredibly important to theorize. A paper by Dimitrov et al., "Detecting Propaganda Techniques in Memes", did something interesting, but they still relied on manual annotations of memes before using them. I was wondering about 1) the advances in methodological approaches in this multi-modal space, specifically whether anyone has had success with unsupervised approaches, and 2) how we might even approach this if the context is so latent. I know why Will Smith is crying in that photo and the type of text most appropriate for it, but how can we get the computer to know too?

Thanks all! :)

facundosuenzo commented 2 years ago

The paper is fascinating, clear, and pedagogical. One of the things that resonated with me was the back and forth with what the authors call the "underlying social game" and how researchers become indispensable as mediators. This appears interesting for sociological purposes because it restores to social scientists some of their authority (or at least places them at the same "level" as their data or methods). I wondered how this iteration and the inclusion of these methodologies could help expand other processes in a research design. For example, can we develop a better and more systematic literature review that accounts for the "holes" in the field, using papers and books as texts? But also, what could happen if we decide to create an algorithm that analyzes in-depth interviews? In other words, can these new techniques impact "traditional" ways of doing sociological analysis?

yujing-syj commented 2 years ago

Hi everyone!

This paper gives us an overall summary of the computational approaches and theories used in text analysis, as well as how these tools are applied to content, process, and states. I especially love the applications part, which shows us the power and usefulness of these methods. After reading this paper, I have a few questions: 1) When we want to analyze the process of communication, how can we ensure the accuracy of the result? For example, McFarland et al. (2013) explored moves during the game of courtship by analyzing audio and textual content from speed-dating encounters. I think using prosodic attributes introduces lots of noise, because these are also affected by the recording machine, race, and other factors. 2) From my understanding, most of the studies focus on detecting and analyzing the underlying relationships behind content. Are there applications that are more practical in the social science field? A very rough thought is that researchers could use content analysis to simplify complex but formal documents, such as legal documents with similar syntax, and translate them into something easy for layfolk to understand, which could save time and cost for policy makers and everyone.

Thanks!

Emily-fyeh commented 2 years ago

Hi everyone,

This is a well-constructed paper that precisely outlines the methodologies of content analysis in the social science field, which is very clear to me (especially the confirm/discover dichotomy for theory). My question is, when trying to confirm a theory, how do we systematically search for the optimal supervised methods (and rule out the infeasible ones)? Or, how do we justify that the choice of a certain ML method is legitimate enough to validate the theory? That is, are there any rules of thumb for finding a good marriage of research questions and methods? In my opinion, in most content analysis research, the interpretation can be more critical than the (statistical) significance. And perhaps when using unsupervised methods in an attempt to discover theories, the construct matters even more, since the researchers operationalize concepts within a theory and decide the premises and generalizability.
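
One partial, purely empirical rule of thumb (my sketch, not the article's prescription): compare candidate supervised methods on held-out data via cross-validation over a pre-committed grid, so the choice is justified by out-of-sample fit rather than taste. A sketch with scikit-learn on a public corpus:

```python
# Compare two text classifiers with 5-fold cross-validation instead of
# committing to one a priori. Requires: pip install scikit-learn
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "talk.politics.misc"])

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    pipe = make_pipeline(TfidfVectorizer(max_features=5000), clf)
    scores = cross_val_score(pipe, data.data, data.target, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```

Out-of-sample accuracy alone doesn't settle the interpretability question, but it does rule out clearly infeasible methods.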

Thanks all!

melody1126 commented 2 years ago

For the third way in which computational tools contribute to social discoveries and theory formation, which level of research would allow us to draw stronger inferences about social states: the individual level or the collective level? Are they different? (pages 41-42) From the first discussion of using computational tools to find out about collective attitudes, it seems that big text data is pretty good at tracing social attitudes at the collective level. Would the same be true for sentiment analysis?

hazelchc commented 2 years ago

Hi everyone! I have two questions:

  1. I'm curious how recent NLP tools deal with cultural expressions. In particular, some movie characters, celebrities, or historical figures are associated with specific, collective impressions by people of a certain culture. Sometimes their names are mentioned to refer to the underlying cultural/historical meanings. Is there a way that machines can interpret those names accurately?

  2. Social data (especially data from social media) can be loaded with junk and spam. If these are not carefully handled, research outcomes can be completely altered. For instance, in Back and colleagues’ (2010) study, messages generated by a single pager had a huge influence on the findings. However, it seems difficult and time-consuming for researchers to identify them. I'm curious whether there are better ways to address the issue.

YileC928 commented 2 years ago

The paper provides a well-structured and in-depth literature review of computational content analysis in the social sciences. After reading it, I had a much clearer understanding of the research contexts, methodologies, and applications. There are two general points that I hope to learn more about:

  1. The paper mentions “unreported and statistically unaccountable data mining” on page 25. What would be an example of that, and how could we avoid it? What are the criteria for “statistically accountable data mining”?
  2. The paper divides current literature into three categories, and each presents a level of social inference. How could they interact with each other? Is it possible to combine two or more types in one study? For instance, when studying information diffusion, what are the possible ways to model the aggregation of the micro-level processes into collective communication behavior?

NaiyuJ commented 2 years ago

I'm thinking of two questions while reading this paper: 1) How can we better utilize machine learning & NLP methods and make adaptations when we work on different languages embedded in different cultural environments? For instance, when doing political science research on China using "text-as-data" methods, we are often concerned with how to detect the underlying true meaning behind the words. Unlike English, there may be implicit implications in Chinese expressions. I'm wondering how scholars adjust their methods when there are cultural differences in their contexts. 2) Are we able to mix supervised and unsupervised learning? I'm thinking that sometimes we want to both learn and discover something in our dataset. Is this possible? (A sketch of one such mix follows.)
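
On 2): yes, mixing the two is a standard setting. One form is semi-supervised self-training: fit a classifier on the few labeled texts, then let it pseudo-label confident unlabeled ones and refit. A minimal sketch with scikit-learn; the toy documents and labels below are invented.

```python
# Self-training mixes a supervised learner with unlabeled data.
# scikit-learn marks unlabeled examples with -1.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

texts = ["great policy outcome", "terrible corrupt deal",
         "a new policy was announced", "officials met on tuesday"]
labels = np.array([1, 0, -1, -1])  # two labeled, two unlabeled documents

X = TfidfVectorizer().fit_transform(texts)
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
model.fit(X, labels)
print(model.predict(X))  # predictions now cover the unlabeled documents too
```

Another common mix is to use unsupervised output (e.g. LDA topics) as features for a downstream supervised model.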

konratp commented 2 years ago

Hi everyone!

A lot of you have posted very interesting questions already, and I’m quite excited to learn with and from all of you this quarter! I am interested to see where the limits of ML and other approaches to text analysis lie, and I wonder how much those limits have shifted since this paper was published. As others have mentioned before, I can’t really imagine how ML approaches would accurately account for comedy and satirical speech, let alone memes that are highly context-specific. Yet, with the rise of Twitter, Tumblr, TikTok, and others, these more abstract forms of communication are becoming increasingly prevalent. I suppose my question is: is there any research that tackles these more elaborate forms of communication that rely so heavily on context to get any kind of message across?

AllisonXiong commented 2 years ago

Hi everyone!

I think this comprehensive review serves as a great orientation to the course. It offers a well-structured introduction to and taxonomy of current methods/tools, the types of data that can be obtained from content analysis, and the sociological questions it can answer. I do have two related thoughts or questions:

(1) Adding on to isaduan's question, if there are few good models for higher-level linguistic discourse (sentences, paragraphs, etc.), how can we embed and understand text under a certain 'context'? Even with the exact same wording, text can have different meanings or purposes in varying contexts. Someone mentioned the detection of irony, and that's one great example of my question. I think modeling relations among higher-level units of text is necessary for that.

(2) The article mentioned that more state-of-the-art ML and neural network models are being applied to content analysis, adding accuracy and reliability to the results. To my knowledge, there is a tradeoff between accuracy and interpretability in statistical models. How would researchers harness complex neural network models and still get theoretically valuable results from content?

chuqingzhao commented 2 years ago

Thank you for sharing this great article! Here are questions that I hope to learn more about:

  1. I wonder whether there is potential bias in applying NLP methods to analyze social data, and how we should deal with that bias. The article mentions several different word embedding methods and their word analogies. When implementing human analogical reasoning tests in word embedding models, the benchmark test set could be biased, reflecting human discrimination such as racism and sexism. One example of human-like bias is that word vectors related to women sit closer to kitchen and arts, while those related to men sit closer to science and engineering (a minimal measurement sketch follows this list). In this case, how should we detect and intervene on the bias in our texts and algorithms? What ethical considerations should we keep in mind?

  2. A follow-up question about bias and the social effects of algorithms: I also wonder, when researchers collect data from social media, whether and how they should account for the effects of the algorithms embedded in platform design. Many studies mentioned in the paper collect texts from social media platforms. For example, Bail finds linguistic patterns that maximize engagement (page 36). Given the wide application of recommendation systems, I am skeptical: the engagement pattern may be biased because platforms are designed to promote popular social media messages.

  3. How should we analyze videos or images embedded with text information? And how should we integrate different information cues, such as visual images with subtitles, short videos with bullet chats, and GIFs?
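
On question 1, the gender association can be measured directly, in the spirit of WEAT-style embedding-bias tests. A sketch (mine) with gensim's downloader; it assumes an internet connection to fetch the pretrained GloVe vectors (roughly 66 MB).

```python
# Measure how much closer occupation words sit to "she" than to "he"
# in pretrained GloVe vectors. Requires: pip install gensim
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

for occupation in ("nurse", "homemaker", "engineer", "scientist"):
    # positive gap = closer to "she"; negative = closer to "he"
    gap = vectors.similarity("she", occupation) - vectors.similarity("he", occupation)
    print(f"{occupation:>10}: she-vs-he similarity gap = {gap:+.3f}")
```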

hshi420 commented 2 years ago

Is it possible, in a piece of research, that a clustering algorithm yields several clusters that align with a theory, but through different mechanisms? For example, the algorithm and the theory use different features, or the features the algorithm uses are variants of the features used in the theory, because it is impossible to collect data on the theory's features. If such situations exist, how should we deal with them?

zixu12 commented 2 years ago

Hi all, I have questions similar to those posted by fellow students, such as the one about "unaccountable statistical data mining". Here are some other questions I had in mind while reading this paper: in Figure 4, computational text analysis is categorised into three categories: "(a) collective attention, framing, and thinking through the manifest and latent content of communication; to (b) social relationships through analysis of the process of communication; and ultimately (c) social identities, states, roles, and moves through linguistic signals embedded in communication." I believe that sometimes the understanding of (a) is contingent on the understanding of (b) and (c); that is to say, (a) might differ under different contexts of (b) and (c). Is there any research/paper on that as well? Thank you!

LuZhang0128 commented 2 years ago

This is an amazing article. Although the article claims that NLP and ML tools are more accurate than they previously were, the accuracy of, for instance, sentiment analysis is still not very high. I wonder, as for future directions, whether current researchers are focusing on developing more sophisticated algorithms (things like neural networks vs. traditional approaches), or on a better understanding of the linguistic structure of text?

MengChenC commented 2 years ago

Can you clarify and compare the traditional and modern ways of transforming text data into a format a machine can work with? We have lemmatization, stemming, Byte Pair Encoding (BPE), etc. What are the strengths and weaknesses of these methods, and in what circumstances should we choose one over the others? Similarly, we have counting, tf-idf, and Pointwise Mutual Information (PMI); how can we compare their capabilities and usage and then justify our choice?
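
A quick side-by-side of two of those choices (a sketch, assuming NLTK with the `wordnet` resource): stemming chops suffixes by rule, fast but crude; lemmatization maps tokens to dictionary forms, slower and more accurate when given POS information; BPE, by contrast, learns subword merges from data and is the usual choice for open vocabularies in neural models.

```python
# Stemming vs. lemmatization on the same tokens. Requires: pip install nltk
import nltk
nltk.download("wordnet", quiet=True)

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

for word in ("studies", "studying", "corpora", "better"):
    # the lemmatizer defaults to noun POS, so "better" stays as-is unless
    # called with pos="a"; the stemmer just applies suffix rules
    print(f"{word:>9} -> stem: {stemmer.stem(word):<7} lemma: {lemmatizer.lemmatize(word)}")

# Weighting is a separate axis: raw counts treat every word alike, tf-idf
# down-weights words common across documents, and PMI scores how much more
# often a word-context pair co-occurs than chance would predict.
```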

chentian418 commented 2 years ago

I have two questions:

  1. Is there any evidence of feedback effects of ML techniques on the social games or actors they analyze? For example, the ubiquity of online communication, automated speech-to-text translation, and mobile sensors has made these traces available for a much wider range of social games and players, while social players who are aware of the traces might act against the results of these technologies and therefore change their social actions.
  2. You mention in the article that data mining has gained a bad reputation in the social sciences, since many see it as synonymous with the practice of algorithmically sifting through data for associations and then falsely reporting them as if they were confirmations of theoretically inspired, single-test hypotheses. I am curious how we determine whether the results from data mining and ML techniques are real associations, and how we link those results to confirmed statistical hypotheses? Thank you!

97seshu commented 2 years ago

Hi all,

My question is:

With a bunch of available ML models to choose from, some models can produce more desirable trends than others (even the same model can produce different outcomes depending on the hyperparameters tuned: "Different clustering rules produce different clusters, which in turn reveal different social games from the data"). How can we possibly overcome the bias whereby researchers only select models/hyperparameters that create outcomes that fulfill their beliefs?
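
One concrete guard (my sketch, not from the reading): fix the candidate hyperparameter grid in advance and report a selection criterion such as the silhouette score across all of it, rather than showing only the clustering that matches one's priors. Sketch with scikit-learn:

```python
# The same documents cluster differently as k changes; reporting a criterion
# over a pre-committed grid makes the choice auditable.
# Requires: pip install scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:500]
X = TfidfVectorizer(max_features=2000, stop_words="english").fit_transform(docs)

for k in (2, 5, 10, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k:>2}  silhouette={silhouette_score(X, labels):.3f}")
```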

Thanks,

Hongkai040 commented 2 years ago

This is a great review article, and I think most parts of the paper are clear, so I understand what it conveys. However, two things confuse me. The first is about extrapolation. From my understanding, techniques like ML perform poorly at extrapolation, so what does 'extrapolation' mean when models extrapolate to the full data in Figure 2? Another question is about the three types of text analysis. They are categorized according to 'the depth of inferences they make about the social world'. Does this also mean that the analysis gets harder as the depth of inference increases?

kelseywu99 commented 2 years ago

The article gives a comprehensive review of several computational approaches to text analysis and covers the fundamentals of the course. Some of the methods can be extended to research in digital humanities; as noted at the end of page 34, previous research has done genre classification on archives and structured summaries of scientific literature. I am curious: what are some other cases where tools and methods for text analysis may be applied to archival study as well?

hsinkengling commented 2 years ago

The article mentioned that most of the research using text analysis methods is done by computer scientists rather than sociologists. What do you think are the biggest obstacles for sociologists coming into text analysis? (Or what do you anticipate will be most difficult about this course for students coming from the social sciences?)

sudhamshow commented 2 years ago

The reading provides us with plenty of references to previous research that has used text mining to understand social theory and analyse social games through qualitatively coded text data. However, with newer and more creative means of interaction today (through GIFs, memes, and stickers), I believe measuring true intent and meaning will be difficult unless the learning method recognizes the context in which the communication was made. A paragraph on p. 5 suggests that with greater volume and variety of data, learning models can now be made to generalise across different contexts. Is it possible, with the current state of data mining and natural language processing, to infer the intended context and the actual meaning behind such messages with missing, or without any, prior expert knowledge?

weeyanghello commented 2 years ago

My training is in semiotic anthropology and sociolinguistics, and it has been immensely interesting getting to understand communication from the computational linguistics perspective. I have some immediate questions after reading this article. The first has to do with the interesting conundrum of human consciousness. Where does human consciousness arise, or how does computational linguistics afford a different perspective on human consciousness, given that much of the general social theory produced through machine learning/mining presumes social patterns that somehow lie beyond human self-consciousness, such as noticing a "previously unnoticed genre of partisan taunting" (p. 32)? In these cases, to whom are these sociolinguistic phenomena "unnoticed": the scholars analyzing the data, or the officeholders performing the action who are not aware that they are performing the genre? A second question I have regards the ontological status of the social theory formed from the conclusions of data mining: is it treated as a simple reflection of some social reality "out there" in our lived world, i.e., uncovering/discovering some fact of the world that we were previously unaware of? Or is it treated as a set of provisional interpretations that are locally motivated and mutable? If it's the latter, what factors might change the (meta-)interpretation of machine-mediated analysis of some set of data?