UChicago-CCA-2021 / Readings-Responses


Measuring Meaning & Computational Introduction - Orientation #1

Open HyunkuKwon opened 3 years ago

HyunkuKwon commented 3 years ago

Post questions here for this week's orienting readings:

Evans, James and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory” Annual Review of Sociology 42:21-50.

william-wei-zhu commented 3 years ago

Can computational methods detect irony in text?

jacyanthis commented 3 years ago

Generalizing @william-wei-zhu, what are the current limits of measuring meaning through computation? Can we capture semantics more complicated than keywords and their proximity in text? Attempts such as classifiers based on hand-coded samples and transformers with intermediate representations of complex features still seem much more limited than human reading.

hesongrun commented 3 years ago

Thanks for the wonderful review paper! Figure 3 is very intuitive. I am wondering if there are guidelines for choosing the representation of texts and NLP models in a weak-data scenario? I come from a finance background, and in our field textual analysis is mainly a dictionary-based counting method, e.g. Tetlock (2007) and Loughran and McDonald (2011). One basic reason is that the financial market has a very low signal-to-noise ratio, and a dictionary is a convenient way to incorporate strong and reliable prior information. However, this approach is very limited, and the choice of dictionary can be ad hoc. In this low signal-to-noise context, are there good unsupervised methods that allow for summarizing information from texts in a parsimonious way for later descriptive or causal analysis? Thanks!
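For concreteness, the dictionary-based counting approach can be sketched in a few lines. The word lists below are tiny made-up stand-ins, not the actual Loughran-McDonald dictionaries, and the "tone" measure is just one common formulation (net negative word share):

```python
# Minimal sketch of dictionary-based sentiment counting.
# NEGATIVE/POSITIVE are illustrative stand-ins, not real LM word lists.
import re

NEGATIVE = {"loss", "decline", "litigation", "impairment"}
POSITIVE = {"growth", "profit", "improvement"}

def tone(text):
    """Net negative tone: (negative - positive) word count over total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    neg = sum(t in NEGATIVE for t in tokens)
    pos = sum(t in POSITIVE for t in tokens)
    return (neg - pos) / len(tokens)

print(tone("Revenue growth offset the litigation loss"))
```

The appeal is exactly what the comment notes: the dictionary is a strong prior, so the measure is transparent and robust to noise, at the cost of missing anything outside the word lists.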

toecn commented 3 years ago

Two sets of questions: 1) Evans and Aceves (2016) wrote their piece about five years ago; what are the most exciting advances in computational research since then, and what are the major challenges computational work faces today? 2) As described by the same authors, Nelson's (2015) research combines unsupervised topic modeling with qualitative work in interesting ways; what have been, and/or could be, other interesting ways of combining computational methods with qualitative work, more concretely with field research (interviews, ethnography, ...)?

Raychanan commented 3 years ago

As we expected, text mining and computational content analysis in many ways far exceed traditional qualitative analysis. Even more, computer-assisted text analysis directly enables capabilities that traditional analysis does not have. I think the review of traditional qualitative analysis, and the praise for the latest computational text analysis near the end of this paper, prove this point.

My question, however, is whether there are things (such as ways of thinking, ideas, etc.) that computational content analysis cannot distill and generalize. Does traditional qualitative analysis of texts have strengths that computational methods cannot match? For example, the more abstract and subtle kinds of logical relationships between texts.

@william-wei-zhu I've searched for this question before, and from what I understand, it is indeed an important one that researchers are dealing with. One thing I find interesting is that sometimes even humans themselves can't tell if certain words are sarcasm or not. I think this problem is further complicated when we think about the fact that machine learning is built on a dataset provided by humans.

RobertoBarrosoLuque commented 3 years ago

The reading discusses different methods/approaches to document clustering: classic algorithms such as k-means, more sophisticated ones such as DBSCAN, soft clustering methods such as LDA, etc. When optimizing and evaluating a chosen model, how should one reconcile the gap between the different clustering metrics (homogeneity score, silhouette coefficient, coherence measures for topic models, etc.) and human interpretability? In a research setting, what approach should a researcher follow to draw insights and conclusions from the results of such algorithms?
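To make one of these metrics concrete, here is a hand-rolled silhouette coefficient on toy one-dimensional "document embeddings" (the points and labels are invented for illustration; in practice one would call `sklearn.metrics.silhouette_score` on real feature vectors):

```python
# Silhouette coefficient, computed by hand on toy 1-D points.
# For each point: a = mean distance to its own cluster (which must have
# at least 2 points), b = mean distance to the other cluster(s);
# the silhouette is (b - a) / max(a, b), averaged over all points.

def silhouette(points, labels):
    def dist(a, b):
        return abs(a - b)
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        same = [dist(p, q) for j, (q, m) in enumerate(zip(points, labels))
                if m == l and j != i]
        other = [dist(p, q) for q, m in zip(points, labels) if m != l]
        a = sum(same) / len(same)
        b = sum(other) / len(other)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]   # two well-separated clusters
lbl = [0, 0, 0, 1, 1, 1]
print(silhouette(pts, lbl))  # close to 1: tight, well-separated clusters
```

A high silhouette only says the geometry is clean; it says nothing about whether the clusters map onto interpretable categories, which is exactly the gap the question points at.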

theoevans1 commented 3 years ago

I'm curious about the works mentioned that look at language changes in online content over time. Given that pages can be continually updated or edited, in what ways can time be incorporated as a variable?

egemenpamukcu commented 3 years ago

With the ubiquity of textual content in the digital space and the increasing number of machines scraping through this content, I think it is interesting to consider: can the mainstream audience of written work shift from humans to machines? Should humans in the future, while producing written work (academic publications, blog posts, news articles, social media posts, etc.), be thinking of machines when considering the "readability" of a text, making it more interpretable for algorithms? New and more advanced algorithms are being built to extract the true context and meaning of written content with some degree of accuracy. Perhaps to increase that accuracy, and considering a text may indeed be read more by machines than by actual humans, changing our style of writing to better fit NLP algorithms may help with the diffusion of an idea or an opinion.

This may sound a bit far-fetched and perhaps grim, but it is interesting to think about.

yushiouwillylin commented 3 years ago

When researchers analyze text data, do they take strategic behavior (lying, distorting facts etc.) into account?

For example, many papers have found cohesiveness in the language of same-party politicians. Is this the end of the line? Are there methods that dig into whether this is a deliberate framing strategy chosen by politicians, or a phenomenon that appears unconsciously throughout the democratic process?

Essentially I guess the question is, do researchers just take the text as given, or do they take some with a grain of salt and try to detect strategic texts?

medcar8879 commented 3 years ago

Related to Figure 3 (unsupervised methods): is one of these methods best-suited for tracing public sentiment across time?

chiayunc commented 3 years ago

In the article, we see a lot of past research using millions of books/texts/articles to illustrate discursive change. Does computational content analysis always rely on corpora that large? Are there examples where a limited number of texts are used? I guess in that case it would be hard to come up with meaningful word embeddings, and perhaps a network-like approach would be preferred?

xxicheng commented 3 years ago

What are the differences between supervised and unsupervised learning?

A short thought on @william-wei-zhu and @jacyanthis's question about irony detection with computational methods: the detection ability depends on the ability to "understand" the context. In this article, the authors proposed a supervised learning method using irony labels enriched with knowledge transferred from external sentiment corpora. They found the results produced by their algorithm to be even more accurate than those of human beings. Some other methods, such as text categorization relying on explicit expressions to detect context incongruity, are also mentioned.
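On the supervised/unsupervised question, a toy contrast may help. Both functions below are deliberately minimal sketches on invented one-dimensional "features": the supervised one fits a rule from given labels, the unsupervised one discovers groups without ever seeing labels.

```python
# Supervised vs. unsupervised in miniature, on 1-D toy features.

def supervised_fit(xs, ys):
    """Nearest-class-mean classifier learned from labeled examples."""
    means = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(pts) / len(pts)
    # Predict the label whose class mean is closest to x.
    return lambda x: min(means, key=lambda l: abs(x - means[l]))

def unsupervised_split(xs):
    """Group points by thresholding at the midpoint of the range; no labels used."""
    mid = (min(xs) + max(xs)) / 2
    return [int(x > mid) for x in xs]

xs = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
ys = ["neg", "neg", "neg", "pos", "pos", "pos"]
clf = supervised_fit(xs, ys)          # uses ys (supervised)
groups = unsupervised_split(xs)       # ignores ys (unsupervised)
print(clf(1.1), groups)
```

The practical difference: supervised methods need (often hand-coded) labels and answer a question you chose in advance; unsupervised methods need no labels but leave you to interpret whatever structure they find.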

jinfei1125 commented 3 years ago

I guess my question and concern fall under the stereotype mentioned in the article that "data mining has a bad reputation" among social scientists. When we study data scraped from the internet, for example from Reddit, how can we mitigate the impact of confounding variables? We don't know the characteristics or demographics of the users who typed these words, and because we can't control those variables, our research can be biased. From last quarter's MACSS Perspectives course reading, I remember the saying: "There is no free lunch: if you don't spend a lot of effort collecting data, you need to spend a lot of effort cleaning data." For data scraped from social media like Reddit or Twitter, are there common methods to clean the data and prepare it for social research?
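On the cleaning side, a minimal, assumption-laden sketch of the most common first steps for scraped social-media text (the function name and regexes here are illustrative; real pipelines add deduplication, bot filtering, language identification, etc.):

```python
# Basic cleaning of scraped social-media text: strip URLs, @mentions,
# hashtags, and leftover HTML entities, then normalize whitespace/case.
import re

def clean_post(text):
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)          # remove mentions/hashtags
    text = re.sub(r"&amp;|&lt;|&gt;", " ", text)  # remove stray HTML entities
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return text.strip().lower()

print(clean_post("Check this out @user1 https://t.co/abc &amp; more #nlp"))
```

Note that cleaning only addresses the text itself; it does nothing about the confounding problem, since the unknown user demographics remain unknown after cleaning.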

keiraou commented 3 years ago

Thank you for sharing the reading materials. I am wondering: What are the major difficulties in developing models for higher-level linguistic discourse, such as paragraph or article analysis? What are the major recent improvements?

jcvotava commented 3 years ago

I'm wondering whether there are any linguistic, cultural, or epistemic constraints inherent to "text" as a form to study. For instance, the paper mentioned that spoken language/phonology can differ in many meaningful ways from written text. Has there been work or theorization done to describe what relation different kinds of written text have to actual psychological states? (Obviously, this is somewhat dependent on the text in question; for instance, a corpus of emails should be understood as a certain kind of communication carrying certain latent socio-cultural linguistic connotations. What I'm trying to get at is maybe one level more abstract than that - like whether there are any epistemic issues per se to the analysis of text as a linguistic form.)

chuqingzhao commented 3 years ago

Thank you for sharing! It is an interesting and informative article that helps us think about computational content analysis and social science research as a whole. I have a few questions below:

  1. I am curious whether it is possible to combine other kinds of data (like images and videos) with text data. Given that images and text are interrelated in certain settings, such as Instagram, I suppose that if social scientists could look into the relationship between texts and other data, it would generate richer social information.
  2. I am wondering how deeply computational methods can process and understand text today. Communication does not consist only of verbal information; much of it comes from non-verbal interaction. Can machines understand sophisticated rhetorical devices like metaphor in text? Sharing the concerns of @Raychanan and @william-wei-zhu, how can we use computational techniques to understand social games through such underlying information?
  3. The paper mentions that one limitation of sociological studies is that "most NLP resources are only available for high-resource languages, and not for low-resource languages." I am wondering how researchers try to solve this challenge nowadays. Instead of manually building corpora for low-resource languages, what kinds of computational methods can be applied?

k-partha commented 3 years ago

Much of the complexity in language and social games is only understood in light of (largely hidden) common cognitive rules/psychological drives that we share but perhaps cannot always directly express through text. Do you find word embeddings particularly prone to failing to account for this source of meaning? Do you see unsupervised methods as a practical/feasible strategy to uncover distinct cognitive drives underlying language use?

romanticmonkey commented 3 years ago

Since we are talking about NLP and parsing methods, I wonder what everyone's view is on semantic parsing. I heard from my NLP professor that the development of semantic parsing methods was somewhat thwarted by the rise of neural networks. Does anyone think this is still a field worth investigating?

A perhaps useful add-on to @william-wei-zhu 's question: There are a number of studies working on satirical news, identifying them and pulling them apart from fake news. E.g. this one: Ravi, K., & Ravi, V. (2017). A novel automatic satire and irony detection using ensembled feature selection and data mining. Knowledge-based systems, 120, 15-33. Perhaps this line of research can shed some light on the irony detection problem in general.

Bin-ary-Li commented 3 years ago

Speaking from the perspective of a psycholinguist, I wonder what kind of a role sociologists should play in the application of NLP/ML tools in textual content analysis. I can think of two ways that sociologists can make use of the NLP tools. One is to apply them as-is---the role of a user; the other is to interrogate them and their assumptions---the role of an examiner.

Many models can be easy to implement once the analysis pipeline becomes mature. Running them can be as simple as clicking buttons, calling functions, and adjusting hyperparameters. But should sociologists be content with just running the pipeline and reporting some accuracy values? For example, linguistic pragmatics and textual analysis with theory of mind are areas where NLP has failed to make any big progress. Should sociologists be more devoted to helping CS people solve that?

ming-cui commented 3 years ago

I am wondering if computational content analysis can be used to design and run large-scale experiments. Is it possible to divide groups into treatment and control based on the results of content analysis?

zshibing1 commented 3 years ago

To what extent is the content analysis sensitive to the language being analyzed? In other words, what would be different if one wants to analyze texts in, for example, simplified Chinese or French?

sabinahartnett commented 3 years ago

I am wondering what it means to implement any of these mechanisms/methods on a corpus of text where the researcher does not know exactly what is human created and what may have been created by a machine itself (reproduction via AI on click bait news sites or social media bots)? Similar to @egemenpamukcu 's question, is it possible that a topic modeling algorithm may sort machine-created content differently?

dtmlinh commented 3 years ago

I think maybe I have 2 questions: 1) Is there an issue of "missing" data with NLP? And relatedly, how can sociologists use NLP to analyze issues such as censorship? 2) The paper mentioned "high-resource languages" and "low-resource languages" and brought up the fact that linguistic patterns can differ across languages. I'm curious if you have an interesting example where a researcher explored the same research question on the original text and the translated text and reached the same or different findings.

Rui-echo-Pan commented 3 years ago

I am very interested in social relationships through the process of communication. I wonder how the external validity of findings about social relationships, especially power, is assessed, as the specific corpus we use is normally focused on a small group of people.

acozzubo commented 3 years ago

I liked how the paper does a general overview of the topic without losing touch with the sociological component. My question here is: what would be the recommendations for an NLP analyst to learn from linguistic theory? Is there any set of topics (from semantics, phonology, etc.) that an NLP analyst should at least be aware of before starting the analysis? I would say semantics is a must, but I do not have an informed opinion (and would like to form one with the course content!)

jetienne6 commented 3 years ago

When collecting and analyzing social media data, how are ethics evaluated in this research? Are the users considered "participants"? How are they anonymized? Is it enough and does this pose challenges for interpreting social meaning?

MOTOKU666 commented 3 years ago

I'm wondering how to deal with confidentiality issues in content analysis. For example, if we are dealing with a sexual-education-related problem and collecting information in a small county (while it is still big enough for us to perform this research), is it necessary to "blur" some information prior to the research to protect the participants? In other words, would ethical issues be a problem here?

YijingZhang-98 commented 3 years ago

  1. Each person's speaking or writing style is different; how can we get general results from content analysis?
  2. I was wondering whether we can apply content analysis to detect people's mental health conditions. Would that raise ethical problems?

dtanoglidis commented 3 years ago

Reading this interesting article on the shift from "manual" to "automated" content analysis, I was wondering: has anyone estimated (using information theory or something similar) the amount of information present in textual data, otherwise available to a human researcher, that we lose when automated/computational content analysis techniques are used?

vfuentesc commented 3 years ago

The paper was really helpful for getting a broad overview of the course. As for my question: I was curious how NLP techniques deal with jargon, typos, non-word errors, short forms, etc., and how this varies across languages.

mingtao-gao commented 3 years ago

My question is also related to scraping data from social media. What procedures should we take to protect users' data privacy while still getting enough data for our research?