Computational-Content-Analysis-2020 / Readings-Responses-Spring

Repository for organizing orienting, exemplary, and fundamental readings, and posting responses.

Sampling, Crowd-Sourcing & Reliability - (E2) Dodds et al 2015 #8

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

Post questions here about the following exemplary reading:

  1. Dodds, Peter Sheridan, et al. 2015. “Human language reveals a universal positivity bias.” Proceedings of the National Academy of Sciences 112(8):2389–2394. doi: 10.1073/pnas.1411678112
wanitchayap commented 4 years ago

The authors chose 10 languages for their content analysis and concluded that natural language shows a universal positivity bias. However, these 10 languages are all common languages: they are spoken by many people, they are well preserved and have developed over time, they have written forms, etc. I guess this is an example of a convenience sample, which I think is appropriate considering the cost of collecting corpora for rarer languages. However, I think it is possible that some rarer languages (with far fewer speakers, without written forms, etc.) might be different. It could be that this bias really is universal, but I don't think we can be so sure. For example, we used to think that having a number system was natural, but it turns out that some rarer languages have no number system, or have a number system without a concept of infinity. I think this is a great paper, but I don't think the authors can claim their results are truly universal.

(Adding some thoughts after reading @iarakshana and @nwrim) Do you think using the Stanford Sentiment Treebank could account for the word-level problems? I am still not sure, though, how to incorporate the aggregated sentiments at the internal nodes with the leaf nodes (the leaves being roughly the word-level scores that this paper accounts for).
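To make my confusion concrete, here is a toy sketch in Python (not the actual Treebank API; the node scores are made up for illustration) contrasting the word-level averaging this paper does with reading off the composed sentiment at a parse-tree root:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    score: float                       # hypothetical sentiment in [-1, 1]
    children: List["Node"] = field(default_factory=list)

def leaf_average(node: Node) -> float:
    """Bag-of-words style: average the leaf (word-level) scores only."""
    leaves, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.children:
            stack.extend(n.children)
        else:
            leaves.append(n.score)
    return sum(leaves) / len(leaves)

def root_score(node: Node) -> float:
    """Treebank style: the root node carries the composed sentiment."""
    return node.score

# A phrase like "not good": the leaves look neutral-to-positive,
# but the composition at the internal node flips the sentiment.
tree = Node(score=-0.7, children=[Node(score=0.0), Node(score=0.8)])
print(leaf_average(tree))  # 0.4  -> word-level view says mildly positive
print(root_score(tree))    # -0.7 -> composed view says negative
```

My question is essentially which of these two numbers (or what mixture of them) we should trust.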

iarakshana commented 4 years ago

For some reason, I am not very convinced by the paper and its findings, to be honest. They are trying to confirm the Pollyanna hypothesis with data, but they rely on frequency counts of words, and a word by itself may not convey what it conveys when read as part of a complete sentence. Probably a silly example, but if they were looking at a TV corpus and analyzing Friends, Chandler saying 'I'm sooo happy right now' probably shouldn't fall into the positive bucket, given the context. This is also one of the things I was considering when doing the homework: part-of-speech tagging and NER/dependency parsing are supposed to help with context, but it still feels like you would miss a lot of tone and context.
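To illustrate, here is a minimal sketch of word-level scoring in the spirit of the paper's method; the happiness scores below are invented for the example, not the published labMT values:

```python
# Toy word-happiness lexicon on a 1-9 scale (5 = neutral); values invented.
happiness = {"i'm": 5.2, "so": 5.0, "happy": 8.3, "right": 5.6, "now": 5.3}

def sentence_happiness(text: str) -> float:
    """Average the per-word scores; out-of-lexicon words are skipped."""
    words = [w for w in text.lower().split() if w in happiness]
    return sum(happiness[w] for w in words) / len(words)

# Chandler's sarcastic line scores well above neutral (about 6.1),
# because word-level scores carry no information about tone or context.
print(sentence_happiness("I'm sooo happy right now"))
```

No matter how sarcastic the delivery, the word-level score comes out positive.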

nwrim commented 4 years ago

Although I do agree that words might have some sort of positivity bias, and I think this is a very valuable tool for large-scale sentiment analysis, I do not think what they discovered can be generalized to language as a whole, as they claim (or at least imply). Let's think in a metaphorical space, since they use the metaphor of

words, which are the atoms of human language

Can we know how the world operates just by inspecting atoms for a really long time? Say we inspect every single atom in the universe and find that their average weight is 1 amu, or 1.66 x 10^(-24) grams. Can we then claim that the universe overall has a "light-weight bias"? As @iarakshana mentions, examining words outside their context probably would not give us a good understanding of what a communication actually contained.

That being said, my question is: how can we account for this issue if we want to train a model to predict sentiment using these data? What linguistic models and structures can we incorporate into the prediction models to better predict sentiment?

tianyueniu commented 4 years ago

This paper provides a really interesting perspective. My question is: would there be a difference between language used 'in public' (e.g., TV, books, tweets, music meant to be displayed publicly) and language used 'in private'? Culturally speaking, I do think people tend to express themselves more positively in public and more negatively when communicating with friends or close family members. In that case, what is an effective way to include 'private' language in research in order to truly examine biases in human language?

linghui-wu commented 4 years ago

I feel that the authors draw the conclusion that "the words of natural human language possess a universal positivity bias" too hastily, since the 24 corpora are not an unbiased sample of "human language". Apart from the different usage of language in public and in private mentioned by @tianyueniu, there are still gaps between spoken and written language, even though the research incorporates a Twitter corpus to address the problem. This distinction is significant because studying written forms does not necessarily reveal the "structure of language", which they claim when they write that "such a data-driven approach is crucial for both understanding the structure of language". I doubt the generalizability of the conclusions.

harryx113 commented 4 years ago

I think the relativity of the positivity bias is important and should be taken into account. The authors work extensively with an absolute positive-negative scale, but for the reasons pointed out by @tianyueniu and @linghui-wu, is it necessary to consider this bias on a relative scale instead?

Lesopil commented 4 years ago

This is an interesting article, but like several of the other commenters I find the results problematic. Looking at the sample, it seems strange to me that they chose to analyze only the most frequent words. My immediate question is this: are there more word types describing negative emotions than positive ones? If so, negative sentiment would be spread across more, individually rarer words, so words with positive connotations would be overrepresented among the most frequent words, while words with negative connotations would be underrepresented.

Yilun0221 commented 4 years ago

An interesting point is that this paper examines 10 languages, which is more convincing than exploring just one. However, languages develop over time, and I wonder whether this paper pays attention to changes in these 10 languages across different time periods. For example, ancient Chinese and modern Chinese are very different. Do they only use modern Chinese? In addition, I also have doubts about the selection of the 24 corpora, as @tianyueniu and @linghui-wu have said.

lyl010 commented 4 years ago

I am impressed by Fig. 3 in this work, which illustrates that words at different usage frequencies have similar distributions over happiness; it is interesting to see this pattern. But I have a question about measuring average happiness using only single words: would the pattern change if we measured it with n-grams?
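For reference, my understanding of the paper's word-level measure is a frequency-weighted average over word types (written out here in my own notation to make the question concrete):

```latex
% Average happiness of a text T over word types w_i with frequencies f_i(T);
% the hedonometer-style formula as I understand it, in my own notation.
h_{\mathrm{avg}}(T) = \frac{\sum_{i=1}^{N} h_{\mathrm{avg}}(w_i)\, f_i(T)}{\sum_{i=1}^{N} f_i(T)}
```

An n-gram version would replace the word types w_i with phrases, which would require crowd-sourced happiness ratings at the phrase level; phrases like "not good" compose in ways single words cannot, so the distribution could plausibly shift.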

shiyipeng70 commented 4 years ago

As the Pollyanna hypothesis states, "there is a universal human tendency to use evaluatively positive words more frequently, diversely and facilely". This article only demonstrates the frequency of positive word use, so it may be too hasty in drawing a conclusion. Second, when we look at the differences in the distribution of positive words among these languages, would the distribution of negative words lead to the same ranking? As far as I know, Chinese and Korean do not have as many salient positive words as French and English, but they do not have many negative words either.

bazirou commented 4 years ago

I have a question about measuring average happiness using only single words: would the pattern change if it were measured with n-grams?