Computational-Content-Analysis-2020 / Readings-Responses-Spring

Repository for organizing orienting, exemplary, and fundamental readings, and for posting responses.

Measuring Meaning & Counting Words - Orientation #1


jamesallenevans commented 4 years ago

Post questions here for one or both of the week's orienting readings:

Evans, James and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory” Annual Review of Sociology 42:21-50. DOI: 10.1146/annurev-soc-081715-074206

Michel, Jean-Baptiste et al. 2010. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science Express, December 16.

wanitchayap commented 4 years ago

Michel et al. 2010

  1. The authors say, "We restricted n to 5 and limited our study to n-grams occurring at least 40 times in the corpus." I can see that this causes no problem in English, because there are so many English resources in the corpus. However, wouldn't this criterion lead to very sparse eligible n-grams in a language like Hebrew, which they report has only 2 billion words in the corpus? Wouldn't it be better to adjust the criterion for different languages? Or would doing so limit the ability to compare across languages? (A small sketch of this kind of threshold filtering follows after this list.)

  2. I find the result that the English lexicon doubled in 50 years very fascinating. However, I also find it hard to believe. Could this increase come from an increase in writers and genres? Some words may already have been out there in verbal communication; they were just not recorded. In addition, there could be other confounds, such as how many text records from the distant past survive compared with those from the more recent past. In short, I am not yet convinced by this result.

  3. Regarding the change in regular/irregular verbs, the authors say there was no characteristic shape to the change. I think it would be interesting to combine this textual change with phonetic characteristics, since phonetics could be a factor as well (according to the linguistic literature on overregularization in young children).
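
A minimal sketch of the threshold filtering asked about in the first question: counting n-grams and dropping those below a minimum count. The toy corpus and the cutoff of 2 are assumptions for illustration; the paper's cutoff is 40 over a vastly larger corpus.

```python
from collections import Counter

def ngram_counts(tokens, n, min_count=2):
    """Count n-grams, keeping only those seen at least min_count times
    (loosely mirroring the paper's >= 40 occurrences filter)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return {g: c for g, c in counts.items() if c >= min_count}

tokens = "the cat sat on the mat and the cat slept".split()
print(ngram_counts(tokens, 2))  # {'the cat': 2}; all one-off bigrams are dropped
```

With a fixed absolute cutoff, a corpus roughly 180 times smaller (Hebrew's 2 billion words vs. English's 361 billion) loses proportionally far more n-grams, which is exactly the sparsity worry raised above.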

DSharm commented 4 years ago

From "Machine Translation: Mining Text for Social Theory"

This may be a rather basic question, but I was wondering what tools/methods/corpora are available to researchers for analyzing text from Facebook messages, tweets, texts, emails, etc. That is, we saw in Figure 1 (the sentence on Trayvon Martin's shooting) that we have tools to tag parts of speech, but that assumes proper English grammar/spelling, right? In cases where we have chat-message language (e.g., "lol", "idk", "i dunno", swear words), what tools can be used to glean meaning? I'm sure it's doable and has been done, since the reading mentions many such examples, but I wasn't sure how one might practically go about it.
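
One concrete starting point: NLTK ships a TweetTokenizer built for exactly this kind of informal text. A minimal sketch (the example message is invented):

```python
# pip install nltk
from nltk.tokenize import TweetTokenizer

# reduce_len collapses character floods ("soooo" -> "sooo");
# strip_handles drops @mentions; emoticons survive as single tokens
tk = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

print(tk.tokenize("@user lol idk... i dunno, this is soooo broken :-("))
# roughly -> ['lol', 'idk', '...', 'i', 'dunno', ',', 'this', 'is', 'sooo', 'broken', ':-(']
```

For tagging rather than just tokenizing, there are POS taggers trained specifically on social-media text (the CMU ARK TweetNLP tagger is one well-known example), which handle items like "idk" far better than taggers trained on newswire.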

nwrim commented 4 years ago

For Michel et al. (2010):

  1. I think this kind of rise/decline explanation of a time series can be quite subjective. For example, in Fig. 1b (blue vertical line added by me; sorry for the messy matplotlib image) there is clearly an increase that looks like the one the authors attribute to the civil rights movement, but it is not mentioned in the article. I wonder if there is a more objective way (for lack of a better word) to detect significant changes in a time series like this. (See the change-point sketch at the end of this comment.)

  2. When counting a person's name, as this article does, I think it may be unfair to some people to use the full name (Nak Won Rim) rather than the last name (Rim). When I mention Claude Levi-Strauss, I never say his first name out loud; I just say Levi-Strauss because the full name becomes too wordy (an equivalent example: we sometimes refer to Gabriel Garcia Marquez as Gabriel Marquez even though Garcia is part of his surname). Also, I think there is a paradox: if a figure is connected to a name very closely and the last name becomes an index for that person (e.g., when I say Kant, I am almost certainly referring to Immanuel Kant; the same goes for Foucault, Shakespeare, Marx, Hitler, and so on), then that person's first name is mentioned less often than before. On the other hand, Professor Andrew Ng's name would pop up in the corpus long before he was born if we used only the last name. I am wondering if there is a clever workaround for this.

  3. I am curious why the authors mention that

over 500 billion words, in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion), and Hebrew (2 billion) (p.176)

but later have to estimate the proportion of English 1-grams in the corpus. Is there a nuance I am missing about why the authors could report the proportion of each language in the overall corpus but had to estimate the proportion of 1-grams?

  4. The authors define usage frequency as

dividing the number of n-gram in a given year by the total number of words in the corpus in that year (p.176).

I was wondering why the authors chose to divide by the total number of words rather than by the total count of corresponding n-grams in the corpus that year. Wouldn't using the word count uniformly for all frequencies be problematic when we compare the frequencies of n-grams with different n?
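
On the first question above, one way to make "a significant change" less subjective is a change-point detector. A crude sketch on synthetic data (the window size and the toy series are assumptions; a real analysis might use a dedicated package such as ruptures):

```python
import numpy as np

def strongest_change_point(series, w=5):
    """Index where the mean of the next w values differs most from
    the mean of the previous w values (a crude CUSUM-style detector)."""
    scores = [abs(series[i:i + w].mean() - series[i - w:i].mean())
              for i in range(w, len(series) - w)]
    return w + int(np.argmax(scores))

# toy usage-frequency series with a jump planted at index 30
rng = np.random.default_rng(0)
freq = np.concatenate([rng.normal(1.0, 0.1, 30), rng.normal(2.0, 0.1, 30)])
print(strongest_change_point(freq))  # ~30
```

Applied to an n-gram frequency series, peaks in this score would flag candidate years (such as the one at the blue line) for closer reading, instead of relying on eyeballing alone.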

WMhYang commented 4 years ago

Evans and Aceves 2016

It is mentioned that unsupervised methods are often used to discover new categories or properties in text. However, analysts often need to sample, peruse, and critically interpret the results of these unsupervised methods. My basic question is: to what extent should we trust and rely on the results of unsupervised methods? Moreover, how can we be confident that the model is actually showing us new patterns in the content?

timqzhang commented 4 years ago

For "Machine Translation: Mining Text for Social Theory"

  1. My basic question is about making social inferences from communication texts, namely in Figure 4 of the paper. I wonder whether communication texts with special patterns of language, i.e., slang or dialects used in a certain region or only among a small group of people, can be identified and processed with their true meanings. Such a case may occur when social actors set up their own "secret" expressions, which look similar to common expressions but carry their own meanings. I am not doubting the learning ability of machine learning, but there is a chance that the emotions/attitudes/stances get read wrongly, which would lead to misleading conclusions about social inference. Are there any methods/tools that focus on this scenario? I am hoping for powerful tools that can accurately make inferences from such texts.

  2. It is mentioned that data mining has acquired a bad reputation in social science. Indeed, many doubts have been raised about concerns such as overfitting or potential p-hacking. While I strongly agree that computational content analysis should and will contribute a lot to exploring sociological topics given today's explosion of text, what improvements in methods/algorithms have been made, in general, to tackle problems like overfitting, so as to justify the models and convince people?

iarakshana commented 4 years ago

For "Jean-Baptiste et al (2010)" :

The authors discuss censorship to a certain extent when they reveal it by comparing mentions in German and English texts, as with Marc Chagall, for example. However, I think there is an element of censorship they cannot possibly cover within a single language, given changes in history and potential acts of suppression. It is likely that certain individuals were completely erased from history, and since this cannot be corroborated by comparison with other languages' records, I don't think this analysis can reconcile it.

Focusing solely on print media is also a potential issue, though they do acknowledge that they should include other artworks and newspapers, for example, which might be more indicative of the culture of the time.

mattconklin commented 4 years ago

Michel et al. 2010

The authors show that quantitative content analysis methods can be used to detect censorship and repression. This is done by comparing the prevalence of author names in English-language books with that in German-language books during the Nazi regime. The results provide a clear example of how the effects of government censorship can be identified in comparative historical contexts.

Building on this example, my question relates to this phenomenon in reverse. That is, can the same method be adapted to identify the effects of government propaganda? For example, following WWII, America’s education and government officials worked to foster civic values and a sense of “international mindedness” among the public, chiefly through reform of education curricula. To trace the impact of such reforms, it would be interesting to compare the prevalence of certain “pro-government policy” terms in textbooks and other relevant sources with their prevalence in the preceding era. Relatedly, are these tools better suited to studying censorship and/or repression in autocratic regimes than in democratic ones?
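
Mechanically, the proposed comparison reduces to per-era relative frequencies of a hand-picked term list. A toy sketch (the corpora and terms are invented placeholders for digitized textbooks from each era):

```python
from collections import Counter

def rel_freq(tokens, terms):
    """Relative frequency of each term, per million tokens."""
    counts, total = Counter(tokens), len(tokens)
    return {t: counts[t] / total * 1_000_000 for t in terms}

# stand-ins for tokenized pre- and post-reform textbook corpora
pre_reform  = "the state the nation citizens duty loyalty".split() * 100
post_reform = "the nation international cooperation civic duty citizens".split() * 100

terms = ["international", "civic"]
print(rel_freq(pre_reform, terms))   # {'international': 0.0, 'civic': 0.0}
print(rel_freq(post_reform, terms))  # both terms now appear
```

As with the censorship example, the hard part is choosing the term list without baking the conclusion into it.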

tianyueniu commented 4 years ago

For "Machine Translation: Mining Text for Social Theory"

Following @WMhYang's question on unsupervised learning: the paper mentions that "topics produced by a probabilistic topic model estimated on a corpus are scrutinized, then explicated, and hand labeled for easy description and reference". Given that the generated topics may be highly interdependent or highly correlated with one another, what general criteria do people follow to hand-label or interpret the different clusters? Can we rely only on intuition, or should further analysis be conducted to corroborate the results of unsupervised modeling?
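
For concreteness, the artifact that gets hand-labeled is usually a top-words list per topic. A minimal sketch with scikit-learn's LDA on an invented four-document corpus:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the court ruled on the civil rights case",
    "the team won the championship game",
    "voters went to the polls in the election",
    "the striker scored a late goal in the match",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# the top words per topic are the raw material one hand-labels
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```

Beyond intuition, a common corroboration step is to report a topic-coherence score (gensim's CoherenceModel implements several variants) and to check that the labels survive changes in the number of topics and the random seed.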

harryx113 commented 4 years ago

"Machine Translation: Mining Text for Social Theory"

Is there a set of rubrics for choosing ML models in NLP? I had a conversation with an NLP engineer, and he told me that choosing models is an "art" shaped by personal preference, and that most companies would rather use simpler, more interpretable approaches. How does academia view the tradeoff between accuracy and interpretability?

Lesopil commented 4 years ago

For “Machine Translation: Mining Text for Social Theory”

Coming from a historian's background, I have encountered digital humanities and digital history before. From those encounters I understood these to be very contentious fields, with proponents arguing that digital methods are the next stage of historical work and will make all other history obsolete, and opponents questioning the ability of digital techniques to reach any conclusions at all, let alone conclusions rivaling those achieved with analog techniques. This tension is brought up several times in the text (p. 25, for example), but I am wondering how you would justify using digital/computational approaches over the traditional analog methods available to researchers.

linghui-wu commented 4 years ago

For Quantitative Analysis of Culture Using Millions of Digitized Books:

Without a doubt, this is amazing work that broadens my mind about how many aspects content analysis can touch through a corpus of ebooks. My questions are as follows:

  1. The corpus includes over 500 billion words in English, French, Spanish, German, Chinese, Russian, and Hebrew, and the study is restricted to "how often a given 1-gram or n-gram was used over time". Since the authors later compare the frequency of "Tiananmen" and "天安门" to investigate government censorship, I assume the k-gram partition also applies to Chinese text. However, unlike the other six languages, Chinese sentences have no white space between characters, and words of different lengths may correspond to the same k-gram English word. So I am curious: technically speaking, how do researchers split Chinese text into 1-grams and n-grams? (A segmentation sketch follows after this list.)

  2. I totally agree with @nwrim's second point on calculating a person's fame by simply counting his or her full name. In addition, famous people are referred to not only by their real names but also by nicknames, which may happen frequently in foreign media coverage due to translation.
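
On the segmentation question: the comment is right that Chinese has no word boundaries in writing, so a segmenter has to supply them. Whatever the paper's own pipeline was, jieba is one widely used open-source Chinese segmenter, shown here purely as an illustration:

```python
# pip install jieba
import jieba

# jieba segments with a dictionary plus an HMM for out-of-vocabulary words
print(jieba.lcut("天安门广场在北京"))
# roughly -> ['天安门广场', '在', '北京']
```

Once segmented, the resulting word list can feed the same n-gram counting used for space-delimited languages. Note that the choice of segmenter directly changes what counts as a 1-gram, which matters for cross-language comparisons.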

minminfly68 commented 4 years ago

Machine Translation: Mining Text for Social Theory

This paper provides an excellent overview of the methods available in NLP and ML. It reviews current research that employs content analysis methods in social science and stresses the power of content analysis for the field. The article mentions that "distinguishing dialects through phonology can reveal distinct social worlds underlying spoken interaction" (p. 28), which raises my question: can text analysis be used to process messy text (i.e., different dialects, typos, Anglo-Saxon English, hashtags, emojis, etc.)? Is there any clear cutoff for deciding what to process? Can we apply (un)supervised models to such text?

liu431 commented 4 years ago

"Machine Translation: Mining Text for Social Theory" This is a fantastic literature review on text mining for social data. I am wondering what is the limitation of this approach? I am concerned that the digital and public data might be biased and unrepresentative of the population. For example, customer reviews are highly skewed, and the opinions within the reviews might be very different from the survey interviews.

bazirou commented 4 years ago

For Machine Translation: Mining Text for Social Theory:

I wonder whether content analysis can break through the boundaries between languages, that is, whether we can gather texts in different languages and analyze them together.
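
One existing route is multilingual sentence embeddings, which map sentences from many languages into a shared vector space so they can be clustered or compared together. A sketch using the sentence-transformers library (the specific model name is just one of several published multilingual options):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The weather is nice today.",   # English
    "Il fait beau aujourd'hui.",    # French
    "今天天气很好。",                 # Chinese
]

emb = model.encode(sentences)
# high off-diagonal cosine similarity: the three translations land close together
print(util.cos_sim(emb, emb))
```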

ihsiehchi commented 4 years ago

Jean-Baptiste et al. 2010

  1. Elegant variation: I wonder if people get tired of using the same phrase repeatedly and therefore adopt elegant variation. For instance, I've seen many ways to refer to the current COVID-19 situation; off the top of my head, I can think of pandemic, epidemiological crisis, health disaster, etc.

  2. Who are the authors: I was considering what implications the answer to the question "who are the authors?" may have for the observed trends. To be exact, as higher education becomes more accessible, I would expect publishing to become less of an elite activity; what does that say about phenomena such as -ed becoming the dominant rule?

Yilun0221 commented 4 years ago

For Machine Translation: Mining Text for Social Theory:

I am very interested in natural language processing. I think it greatly expands the scope of our research and enables us to better understand human language and thought. My undergraduate thesis used natural-language-processing techniques to analyze the protagonist and storyline of a novel. I look forward to studying this semester! For this paper, I have the following thoughts:

  1. I highly agree with the limitations of text mining mentioned in the paper. In particular, while working on my undergraduate thesis, I found that existing natural language processing technology has great difficulty analyzing text at the literary level; it can extract some information, but it struggles to dissect a literary text in terms of context and literary technique. Do you think there are other reasons, besides the limits of current NLP technology mentioned in the article, why it cannot be extended to the social level? Where should the main breakthrough point be?

  2. As society develops, new words appear all the time. How does current natural language processing technology deal with these new words? Does it mainly rely on supervised machine learning, with new words manually added to the corpus and assigned attributes? (A sketch of one common automatic approach follows below.)
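
One common answer that needs no manual dictionary updates is subword tokenization: a model's vocabulary is fixed to frequent word pieces, and any new word is represented as a sequence of those pieces. A sketch with a pretrained WordPiece tokenizer (the coined word is just an example; the exact split depends on the learned vocabulary):

```python
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# "covidiot" postdates this tokenizer's vocabulary, so it falls back
# to subword pieces instead of a single unknown-word token
print(tok.tokenize("covidiot"))
# e.g. -> ['co', '##vid', '##iot'] (pieces depend on the learned vocabulary)
```

Word-level pipelines, by contrast, do often handle neologisms via the manual route the question describes: adding entries to a lexicon or retraining embeddings on fresher text.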