HyunkuKwon opened this issue 3 years ago
This paper shows me the great potential of culturomics: changes in the number and frequency of English words, changes in grammar, people's collective memory of the past, the time it takes for different groups of people to become famous and then forgotten, censorship and its victims. While reading, several questions arose for me, as follows.
What is the attitude of institutions and lexicographers toward this approach? Specifically, do they actually use this method to guide, or at least inform, their decisions to add (or remove) English words?
What I disagree with most about this paper is the proportion of the total data used, even though it could be called an example of "big data" research. The corpus constructed by the authors contains 4% of all books ever printed, but it is still only 4%. I personally think that is a very small share, so the conclusions drawn on this basis may not be entirely reliable. What do you think about this?
Out of curiosity, I would like to ask one more question; if time does not permit, please ignore it. This method can detect grammatical changes, but what I want to know is how it can help us further understand the reasons behind those changes. For example, what social and historical factors might have led to them? I think this would be very helpful for social science.
Thanks!
Throughout history, people who read, write, and publish books have mostly been the more educated, upper-middle-class subgroup of the population. Hence, the language and culture reflected in books (and in this paper) likely represent only those of that educated subgroup, rather than the entire population in each historical period. How serious is such selection bias? What methodologies can we adopt to measure it? As books become more popularized over time, do we see a decline in this selection bias?
This paper documents some important stylized facts and detects many interesting trends. However, it does not seem to answer some causal questions that may be more interesting to social science researchers.
For example, the authors explain the peak of "slavery" usage around 1860 (the Civil War) by saying that cultural change guides the concepts we discuss. However, the usage of "slavery" in Figure 1B shows a steady rise starting around 1840, so it is not obvious to me which came first: "the concepts we discuss" or the cultural change? Or are they so intertwined that it does not really make sense to distinguish them?
To sum up, I would like to know whether content analysis researchers have found new methods to establish causality, for example via social sentiment toward certain words, or whether researchers still have to rely on some sort of institutional "exogenous variation" to identify causal effects.
This digitized approach to linguistic archaeology strikes me as incredibly powerful. Could we use a similar approach to document the evolution of political parties and political thought by tracking the semantic drift of terms like 'fascism', 'feminism', 'bourgeoisie', etc. over time (perhaps through the use of word embeddings; a rough sketch follows at the end of this comment)? Are there any (relatively) stationary linguistic markers of class that we can leverage to track the evolution of class interests cross-sectionally? Can we predict the evolution or stagnation of movements by plotting graphs of related word/reference use?
These all seem to be potential areas of interest that could be addressed by this approach. More generally, what do you see as the biggest epistemological drawbacks or blind spots of this approach in answering such questions?
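To make the word-embedding idea concrete, here is a minimal Python sketch (my own illustration, not anything from the paper). It assumes a hypothetical `docs_by_decade` dict mapping a decade label to tokenized sentences from that decade, trains a separate gensim Word2Vec model per time slice, and compares a term's nearest neighbours across slices. Comparing neighbour lists rather than raw vectors sidesteps the fact that vectors from independently trained models are not directly comparable without alignment (e.g., orthogonal Procrustes).

```python
# Sketch only: `docs_by_decade` is an assumed input, e.g.
# {"1900s": [["the", "suffrage", "movement", ...], ...], "1960s": [...], ...}
from gensim.models import Word2Vec

def neighbours_by_decade(docs_by_decade, term, topn=10):
    """Train one embedding model per time slice and list the term's nearest neighbours."""
    drift = {}
    for decade, sentences in docs_by_decade.items():
        model = Word2Vec(sentences=sentences, vector_size=100,
                         window=5, min_count=5, workers=4, seed=42)
        if term in model.wv:  # the term may be too rare in some slices
            drift[decade] = [w for w, _ in model.wv.most_similar(term, topn=topn)]
    return drift

# e.g. neighbours_by_decade(docs_by_decade, "feminism")
```

Shifts in the neighbour lists from decade to decade would then be one (rough) operationalization of semantic drift.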
At the end of the article, the authors mention that their approach can be applied to newspapers, manuscripts, maps, artwork, etc. What do you think is the significance of extending it to these areas?
Would it be possible to use computational content analysis methods to explain the underlying mechanisms behind the evolution of grammar or the shifts in meaning such as those displayed in the paper?
In some of the examples provided by the authors, it definitely makes sense to attribute the change in frequency or meaning of an n-gram to major historical events such as World War I, but in cases where such a historical link is not readily apparent, can we use computational methods to systematically understand the mechanism (cultural, historical, scientific, etc.) that led to the shift in meaning or popularity of a term or a particular grammatical usage? Or would we need to resort to more traditional qualitative research methods to come up with a hypothesis that explains those mechanisms?
The censorship detection part was very interesting, but I worry a little about the data coverage for foreign books here. Google Books might not have had as good coverage of non-English (or non-Western) books back in 2011 as it does today. So I wonder: does this imperfect coverage bias the corpus Michel et al. used? Is there any new research replicating this paper? It would be exciting to see the results ten years later.
Also, I wonder whether social media has altered the use of language in published books. Is there any relevant literature on this? If it has, then social media really has changed the way we speak!
I had a similar question to @romanticmonkey: it seems that the corpus is sparsely populated by non-Western text. In addition to compiling more data from under-represented languages, is there any other approach to de-bias the dataset? I worry that the authors, as well as other researchers who use this and related corpora, might draw grand conclusions from a very unrepresentative sample of data. Furthermore, this problem seems to be shared across disciplines; biology and medicine, for example, have faced similar issues. What approaches can we as content analysts take to understand the weaknesses of our corpus?
I appreciate the questions that have already been asked about the biases of these books as a data source. In addition to the people who publish books coming from a more educated population, the language of books is also typically more formal, and there are of course differences between written and spoken language. I think those are important considerations for the applicability of this data to lexicographers. Along those lines, I'm curious whether there are other corpora that better approximate informal or spoken language, and whether there is any way to measure such qualities.
It is ironic that even though all the results are descriptive, the paper lacks sufficient descriptive information on the analytical sample itself, namely the corpus of digitized books. Interestingly, the filter table (Table S1) in the online supplement indicates something that needs more elaboration: for example, the fraction of the English-language corpus removed by various filters is as high as 45.53%. The questions, then, are: what is this analytical sample truly representative of, and how much of the "linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000" can actually be recovered from it?
This paper covers a wide range of interesting questions that can be answered through a large-scale corpus and computational methods. It is inspiring to see how cultural dynamics can be revealed by text analysis. My questions for this paper might be:
Many thanks!
This paper is a collaborative study conducted by researchers together with employees from Google. I am wondering: to deal with such a huge volume of data, how much computing power is required? If we want to use the data ourselves, I think it is more practical to work with a slice of it rather than the raw data, so how do we make that slice valid and (relatively) unbiased?
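As a rough sketch of the kind of slicing I have in mind (my own illustration, with a placeholder file name, not anything from the paper or from Google's tooling): reservoir sampling draws a uniform random sample of rows from a count file far too large to hold in memory. Note that "uniform over rows" is not the same as "representative over books or years"; stratifying the sample by year would better preserve the time structure the paper relies on.

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace an existing item with prob k/(i+1)
            if j < k:
                sample[j] = line
    return sample

# Placeholder file name: any raw n-gram count file (one row per ngram/year pair)
# can be streamed through this without loading it fully into memory.
with open("ngram_counts.tsv", encoding="utf-8") as f:
    slice_of_rows = reservoir_sample(f, k=100_000)
```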
Besides, I would like to share the link to the Google Books Ngram Viewer (https://books.google.com/ngrams), where we can browse the trend of an n-gram across time, for example "content analysis" and "natural language processing".
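For anyone who wants to pull the same trends programmatically rather than through the browser, here is a rough Python sketch. It uses the Viewer's JSON endpoint, which is unofficial and undocumented, so the parameter names and the corpus identifier below are assumptions that may change; if the request fails, check the URL the Viewer itself generates.

```python
import requests

# Assumed (unofficial) endpoint and parameters for the Ngram Viewer.
params = {
    "content": "content analysis,natural language processing",
    "year_start": 1900,
    "year_end": 2019,
    "corpus": "en-2019",   # assumed corpus identifier; verify against the Viewer URL
    "smoothing": 3,
}
resp = requests.get("https://books.google.com/ngrams/json", params=params)
resp.raise_for_status()

# Assumed response shape: one entry per query phrase with a yearly frequency series.
for series in resp.json():
    print(series["ngram"], series["timeseries"][:5])
```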
I am amazed at what merely counting words in an enormous dataset can do. However, this also raises some analytical issues: a naive count of words may not be enough to support some of the conclusions. I take issue with some of the threads the paper chooses to follow, for example inferring censorship from a decrease in word counts, or comparing different periods without controlling for the social background of each. Do you think more fine-grained controls need to be in place for the paper to draw those conclusions?
Q: I'm wondering what the limits of a purely "frequency"-based approach to corpus analysis are. For instance, just as the majority of words in the English lexicon are classified as "dark matter" (unpopular and relatively unknown), isn't it likely that the majority of books are also unpopular and unknown? Could weighting all publications equally, regardless of how popular they are, mistakenly capture regional/local/small-scale language variations, or at least overestimate their frequency?
This project seems to be the precursor of the Google Ngram Viewer. I wonder which of them has the bigger corpus? Admittedly this project examines many interesting snippets of research topics, but I feel it does not provide a powerful answer to the "so what" question. Nevertheless, I admire Google's effort to open up access to these data; future scholars can use them in their own research.
Word counting is really powerful! I think this study shows the effectiveness of keeping only the first-order information of texts, i.e. word counts. I have two questions:
Thanks!
This is a really interesting paper, and it was my first time seeing how to detect censorship and suppression with word-counting methods. Following @romanticmonkey's point, non-Western books may not be sufficiently covered, and the related corpora might not yet be established. Moreover, comparable censored information nowadays may be found not in books but on social media platforms or forums. As @william-wei-zhu also points out, "published books are most likely to be read by the more educated, upper-middle-class subgroup of the entire population" and thus cannot represent most of the cases we encounter now. I'm wondering whether there is any other research on censorship and suppression of social media content?
This is a great paper, and I'm impressed with the computational methods it used to analyze the corpus of digitized books, as well as with the results and analysis. My question is about the Construction of Historical N-grams Corpora section: how did the researchers split Chinese text into 1-grams and n-grams? Unlike the other languages (English, French, Spanish, German, Russian, and Hebrew), Chinese has no whitespace between characters, and many strings can be read as either one phrase or two. I'm curious how the researchers handled such ambiguities.
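To illustrate the ambiguity I mean (this is only my own illustration, not the authors' pipeline): in Chinese, "1-grams" can be defined at the character level or by running a word segmenter such as jieba, and the two choices give different units to count.

```python
import jieba  # a popular dictionary-based Chinese word segmenter

text = "自然语言处理很有趣"  # "Natural language processing is interesting"

char_unigrams = list(text)        # character-level 1-grams: ['自', '然', '语', ...]
word_unigrams = jieba.lcut(text)  # word-level segmentation, e.g. ['自然语言', '处理', '很', '有趣']

print(char_unigrams)
print(word_unigrams)
```

Different segmenters (or dictionary versions) can split the same string differently, which is exactly the kind of confusion the question above is about.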
It is mentioned above that the corpus used in the research (about 4% of all books ever published) is a small fraction. I don't think the amount itself is a big issue, as the research can at least detect some patterns, representative or not. More than the amount, I care about whether there is selection bias in which books get digitized. To be more specific, are some books simply more likely to be digitized than others?
I am especially curious how we should value such discoveries about patterns of word use, since word frequency alone seems insufficient to evaluate any cultural pattern.
Similar to a few others, I was thinking a lot about representation and the ability to generalize this study. Even referring to the corpus as 'culturally representative' seems like an overstatement. The corpus breakdown reflects a very American(/British/colonial)-centric lens on 'global culture', since over half the books included are works in English, yet far fewer than 50% of the earth's population speaks English (including as a second language; and as many people pointed out earlier, these populations are generally above average in wealth and education). The density of English term frequency might be further confounded by the fact that, at least today, many non-English languages have adopted the English term for certain things (especially computer/internet-related jargon) in light of globalization.
I am also wondering what texts printed in a language other than a country's native one might reveal about that country, author, or book. For instance, is a German author who publishes in English representative of German culture? Of international culture? Of a targeted English-speaking culture? And, since language is cyclical, what influence does the target audience have on the author's word choice, and how might that influence this study?
In reflecting on the other students' questions about mechanism and causality, I am wondering whether we could tie the counts of words to the authors and publishers of the works to understand where the counts are coming from. That is, if we looked at publisher data from the contents pages of these works, we could see a few things: 1) the geography of the emergence of these cultural patterns, and 2) the genres from which certain patterns emerge. I think author data would be more difficult because names are often duplicated and fairly generic, but perhaps we could understand more about where in social space the knowledge is coming from. Do we increasingly have authors from different places or social classes? How does that affect the counts?
This article analyzed a corpus of digitized books to provide insights about lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. However, I suspect many people do not read books often. I am wondering whether we could analyze large-scale social media data (text, videos, emojis, pictures, etc.) in a similar way, since most people use social media regularly. Are there any difficulties, such as spidering and cleaning? Would ethics be a big issue?
I agree with everyone that this is such an interesting paper.
Firstly, echoing some of the sentiments above about selection bias, I would also like to ask about data selection for text analysis in general. I realized that choosing which "datasets" to include is not clear-cut in text analysis (as opposed to numeric datasets with defined variables): i.e., the sources of texts (books, news, social media, etc.), the domains of texts (arts, history, everything, etc.), the types of texts (verbal vs. written), and so on. So, what general principles or questions should one consider in the process of data selection, not only to curb selection bias but also to actually select the relevant datasets?
Secondly, a lot of the results in this paper are shared as visualizations. But from a statistical perspective, how can we validate whether the results are robust? That is, is there an analogous concept of "significance" in text analysis, and/or is prediction (or rather, predictive accuracy) used sometimes (or a lot?) in text analysis?
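On the "significance" question, one convention from corpus linguistics is to test whether a word's rate of use differs between two corpora or periods with Dunning's log-likelihood ratio (G²). Below is a minimal Python sketch with purely illustrative counts (none of these numbers come from the paper).

```python
import math
from scipy.stats import chi2

def log_likelihood_ratio(a, b, n1, n2):
    """a, b: occurrences of the word in corpus 1 and 2; n1, n2: total tokens in each corpus."""
    e1 = n1 * (a + b) / (n1 + n2)   # expected count in corpus 1 under equal rates
    e2 = n2 * (a + b) / (n1 + n2)   # expected count in corpus 2 under equal rates
    g2 = 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)
    p_value = chi2.sf(g2, df=1)     # compare G^2 to a chi-square with 1 df
    return g2, p_value

# Illustrative numbers only: 120 vs 310 occurrences in two 1M-token samples.
print(log_likelihood_ratio(120, 310, 1_000_000, 1_000_000))
```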
The end of this paper asserts the need to apply similar culturomics analyses beyond books to image-based artifacts such as maps and artwork. I think this creates space for a vibrant intersection between computer vision techniques and NLP. Specifically, I could see this type of analysis being used to discern the different ways people post and react to photographs (e.g., describing a group of people protesting as an uprising vs. a riot). Is there research happening that attempts to merge these two fields so we can better understand how people perceive the world around them?
Post questions here for one or both of this week's orienting readings:
Michel, Jean-Baptiste, et al. 2010. "Quantitative Analysis of Culture Using Millions of Digitized Books." Science Express, December 16.