UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter

2. Counting Words & Phrases to Trace the Distribution of Meaning- orienting #55

Open lkcao opened 8 months ago

lkcao commented 8 months ago

Post questions here for this week's orienting readings:

Michel, Jean-Baptiste, et al. 2010. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science Express, December 16.

cty20010831 commented 8 months ago

While I do think it is interesting to examine collective memory of the past, present, and future through the lens of (digitized) books, I am curious why the authors took two different and, in my opinion, incomparable angles to examine collective memory of the old and the new. Generating a list of 147 specific inventions and tracking their word frequencies over time makes sense to me, but I am curious why the authors merely compare the usage of year numbers (e.g., 1880) across time instead of comparing specific events or even past inventions (which would serve as a perfect comparison with the adoption of the new).

Another question: while I am impressed by how a massive number of digitized books can help us describe trends in grammar evolution and in censorship and suppression, I am curious how we can turn the "descriptive" into the "inferential." Specifically, how can we develop trend-analysis algorithms that predict the evolution of grammar and censorship from real-time, massive data?
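
For concreteness, the paper's core operation is the per-year relative frequency of an n-gram. Below is a minimal sketch of that counting step, assuming the publicly released 1-gram counts have already been loaded into pandas DataFrames (the column names here are illustrative, not the release's exact schema):

```python
import pandas as pd

def relative_frequency(counts: pd.DataFrame, totals: pd.DataFrame, term: str) -> pd.Series:
    """Yearly frequency of `term` relative to all 1-gram tokens that year.

    Assumes `counts` has columns [ngram, year, match_count] and `totals` has
    columns [year, total_match_count]; adapt to the actual file layout.
    """
    term_counts = (counts[counts["ngram"] == term]
                   .groupby("year")["match_count"].sum())
    total_counts = totals.set_index("year")["total_match_count"]
    return (term_counts / total_counts).dropna()

# e.g. relative_frequency(counts, totals, "1880") traces attention to the year 1880,
# and relative_frequency(counts, totals, "telephone") traces an invention's uptake.
```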

Twilight233333 commented 8 months ago

In the discussion section, the authors give an obvious example, Nazi-era censorship, and then use statistical results to support that claim. So, is it possible to identify potential censors by comparing books in different languages, rather than starting from a hypothesis and then testing it? Or could the method identify the culture that a specific language community is trying to promote (e.g., diversity)?

Also, to what extent are books written in different languages representative of the countries where those languages are spoken? For example, is English-language writing produced by non-English-speaking countries to promote their own cultures enough to cause bias? To what extent can books represent culture as carriers? Do books, which require more time and money to produce than tweets or articles, hinder the presentation of some cultural phenomena?
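
One hedged way to make the cross-language comparison concrete (a sketch, not the authors' procedure): within each language corpus, compute how far a name's frequency drops during a target period relative to surrounding years, then compare that ratio across languages.

```python
import pandas as pd

def suppression_index(freq: pd.Series, target: tuple[int, int], baseline: tuple[int, int]) -> float:
    """Ratio of a term's mean relative frequency in `target` years vs. `baseline` years.

    `freq` is a yearly relative-frequency series (e.g., derived from the Ngram counts).
    Values well below 1 flag candidates for possible suppression; they are not proof
    and would still need qualitative inspection.
    """
    target_mean = freq.loc[target[0]:target[1]].mean()
    baseline_mean = freq.loc[baseline[0]:baseline[1]].mean()
    return target_mean / baseline_mean

# Comparing the same name's index in, say, the German and English corpora
# (e.g., target=(1933, 1945), baseline=(1920, 1932)) gets at the hypothesis-free
# screening the comment asks about.
```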

chanteriam commented 8 months ago

The authors argue that their analysis of books will help lexicographers stay up to date in their reporting/recording of word usage. However, given the relatively slow pace of publishing, even that corpus could lag behind actual word usage, not to mention that many linguistic dialects may not be represented in books at all.

Is it possible to extend this analysis to a medium that would be more reflective of changing cultural trends around language use, such as social media? With slang and other forms of word shortening, is using an n of up to 5 for their n-gram analysis itself keeping current with lexical trends?

Caojie2001 commented 8 months ago

This article provides diverse examples of culturomic analyses based on an impressively large corpus, offering evidence for topics ranging from linguistics to political science. The authors' discussion of the usage frequency of irregular verbs and their regular counterparts provides empirical material on the evolution of English grammar, which is only possible with the support of a corpus of this scale. However, although these analyses give us an intuitive impression of overall trends, the way the authors connect them with historical events still needs to be clarified. For example, why are certain periods, words, or patterns chosen to demonstrate the connection? In the Tiananmen Square case, the patterns exhibited around the two incidents are rather complicated: in Chinese texts, the discussion starts to rise before 1976 and 1989, while the English texts show a rather lagging trend. Meanwhile, there is another small peak of discussion in the Chinese texts that is higher than the 1989 one and is not explained.

Another question is whether it would be possible to construct large-scale corpora of information in other forms, such as human speech, to provide empirical evidence for more diverse topics, such as phonological theories about the historical development of certain languages, which can only be verified through small-scale experiments so far.

bucketteOfIvy commented 8 months ago

While Michel et al. (2011) have (and highlight) an incredibly large corpus consisting of "~4% of all books ever published" in multiple languages over multiple centuries, size is not a cure-all for sampling bias. It's likely that Google -- who constructed the corpus -- did not simply scan every book they could find, but prioritized certain books so as to better spend their allocated funds, which would bias the sampled data. Nonetheless, if one wants to sample that much text, they'd be hard pressed to find a better source. When using counting techniques such as those in this study, how can one identify and address biases in the sampled data?

ddlxdd commented 8 months ago

One thing I found interesting is how they describe releasing the data: under copyright restrictions, they could only release answers to questions like how often a given 1-gram through 5-gram was used over time, rather than the underlying texts. What exactly are the copyright restrictions? What are some other advantages of this approach, and what could be the impact on the analysis?
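
For intuition about what "releasing only n-gram counts" means in practice, here is a minimal, illustrative reduction of a tokenized text to counts (not the authors' code); the released data are per-year totals of such counts rather than the texts themselves.

```python
from collections import Counter

def ngram_counts(tokens: list[str], n_max: int = 5) -> Counter:
    """Count all 1- to n_max-grams in a token sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
counts = ngram_counts(tokens)
print(counts["the"], counts["the lazy dog"])  # 2 1
```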

Dededon commented 8 months ago

This is a pretty impressive paper in terms of its methodology (and the workload the authors took on), but I'm more interested in whether the research questions the authors ask are valid social science questions. Some of the assumptions made by the authors are questionable; for example, for their "out with the old" point, they simply used the year as a 1-gram and counted its occurrences. No doubt cleaning the whole corpus is impossible in this case, but the same numeric 1-gram can appear in a text for many different reasons, so this operationalization is a little questionable.

Vindmn1234 commented 8 months ago

This study was groundbreaking because it introduced "culturomics," a method that uses this vast collection of digitized books to analyze cultural trends and patterns over time. But I think the methodology might be oversimplified, because the meanings of words are also determined by the contexts in which they are used, which is hard to capture through mere frequency analysis. Thankfully, with the explosive advancement of deep learning, we now have much more sophisticated language models to help us evaluate text. Also, the study mostly focuses on English words and corpora and thus fails to consider the interaction and assimilation between different languages and cultures, for example the impact of non-English, borrowed words on the evolution of English.

YucanLei commented 8 months ago

The paper introduces a quite new approach to analysis called "culturomics": basically an extension of scientific inquiry to a wide array of new phenomena in the social sciences and the humanities. The authors suggest that previous attempts to use quantitative methods to study cultural change failed for lack of suitable data, whereas culturomics succeeds as a quantitative investigation of cultural trends thanks to a large corpus of digitized texts. The concept is hardly new, but the approach is novel.

I am interested in how the authors can reduce the potential biases of the research. Though 4% of all the books ever published is a huge number, 4% is not in itself a representative proportion. In other words, the research under-represents books that are no longer published or that lack electronic versions. This bias seems inherent and unavoidable.

muhua-h commented 8 months ago

Given that it was published in 2011, the paper's innovative use of n-grams and time-series analysis to explore long-term linguistic and cultural shifts raises intriguing possibilities. One pertinent question is whether such a method could be adapted for more immediate applications, such as analyzing micro-cultural shifts within an organization or real-time cultural trends in the digital world. This approach could offer valuable insights into the dynamic nature of cultural and linguistic changes as they occur.

Furthermore, as NLP techniques continue to evolve, offering more comprehensive text analysis tools, it's worth considering the future relevance of counting-based analyses like n-grams. Will these advanced methods render simpler counting approaches obsolete, or will there still be a unique value in the more straightforward, quantifiable insights that methods like n-grams provide, especially when dealing with massive datasets or when seeking to capture broader cultural trends over extensive periods?

sborislo commented 8 months ago

I think it's clear that having this information on unigrams over time and across contexts is useful, but it wasn't clear to me if they excluded proper nouns from all of their analyses or just some of them. If they weren't excluded, then I think the dynamicity of language might be overstated, since many books contain proper nouns that are specific to that context (e.g., fantasy novels). My question is, when they were excluded, how were the authors able to do so? That is, what cues did they use to differentiate them (since capitalization is insufficient, given words like English)?
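
The paper does not spell out an exclusion rule here, but one off-the-shelf way to separate proper nouns from other capitalized words is part-of-speech tagging, which uses sentence context rather than capitalization alone. A hedged sketch with NLTK (tagging is imperfect, so this is a heuristic, not a definitive fix):

```python
import nltk
# Requires the standard NLTK tokenizer/tagger resources, e.g.
# nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
# (exact resource names vary by NLTK version).

def drop_proper_nouns(text: str) -> list[str]:
    """Return tokens whose Penn Treebank tag is not NNP/NNPS (proper noun)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [tok for tok, tag in tagged if tag not in ("NNP", "NNPS")]

# Sentence-initial "The" is kept (tagged as a determiner), while names like
# "Frodo" and "Mordor" are dropped, unlike with a pure capitalization rule.
print(drop_proper_nouns("The ring was carried by Frodo through Mordor."))
```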

XiaotongCui commented 8 months ago

This is fascinating research! However, I have a question about how we can unify the criteria across different languages. For example, a single Chinese word may have multiple translations in English, leading to potential confusion when compiling statistics across languages.

alejandrosarria0296 commented 8 months ago

Although the work done by the research team is admirable in terms of both scope and quality, I'm still bugged by their use of n-gram frequency as the main measure. Word frequency can certainly be a key input for culturomics, but it clearly misses out on a wide array of textual elements (sentence structure, non-literal uses of a word, etc.). Without stepping out of the realm of n-gram counting, are there other measures that could be used to determine the cultural relevance of a word/phrase/sentence that go beyond counting its frequency in a certain time period?
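
Staying strictly within n-gram counting, a couple of simple derived measures (illustrative, not from the paper) that go beyond raw frequency at a single point in time: the year-over-year rate of change, and how far a year sits above the term's own rolling baseline ("burstiness").

```python
import numpy as np
import pandas as pd

def yearly_log_change(freq: pd.Series) -> pd.Series:
    """Log ratio of this year's relative frequency to last year's."""
    return np.log(freq).diff()

def burstiness(freq: pd.Series, window: int = 20) -> pd.Series:
    """How many rolling standard deviations the current year sits above its recent baseline."""
    rolling = freq.rolling(window, min_periods=5)
    return (freq - rolling.mean()) / rolling.std()
```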

yunfeiavawang commented 8 months ago

This study is an excellent example of representing trends with high-quality data by counting words and phrases. A lot of cultural dimensions are analyzed, including cultural turnover (how long it takes for people to forget and adapt to new cultural paradigms), censorship (detecting censored words by comparing publications under different regimes), and scholarly concentrations (research topics). However, I have a question about the representativeness of the book dataset, which mainly reflects elite culture and ignores folk culture from the lower classes. I am wondering if there's any possibility of analyzing tabloid newspapers of the past (such as The Sun), or short videos today, to dig into the culturomics of the lower-class population.

yuzhouw313 commented 8 months ago

Counting words and tracing phrases across diverse documents is clearly a valuable lens for discerning cultural, historical, and linguistic trends, as this intriguing piece shows. However, since the authors suggest that culturomic methods can be used not only with an established historical perspective but also with a forward-looking one (p. 181), I wonder how we can extend these methods (word counting and phrase tracing) to predicting forthcoming social, cultural, and linguistic trends. In addition, if culturomic methods are to play a predictive role, how can we ensure the reliability and accuracy of the insights generated, particularly in the context of complex and dynamic societal changes and the explosion of online data?

joylin0209 commented 8 months ago

This is a very interesting article. The authors show changes in the frequency with which words are used. What I'm curious about is whether, and how, changes in the words that co-occur with a specific word can be detected over time. For example, I suspect that after 2019 the word "vaccine" appears more frequently, together with words such as "mask" and "quarantine." I'd like to know more about what has been discovered on this front.
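
A hedged sketch of the co-occurrence check this asks about (the released Ngram data only goes up to 5-grams, so a wider window would need the underlying texts or another corpus): count the words appearing within a few tokens of an anchor word, separately per year, and compare the resulting profiles.

```python
from collections import Counter

def window_cooccurrence(tokens: list[str], anchor: str, window: int = 5) -> Counter:
    """Count words appearing within +/- `window` tokens of `anchor`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == anchor:
            left, right = max(0, i - window), i + window + 1
            counts.update(t for t in tokens[left:right] if t != anchor)
    return counts

tokens = "people wore a mask before the vaccine and quarantine rules eased".split()
print(window_cooccurrence(tokens, "vaccine").most_common(3))
```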

Marugannwg commented 8 months ago

A great example of a huge corpus and a demonstration of how frequency analysis of words/phrases (n-grams) can be used in the text domain. It touched on the evolution of language and many sociological interests, like the rise and fall of fame and taboos.

hongste7 commented 8 months ago

The paper uses computational analysis to observe changes in language and cultural phenomena. How do the authors address the challenge of interpreting semantic changes in language over time, and the risk of misinterpreting historical contexts? I imagine some words change definitions over time, or may mean different things in different contexts. (Or perhaps this isn't a concern because the majority of words haven't undergone this change?).

erikaz1 commented 8 months ago

My question concerns the interpretation of Michel et al.'s findings regarding linguistic and cultural change (p. 3). How big do "changes in lexicon" have to be in order to signify cultural/value shift? When does some collective action become large enough to be considered "culture"? How have we determined these cutoffs (maybe statistical significance - but then how do the stats translate to lived experience; is any and all change something meaningful/useful)?

naivetoad commented 8 months ago

This study utilizes a vast corpus of digitized texts to analyze cultural trends quantitatively, with many interesting findings about past linguistic and cultural phenomena. But I'm curious about the long-term implications of the trends observed in the study for understanding linguistic and cultural evolution.

Audacity88 commented 8 months ago

I find this research very interesting, and am hoping to do a similar study using the Ngrams corpus. With that in mind, I wonder about how far back the validity of this corpus extends. The authors note that "The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 98 million words per year; by 1900, 1.8 billion; and by 2000, 11 billion". A few books per year seems too few to meaningfully represent an era; how many is enough?
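
One hedged way to put a number on "how many is enough": bootstrap the yearly frequency estimate over books and see how wide the interval gets when only a handful of books are available. A sketch under the assumption that per-book counts are accessible (the public release does not provide them):

```python
import random

def bootstrap_interval(book_counts: list[tuple[int, int]], n_boot: int = 2000) -> tuple[float, float]:
    """95% bootstrap interval for a term's frequency in one year.

    `book_counts` holds one (term_count, total_tokens) pair per book in that year.
    Very wide intervals for the early decades would signal that a few books per year
    cannot pin down the estimate.
    """
    estimates = []
    for _ in range(n_boot):
        sample = random.choices(book_counts, k=len(book_counts))
        term = sum(c for c, _ in sample)
        total = sum(t for _, t in sample)
        estimates.append(term / total)
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]
```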

ana-yurt commented 8 months ago

I find this paper fascinating, as it taps into a realm that has traditionally been occupied by cultural historians and theorists. I have a question about the sample. "Over 15 million books have been digitized [~12% of all books ever published]," from which the authors "selected a subset of over 5 million books for analysis." I wonder whether the roughly 4 percent of books used in this study is a somewhat random sample—do we have enough knowledge of the digitization process to make this assumption?

The speed of collective memory "metabolism" and the scale of censorship are two really fascinating points. I wonder, with such expansive data, whether we can also map out the contextual meanings of words and how they have shifted over time.

Carolineyx commented 8 months ago

I appreciate the insights this paper has provided. I have two questions:

  1. If I would like to study patterns across cultures and geographical regions, what would be a good approach (research/method design) to extend the research beyond the English-writing world?
  2. I'm also curious because I haven't seen any text analyses across different time periods consider the potential impact of 'who can write and publish' at various historical times. This factor might significantly influence the content included in books. If I want to control for this, should I add 'main social class representation of the published books' as a control variable?

ethanjkoz commented 8 months ago

I've watched a few YouTube videos about Zipf's law and discussions of this type of research, if not this research itself. The authors claim culturomics may be a useful tool in the humanities, but I see obvious applications to the social sciences, particularly cultural sociology and political science. My question is: how do the authors know that the words they select mean what they assume they mean? To clarify, I suppose I am asking how one would take into consideration the context surrounding many of these words. Can we claim that the rise in "the cell" is due to the popularity of science, or is it skew from something else? Furthermore, how was this corpus collected, and what determined whether a book was in the collection or not? Which books are systematically left out?

donatellafelice commented 8 months ago

Ethan and I seem to have been making similar points! I was also curious whether they looked at any of the surrounding words and tried to see if the contexts of those words had changed, and thus whether the meanings of the words changed (or, in the case of homophones, as Ethan mentioned). I was also curious about their choice of texts "on the basis of the quality of their OCR and metadata," which seems like an interesting choice, as some of the most culturally relevant texts could simply have been hard to read in those early years...

But in the spirit of trying to find a new question, I wondered about their note on fame: the authors note that "actors tend to become famous earliest, at around 30." In my head, that feels incongruent, and it is surely not true in America for female actors. I would note that female actors are often much more famous than their male counterparts (even more noticeably at present, and for some time now with female singers). So I was wondering how they checked their interpretations. This methodology seems likely to lend itself to misunderstanding, misreading, or biases (of the authors or within the source works), as others have mentioned. What are some methods for checking their interpretations? With such a huge corpus, should random tests be performed? If so, of what?

LyuZejian commented 8 months ago

I have to say, this article is fascinating, and I am astonished that it was published back in 2011. Despite its novelty, I must raise several concerns about the analysis presented. In my experience, culture is a fragile and capricious object of investigation, and the extent to which quantitative methods on mass data can be applied to it is questionable. The data are gathered from Google Books, and the role of books in people's cultural activity changed continuously from 1800 to 2000, so whether books can represent people's attention to their time, or attention to fame, is questionable. This is something to keep in mind especially when we try to replicate this kind of work on a recent corpus; the confounding role of the publication system needs to be addressed more.

One interesting question might be: how could we enhance the method of "counting phrases"? Counting is one of the oldest methods of quantification and reasoning. Given the development of natural language understanding, I think pairing "phrase" with "context" is interesting: by tracing word-context pairs as they appear in the corpus, we can discern more sophisticated patterns of usage, such as the dynamics of culture and the changing meanings of words.

HamsterradYC commented 8 months ago

Considering that the text corpus spans a long period and includes works that reflect historical biases and stereotypes, as well as potentially censored or propagandist material, how do we avoid perpetuating or distorting these biases in subsequent analyses of historical data?

runlinw0525 commented 8 months ago

I am shocked by the size of this corpus, which makes this quantitative analysis nearly impossible to replicate on an individual basis. However, given that the corpus has a lot more English text than text in other languages (different weights), how would this affect our understanding of cultural and linguistic shifts in areas where English isn't the primary language? Also, what steps could be taken to make the analysis more inclusive of global cultural trends, such as readjusting the composition of the corpus (although I know that could be hard to achieve in a real-life setting)?

beilrz commented 8 months ago

I think this is an interesting study with much potential for future exploration. One concern of mine is the reliance on book data, which could introduce a time delay in the analysis, since it can take a significant amount of time to publish a book; also, people who author books may be older than the societal average, so they may incline toward more "old-fashioned" language. Furthermore, the language in books can be more formal than daily communication. As such, it would be great to extend this study to language use in more casual settings (such as social media, flyers/advertisements, ...). Another logical continuation is to use embeddings to monitor shifts in word meaning over time: embedding models can be trained on corpora from different time periods, and the words most similar to a given word can be compared across periods.
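
A minimal sketch of that embedding idea (not something the paper itself does), assuming sentences have already been grouped by period; gensim's word2vec is used here, and aligning the separate vector spaces (e.g., with an orthogonal Procrustes step) is omitted for brevity:

```python
from gensim.models import Word2Vec

def neighbors_by_period(sentences_by_period: dict[str, list[list[str]]],
                        word: str, topn: int = 10) -> dict[str, list[str]]:
    """Train one word2vec model per period and list `word`'s nearest neighbors in each."""
    result = {}
    for period, sentences in sentences_by_period.items():
        model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
        if word in model.wv:
            result[period] = [w for w, _ in model.wv.most_similar(word, topn=topn)]
    return result

# e.g. neighbors_by_period({"1900s": sents_1900s, "1990s": sents_1990s}, "broadcast")
# (the period corpora here are hypothetical) shows how a word's typical company,
# and hence its likely sense, shifts over time.
```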

XiaodiYang2001 commented 8 months ago

The authors say their study covers about 4% of all books ever published. I'm curious how they selected these books. The article only describes that they come from different languages and time points. What is the publication volume of these books? What types are they? I also want to learn from them how to better analyze texts in different languages. In addition, I feel that this is excellent research with strong practical significance; it is more like building a huge database.

chenyt16 commented 8 months ago

I am curious about the selection criteria for the over 5 million books, so I went to the supplementary material, which articulates the materials and methods in detail. According to it, the selection is based mostly on data quality -- they "Performed further filtering steps to select only a subset of books with highly accurate metadata." But the singular pursuit of high-quality data is bound to sacrifice comprehensive coverage, especially considering the significant evolution of literary genres over the past 200 years. I also question whether the 'word' is the most suitable unit for analyzing overarching questions such as human culture. Taking grammar as an example, analyzing at the word level constrains the analysis to past-tense verbs, overlooking other potential evolution in sentence structure or tense preference.

michplunkett commented 8 months ago

I enjoyed this article and look forward to further extensions of this methodology. I wonder how they will approach doing this with less permanent data than the written record. Video platforms like Periscope, Vine (RIP), Instagram's Reels, and TikTok feel like the next logical progression of this work, but they all have significant security layers around them. I wonder how future researchers will access data that is either no longer publicly accessible (Vine, etc.) or short-lived video segments like Reels.

Some places are already working on these problems, but I believe this work is being done at smaller scales: Link. Either way, I am excited to see extensions of this in the future.

anzhichen1999 commented 8 months ago

Building on the censorship and suppression findings, and considering current advancements in machine learning and natural language processing, do you envisage a possibility of refining these culturomic methods not only to analyze historical data but also to predict future trends in cultural influence? Furthermore, how might integrating real-time data from social media and other digital platforms enhance the predictive power of these models, especially in identifying early signs of suppression or censorship in the digital age?

yueqil2 commented 8 months ago

This research project displays the underlying capacity of culturomics, which applies high-throughput data collection to the analysis of human culture. Culturomic tools seem to improve lexicographers' work in several ways, such as finding low-frequency words, providing accurate estimates of current frequency trends (p. 177), and even improving the accuracy of etymology. However, I wonder whether those tools could also trace the evolution of a specific concept, that is, go beyond "frequency" and seek "interpretation."

QIXIN-LIN commented 8 months ago

In this study, I'm wondering about the representativeness of digitized books, which require significant resources to produce. Can they truly reflect wider cultural trends? Also, considering how easy tweets and online posts are to create, and their specific user base, is it feasible to say that these digital texts represent the cultural tendencies of the entire population? This raises a crucial point for social media analysis: should we focus on the representativeness of these texts, or is it more practical to narrow our research scope?

Brian-W00 commented 8 months ago

In the study "Quantitative Analysis of Culture Using Millions of Digitized Books," what is the impact of the techniques the authors use on future research in the humanities and social sciences? How do these methods deal with limitations and biases in big digital data?

floriatea commented 6 months ago

With the quantifiable shifts in language highlighted through culturomic studies, what predictions can we make about the future evolution of language? Specifically, how might new forms of digital communication (e.g., social media, texting) further accelerate these changes? Considering the analysis of cultural trends across different languages and regions, what does culturomic research suggest about the convergence or divergence of global cultures in the digital age? Will the increasing accessibility to digital books and media lead to more homogenized global cultures, or will cultural distinctiveness persist?

JessicaCaishanghai commented 6 months ago

How does the concept of "culturomics" utilize large-scale data analysis to investigate linguistic and cultural phenomena, and what are some of the key insights it has provided about fields such as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology?