UChicago-Computational-Content-Analysis / Readings-Responses-2023

2. Counting Words & Phrases - orienting #51

Open JunsolKim opened 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's orienting readings: Michel, Jean-Baptiste et al. 2010. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science express, December 16.

GabeNicholson commented 2 years ago

A great succinctly written article. The authors mention that books are the primary medium being scanned and that newspapers along with other artistic forms will follow suit. Has this been done yet? It seems to me to be a great "test case" that can see if the patterns from this paper generalize to other mediums. The issue with looking at the data first as the authors did is that random strong patterns happen all the time just from statistical noise alone.

Here are some examples of these spurious correlations that come from a fun website which tracks these things:

"Divorce Rates in Maine and Consumption of Margarine": correlation of about 99%.
"Worldwide non-commercial space launches and Sociology Doctorates Awarded": correlation of about 79%.
... and many more. Link: https://www.tylervigen.com/spurious-correlations
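To make the noise point concrete, here is a minimal sketch (not anything from the paper) that compares one random series against many other random series and reports the strongest correlation found. The helper names `pearson_r`, `random_walk`, and `best_spurious_match` are made up for illustration, and the series are pure noise by construction.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def random_walk(length, rng):
    """A series with no real signal: the running sum of Gaussian noise."""
    level, series = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        series.append(level)
    return series

def best_spurious_match(n_candidates=1000, length=10, seed=0):
    """Largest |r| between one noise series and n_candidates other noise series."""
    rng = random.Random(seed)
    target = random_walk(length, rng)
    return max(
        abs(pearson_r(target, random_walk(length, rng)))
        for _ in range(n_candidates)
    )

if __name__ == "__main__":
    print(best_spurious_match())
```

The more candidate series you search, the stronger the best match tends to look, which is why a pattern found by browsing the data first deserves a test on a separate corpus.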

Sirius2713 commented 2 years ago

An interesting point the authors make is "Out with the Old". Generally, they conclude that the half-life of a year becomes shorter as time goes by. But is it possible that this is related to the events happening in that year? For example, I believe 1945, which marks the end of WW2, has a longer half-life than some random year, say 1960. The same might hold for 2008 (Global Financial Crisis) and 2020 (COVID pandemic).
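One way to check this would be to compute the half-life per year and compare eventful years with ordinary ones. Below is a minimal sketch, assuming "half-life" means the number of years after the peak until the frequency first drops to half of its peak value. The function name `half_life` and the numbers in the example are hypothetical, not taken from the paper's data.

```python
def half_life(freq_by_year):
    """Years from the peak until frequency first falls to half the peak.

    freq_by_year maps a publication year to the relative frequency of
    the 1-gram being tracked (e.g. the string "1945").
    Returns None if the frequency never falls that far.
    """
    years = sorted(freq_by_year)
    peak_year = max(years, key=lambda y: freq_by_year[y])
    half = freq_by_year[peak_year] / 2
    for year in years:
        if year > peak_year and freq_by_year[year] <= half:
            return year - peak_year
    return None

# Toy, invented frequencies for illustration only.
toy = {1950: 10.0, 1951: 8.0, 1952: 6.0, 1953: 5.0, 1954: 4.0}
print(half_life(toy))
```

Running this over every year 1-gram would give a half-life series in which outliers such as 1945 could be compared against their neighbors.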

mikepackard415 commented 2 years ago

Definitely a cool article! One thing I think we should keep in mind when interpreting these results is the huge change in the literacy rate that occurs over this timespan (alongside an explosion in population). In 1800, roughly 12% of the global population (age 15+) was literate, compared to 82% in 2000 and 86% in 2016. I wonder whether books becoming accessible to a greater share of the population could have anything to do with some of the trends the authors find? In particular the trends regarding fame and even the regularization of language seem like they could be impacted by this.

Figure: Global Literacy. Source: https://ourworldindata.org/literacy

pranathiiyer commented 2 years ago

Super interesting article. It shows how frequency itself can be used to extrapolate meaningful insights. Even though I understand that they strictly speak in terms of frequency, it also makes me wonder how much more we can extrapolate about constructs such as fame, censorship, etc. by looking at collocations or the contexts in which these words were used?
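As a rough illustration of what looking at contexts could mean, here is a minimal sketch that counts the words appearing within a fixed window around a target word. The function name `context_counts`, the window size, and the toy sentence are assumptions for illustration; the paper itself reports frequencies only.

```python
from collections import Counter

def context_counts(tokens, target, window=2):
    """Count words that occur within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target occurrence itself
                    counts[tokens[j]] += 1
    return counts

tokens = "the famous actor met the famous writer".split()
print(context_counts(tokens, "famous", window=1))
```

Comparing these counts across decades would show whether a word keeps the same neighbors while its raw frequency changes.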

Jiayu-Kang commented 2 years ago

As the authors mentioned, Culturomics results provide new and powerful evidence in the humanities. I'm curious about the potential biases the corpus could have and how to reduce them when analyzing human culture and historical trends. For instance, I know some printed letters are more likely to be confused with other letters when scanned using OCR, especially in certain fonts from older books. The composition of available digitized books may also not be representative of all books. How does recent work address potential biases? What efforts have been made to improve the corpus quality?

melody1126 commented 2 years ago

To proxy cultural change from word usage in books, the authors were able to see when people stop using certain terms or irregular verbs, or stop referring to certain historical events or famous people. This echoes the demographic theory that social change happens through demographic replacement ("The Cohort as a Concept in the Study of Social Change" https://www.jstor.org/stable/2090964).

Cultural change, like science, can progress "one funeral at a time" – this is an informal version of Planck's principle about scientific advancement.

How would we measure the relationship between the cultural change measured by word usage and this demographic change, as opposed to external factors?

Jasmine97Huang commented 2 years ago

I am impressed by the record linkage and information retrieval procedures outlined in the pursuit of fame study. However, I am a little skeptical about the conclusions about ‘culture’ drawn from books alone. In particular, books are critically linked with the education and socioeconomic status of the writers and the biases of the institutions that publish and print them. It seems reasonable to believe that voices and ideas that make up the important social fabric of the marginalized are stifled. The absence of such discussion feels somewhat concerning.

weeyanghello commented 2 years ago

As a linguistic anthropologist by training, it is remarkable for me how much the excitement in this article on "culturomics" mirrors the excitement when charts representing timelines were first invented in the 18th century. The ability to represent "all" of mankind's important events within a flowing chart enabled a very specific, zoomed-out perspective of humanity as a species, which lent itself to further linear conceptualizations of 'progress' and 'temporality' (Rosenberg 2007). The kind of trend analysis done in "culturomics" can itself be traced back to the historiography afforded by such graphic inventions of time. Taking a step back and looking at the meta-language of this article, which links the denotational text in published works to the entirety of humanity, a few questions arise: What does it mean to say that the entire human species has one, analyzable culture? When the article makes grand claims such as "We are forgetting our past faster with each passing year" (179) based on the numerical value of bygone years mentioned in texts, who exactly is the "We"? And what "past" are we forgetting exactly? While this mode of analysis can easily track morphological changes in language, will it be able to do the same for semantic changes that are not linked to morphemes, i.e., intensional/extensional changes, such as in "God" or "apple" or "gay"? (N.B. the word 'apple' used to refer to fruits in general, not a specific fruit).

Edit: Responding to isaduan's comment below, the authors do not claim a single "culture" in the same way we think of "culture", but they do figure universals that they generalize to the entirety of mankind, especially within the language of the article. My point is that this, along with the definition of "culture", needs to be unpacked further: What do we take to be cultures? How do we figure cultures within and as associated with certain human universals? How do texts actually relate to "cultures"? The analysis, while interesting, really raises many other interesting questions that I think would be exciting to think about further ☺️

Rosenberg, Daniel. 2007. “Joseph Priestley and the Graphic Invention of Modern Time.” Studies in Eighteenth-Century Culture 36:55–103.

isaduan commented 2 years ago

@weeyanghello, I may be misunderstanding, but I think the authors do not claim that there is only one culture to be analysed; rather, cultural studies can also benefit from comparing books and other text data from different cultures and communities. For example, the comparison between German and English enables the authors to study censorship in Germany.

@mikepackard415, I second the point about literacy. The accessibility of books and their formats could have deep, long-lasting effects on society. For example, Elizabeth L. Eisenstein's The Printing Revolution in Early Modern Europe argues that "paper printing" is deeply connected to the Enlightenment and defiance of authority.

Qiuyu-Li commented 2 years ago

This is a really interesting and inspiring story! It has done a bunch of exciting things with the data, and also indicates what else can be done with it. In particular, I'd like to see insights on the books themselves, which should have a close relationship with the words coming out of them. For example, the demographics of readers evolve over time. In an era where only elites can read, the books might be quite abstruse, while when literacy increases to a higher level, the books should be more commonplace and entertaining. I expect this to affect the patterns reflected in the word frequency studies.

weeyanghello commented 2 years ago

@isaduan, the authors do not claim a single "culture" in the same way we think of "culture", but they do figure universals that they generalize to the entirety of mankind, especially within the language of the article. My point is that this, along with the definition of "culture", needs to be unpacked further: What do we take to be cultures? How do we figure cultures within and as associated with certain human universals? How do texts actually relate to "cultures"? The analysis, while interesting, really raises many other interesting questions that I think would be exciting to think about further ☺️

ValAlvernUChic commented 2 years ago

This article is super cool! Unless I'm misunderstanding, the paper essentially uses word counts (frequency of words) to make its inferences. I'm now thinking about further extensions, possibly looking at how words used in relation to one another have changed over time. I imagine that this could give quite a nice illustration of how word meanings might change and when/how exactly they change. Off the top of my head, I'm thinking of the word "dope": was it first used in relation to drugs, people, or the concept of coolness? What events triggered the evolution of the word? How does the word reflect the state of the world at that time? It'd be possible to investigate just this one word, but if we were interested in answering this question across many words then..

chuqingzhao commented 2 years ago

This paper is pretty cool! The authors measure shifts in cultures and languages by counting word frequencies. I wonder whether and how we should also investigate the context of words? For example, the authors mention "women" and "men" as a measure of the battle of the sexes. How do we know the books are talking about gender inequality rather than something else related to just women or men?

Hongkai040 commented 2 years ago

Awesome research. Many results are super interesting. I wonder whether we need to take the time lag of publication into consideration? Authors may take years to write their books, and editors and publishers need time to proofread and publish as well. The time needed may also differ case by case and year by year. It seems that this time lag prevents us from exploring culturomics at a finer time granularity.

Another quick question. It seems that most of the findings in the paper are based on 1-grams (some on 2-grams or 3-grams). What can we do with 4-grams and 5-grams?
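For anyone unsure what an n-gram is in practice, here is a minimal sketch of extracting n-grams of any length from a token list. The function name `ngrams` and the toy sentence are made up for illustration, and whitespace splitting is a simplifying assumption rather than the tokenization the paper used.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the war to end all wars".split()
print(ngrams(tokens, 4))           # 4-grams
print(Counter(ngrams(tokens, 1)))  # 1-gram counts
```

Longer n-grams capture fixed phrases and quotations, but each one is rarer, so counts get sparse quickly as n grows.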

chentian418 commented 2 years ago

This paper conducts an extensive examination of digitized books from 1800 to 2000, surveying the vast terrain of "culturomics" and reflecting the evolution of linguistic and cultural phenomena using word counts. I have the following questions:

  1. Since the digitized text covers only about 4% of all books ever printed, I am wondering about the distribution of topics among the digitized books. Are there any signs of selection bias, such as digitized books involving more regular forms of verbs?
  2. Regarding the evolution of grammar, I am curious about the common ways to identify English irregular verbs as belonging to the same class of verbs?

Thanks!

facundosuenzo commented 2 years ago

Great thread @mikepackard415, @isaduan, and @weeyanghello! Eisenstein's works on the printing press are among my favorites. My questions are along these lines. How can we address the many sources of variation in "history" that affect how those words are incorporated in texts? For instance, I was surprised that the authors claimed something like "Since then, the cultural adoption of technology has become more rapid." (p. 180). Can we say that word frequency reflects adoption? In technology and media studies, it is commonly argued that each innovation is followed by a "moral panic" reaction about its impact on society (usually negative). Maybe those words didn't imply "adoption" but "rejection" or "fear".

konratp commented 2 years ago

I really enjoyed reading this paper and was amazed by all the possibilities for analysis this dataset allows for. I found it particularly interesting to see the effects of state censorship on the data (e.g. the censorship of Leon Trotsky's name), and it made me wonder if there are more subtle forms of censorship that have similar effects on certain words/phrases? And how do those more subtle forms of censorship differ in various contexts? I could imagine certain words or phrases being repressed in, for example, Western European democracies, while the same words and phrases are used more in the US, where there are fewer restrictions on what constitutes free speech. Then again, I could see cultural factors (e.g. the red scare) contributing to higher levels of self-censorship in the US than in other Western countries. Either way, this article makes me want to jump right into the dataset and manipulate it for hours.

YileC928 commented 2 years ago

Definitely an interesting paper! It demonstrates how one measure, the frequency of words, can offer so many fruitful insights. I am particularly intrigued by the 'out with the old' section, as the authors frame it not just in terms of attention/popularity (things we normally focus on), but also in terms of forgetting: the increasing tendency to forget the old. As the paper just provides descriptive data, I start to wonder about the reasons behind it: Could it be that while information is constantly exploding, collectively we only have a limited amount of bandwidth? How could we design research to dig into that?

hsinkengling commented 2 years ago

  1. What does it mean for data to be "high throughput"?

  2. When the authors are comparing translations of the same word, I wonder whether the total volume of books published in each language would affect the popularity index?
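On the second question, one common way to keep corpus size from driving the result is to divide each raw count by the total number of words published in that language and year. Here is a minimal sketch of that normalization; the function name `relative_frequency` and all numbers are hypothetical, chosen only to show the effect.

```python
def relative_frequency(count_by_year, total_by_year):
    """Raw counts divided by the total words published that year.

    Years with no recorded total are skipped, and a missing count is
    treated as zero occurrences.
    """
    return {
        year: count_by_year.get(year, 0) / total
        for year, total in total_by_year.items()
        if total > 0
    }

# Invented numbers: the raw count grows tenfold, but so does the corpus.
counts = {1900: 5, 1950: 50}
totals = {1900: 1000, 1950: 10000}
print(relative_frequency(counts, totals))
```

With this normalization the two years come out equal, even though the raw count in 1950 is ten times larger.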

zixu12 commented 2 years ago

This is an amazing project considering the huge corpus used, and the authors did many interesting descriptive analyses. I am a bit curious about how the authors selected the 4% of books, and I also suspect that what is recorded in books differs across time, i.e. books published in the 21st century are definitely more diverse than those in the past, and more 'low-end'. This might pose restrictions on certain research.

Emily-fyeh commented 2 years ago

This article provides multiple directions for serial cultural research. I would also be interested in some more focused, detailed investigations on a smaller scale of culturomics study, preferably incorporating other research designs. For example, exploring the boundaries and interrelationships between languages, races, and cultures before drawing implications can help us to reproduce the whole picture of literacy, press, and lives of people in different eras.

MengChenC commented 2 years ago

The paper is a breakthrough in terms of expanding the scope of research on cultures and other human behaviors. What I am curious about is its application. Let's say we applied the same measure to newspapers: what can we expect? We would get more abundant data, but aside from the difficulty of ingesting and processing it, we would also need to identify trends in even more volatile information from news(papers). Also, after 2000, the spread of information reached a new dimension, and expanding and fickle information and misinformation are reconstructing our world. What can we learn from the method in this new era? What are some pros and cons to look into?

ZacharyHinds commented 2 years ago

This article is very interesting, and it leads me to wonder how similar analysis could be used in future studies on the vast amount of textual content online? Do our lexicon and grammar change at a different rate when we look at how people communicate online? I imagine that the less formal nature of online communication could even give insight into the more "natural" ways in which people use language.

LuZhang0128 commented 2 years ago

Awesome research! I wonder, however, if there is any potential systematic bias that will hinder our understanding of historical trends. E-books are common nowadays so probably every book has its chance to be included in the database. However, books in earlier years need to be popular to be passed down. It would be impossible to go back and collect the books. But I wonder if there's anything we can do to adjust the result statistically to alleviate the bias?

hshi420 commented 2 years ago

Although books and newspapers were the main means of sharing ideas and information, we now have more ideas posted on the internet on various platforms. I do think it is important to keep the source consistent throughout the years of interest, but I also think it is a good idea to select the text type based on the years. I'd like to know what the potential bias is in this case and how we might alleviate it.

NaiyuJ commented 2 years ago

I'm thinking of a broad question: what is the role of text analysis in studying the evolution of culture or the history of science? It seems that text analysis is especially suitable for this kind of mixed-methods work, which combines qualitative and quantitative data to explore historical inquiry.

sizhenf commented 2 years ago

I am very impressed by the enormous amount of data collected and compiled in this project. With such a big data set, I am sure that there are many research questions we can study, and the paper gives us a taste of them. What I had in mind when reading this paper was the book 1984, written by George Orwell. The Party created a language called "Newspeak", which simplifies the grammar and restricts the vocabulary of the English language to limit the individual's ability to think. With access to the data, we could potentially look at how changes in grammar and vocabulary are related to the evolution/backsliding of democracies.

kelseywu99 commented 2 years ago

I am particularly interested in how censorship was detected through word usage frequency, and I agree with the authors' conclusion that other media such as newspapers, manuscripts, and arts may be incorporated into the expansion of this project. However, it is common for writers to use euphemism to evade censorship or political repercussions; "that famine", "that massacre", "tragedy occurred in year x", just to name a few examples that don't bear much meaning if not read in context. I was wondering how the usage of euphemism may be factored into word frequency?

sudhamshow commented 2 years ago

I was wondering if studying the evolution of culture could give biased results depending on the source used for the analysis. The authors use data from Google Books, which, according to the paper, contains books from 40 different libraries around the world. Published books often go through some amount of scrutiny, proofreading, and other pre-publication formalities. This, however, favors more sophisticated language compared to a local publication (which more often than not contains the local lingo). Would we have seen the same observations (the relative percentages) if sources other than published books had been considered (excluding the data from the books)?

AllisonXiong commented 2 years ago

Great work with some sense of humor. It's astonishing how many insights can be gained from simply counting word frequencies. My question is, does the medium (published books) bring some bias into the corpus? For instance, in the era before paperbacks became popular, books were somewhat luxurious and could only be accessed by people with high social status, and may consequently be more representative of the language use of these people as well.

sborislo commented 5 months ago

I think it's clear that having this information on unigrams over time and across contexts is useful, but it wasn't clear to me if they excluded proper nouns from all of their analyses or just some of them. If they weren't excluded, then I think the dynamicity of language might be overstated, since many books contain proper nouns that are specific to that context (e.g., fantasy novels). My question is, when they were excluded, how were the authors able to do so? That is, what cues did they use to differentiate them (since capitalization is insufficient, given words like English)?

yuzhouw313 commented 5 months ago

Reading this intriguing piece, it is clear that word counts and phrase traces across diverse documents are a valuable lens for discerning cultural, historical, and linguistic trends. However, since the authors suggest that culturomic methods can be used not only with established historical perspectives but also with a forward-looking perspective (p. 181), I wonder how we can extend these methods (word counting and phrase tracing) to predicting forthcoming social, cultural, and linguistic trends. In addition, in envisioning a future where culturomic methods play a predictive role, how can we ensure the reliability and accuracy of the insights generated, particularly in the context of complex and dynamic societal changes as well as the explosion of online data?

XiaotongCui commented 5 months ago

This is fascinating research! However, I have a question about how we can unify the criteria across different languages. For example, a single Chinese word may have multiple translations in English, leading to potential confusion when compiling statistics across languages.

Dededon commented 5 months ago

That's a pretty impressive paper in terms of its methodology (and the workload the authors took on), but I'm more interested in whether the research questions the authors asked are valid social science questions. Some of the assumptions made by the authors are questionable. Take the "out with the old" point: they simply used the year as a 1-gram and counted it. There is no doubt that cleaning the whole corpus is impossible in this case, but the same numeric 1-gram could appear in the text in many different ways. This way of operationalizing the question is a little bit questionable.

joylin0209 commented 5 months ago

This is a very interesting article. The authors show the changes in the frequency of use of words. What I'm curious about is whether and how differences in the words that co-occur with a specific word can be detected over time? For example, I suspect that after 2019, the word "vaccine" will appear more frequently, together with words such as "mask" and "quarantine." I'd like to know more about what has been discovered about this.

volt-1 commented 5 months ago

The fact that the database includes books in various languages and the statistical language determination using the Popat algorithm is impressive and truly a commitment to linguistic diversity. However, this raises questions about the selection criteria for these languages. Were certain languages prioritized based on the number of speakers, historical significance, or the availability of texts?

anzhichen1999 commented 5 months ago

On the topic of censorship and suppression: considering the current advancements in machine learning and natural language processing, do you envisage a possibility to refine these culturomic methods to not only analyze historical data but also to predict future trends in cultural influence? Furthermore, how might integrating real-time data from social media and other digital platforms enhance the predictive power of these models, especially in identifying early signs of suppression or censorship in the digital age?