Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Counting Words & Phrases - Jean-Baptiste et al 2010 #13

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

Post questions here for:

Michel, Jean-Baptiste et al. 2010. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science Express, December 16.

lkcao commented 4 years ago

1. I have noticed that this study (and many other similar studies) chose some samples based on their potential to attract public interest (e.g., the 25 most famous individuals from each occupation, or some of the most famous events, such as Nazi censorship and the world wars). The choice of and focus on these samples do not follow a coherent theoretical thread. Can we say that this is because computational study is a newly born field, and there is currently not much academic consensus? And will this trend continue, given that as more scholars of cultural studies enter the field, they may have higher requirements for theoretical development?

katykoenig commented 4 years ago

While this paper was interesting in its (for the time) novel approach of mining Google Books to understand cultural and linguistic changes, because it was a survey of applications of this data, I was left with questions regarding research that came from this approach:

1.) The paper notes the differences between words used in books and English words as defined by established dictionaries, and then reflects on the addition of words to these dictionaries. Using the Google Books corpus, could someone successfully predict new additions to dictionaries?

2.) The paper analyzes works from the 1500s onward, but if we expanded the corpus to include works in Old and Middle English, would we be able to study language change/exportation more thoroughly? For example, while Old English was primarily influenced by Germanic tribes, after the Norman Conquest in 1066 we get the "frenchification" of (Old) English into Middle English. I am wondering if examining works from before and after could shed light on why some words (and other linguistic phenomena) were adopted from French while others remained more Germanic, and whether that could be extrapolated to the colonization of other languages.

laurenjli commented 4 years ago

This article provides quantitative evidence of "our increasing tendency to forget the old." I found this particularly interesting given the dialogue around declining attention spans in younger generations. I would be interested to see a similar analysis done on social media posts to understand how information travels and persists in that medium, outside of books.

bjcliang-uchi commented 4 years ago

I am quite impressed by the overall approaches but I am not convinced by the censorship part.

I think, in general, that despite the amazing size of the text corpus, most of the methods they applied are exploratory rather than analytical (i.e., mostly just counting words). This is understandable given how much computing power is needed to do any analysis on such a tremendous dataset, but these exploratory methods do not, as @clk16 says, "follow a coherent theoretical thread." I am wondering whether there are any cases where more rigorous social science implications have been derived from a similarly sizable dataset.

ckoerner648 commented 4 years ago

Michel et al. (2010) mention the example of the painter Marc Chagall, whose work was censored by the German Nazi regime between 1933 and 1945. In this period, the frequency of his name decreases sharply in the German-language part of the corpus, whereas in the English-language part of the corpus it steadily increased. I am wondering whether Chagall’s reputation might have been harmed in Germany even after the Nazi censorship ended with World War II. How could we measure that? Could we even build such a reputation-harm model and estimate its effect on the future frequency of a person’s popularity or the spread of an idea?

wunicoleshuhui commented 4 years ago

I'm interested in the nuances of using this approach to track trends of phrases in published books. Many successful books have different editions published in different years; how do we include these different editions in the corpus? The majority of the content usually stays the same, which might bias the results, but at the same time newer editions often have lengthy new prefaces, introductions, etc. that include newer words. How carefully do we need to account for these issues?
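One place to start would be flagging likely reissues before counting, e.g., by comparing word-shingle overlap between scanned volumes. A minimal sketch (the file names and the idea of a similarity threshold are my own assumptions, not the paper's procedure):

```python
import re

def shingles(text, n=5):
    """Set of word n-gram shingles for one scanned volume."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical file names; a high overlap flags the pair as probable editions
# of the same work, so they could be deduplicated or down-weighted before counting.
first_edition = open("origin_of_species_1859.txt").read()
sixth_edition = open("origin_of_species_1872.txt").read()
print(jaccard(shingles(first_edition), shingles(sixth_edition)))
```

A new preface would lower the overlap only slightly, so near-duplicate detection like this would catch reissues while still letting the genuinely new front matter contribute new words if one decided to keep it.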

rkcatipon commented 4 years ago

I was surprised when the article stated that "52% of the English lexicon – the majority of the words used in English books – consists of lexical “dark matter” undocumented in standard references." It got me thinking about how much of human language is not captured in dictionaries, or even in books at all. There seems to be a fascinating segment of language that evolves outside of the written record.

I recently heard that young women tend to lead innovation in the English language (Thompson, 2015). I am curious whether there is some way to capture the innovation that occurs outside of dictionaries, and that may also dissipate without ever entering standard references. Would it be possible to analyze the divergence between the language of books and everyday exchanges, and also to look for what exacerbates that divide? What social factors may dictate what is and is not recorded in standard references?

-- Thompson, Helen. August 10, 2015. "Teenage Girls Have Led Language Innovation for Centuries", Smart News, Smithsonian Magazine. Retrieved from https://www.smithsonianmag.com/smart-news/teenage-girls-have-been-revolutionizing-language-16th-century-180956216/
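One rough way to start on that divergence question would be to measure how many word types in a text sample fall outside a standard word list. A minimal sketch using NLTK's bundled word list as a stand-in for a dictionary (the sample sentence is made up, and the list only holds base forms, so this overestimates the "dark matter" share):

```python
import re
import nltk

nltk.download("words", quiet=True)   # standard English word list bundled with NLTK
from nltk.corpus import words

dictionary = {w.lower() for w in words.words()}

def dark_matter_share(text):
    """Share of word types in `text` that the word list does not cover."""
    types = set(re.findall(r"[a-z]+", text.lower()))
    undocumented = {t for t in types if t not in dictionary}
    return len(undocumented) / len(types) if types else 0.0

sample = "lowkey this vibe is unmatched, no cap, the rizz is immaculate"
print(f"{dark_matter_share(sample):.0%} of word types fall outside the word list")
```

Running the same measure over, say, social media text versus published books would give one crude estimate of the gap between everyday exchanges and recorded references.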

rachel-ker commented 4 years ago

I'm quite fascinated by this article and how it explores the quantification of culture through word counts over time. I think this would be even more fascinating today, when user-generated content is prevalent on social media and blogs. In that case, how would we deal with the non-word categories (misspellings, non-alphabetic characters) that the authors excluded from this study? I presume these would be even more prevalent, but they also carry substantial meaning today, as does the inclusion of emoji. Should they be treated as separate words? Can we use stemming-like methods to merge them with proper words?
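One possible starting point is to normalize tokens before counting: keep each emoji as its own token type and collapse inflectional variants with a stemmer. A minimal sketch (the emoji ranges are a rough assumption, and how to merge misspellings is deliberately left open, since that is exactly the question above):

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji ranges, not exhaustive

def normalize(token):
    """Keep each emoji as its own token; stem alphabetic tokens; defer on the rest."""
    if EMOJI.fullmatch(token):
        return token                        # option: treat emoji as words in their own right
    if token.isalpha():
        return stemmer.stem(token.lower())  # merges variants like 'runs'/'running' -> 'run'
    return None                             # misspellings, numbers, etc. need a separate policy

post = "Running late again 😤 but the runs were worth it"
tokens = re.findall(r"\w+|" + EMOJI.pattern, post)
print([n for n in map(normalize, tokens) if n])
```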

ccsuehara commented 4 years ago

This is indeed a fascinating article. I was wondering how much of a gold mine this source of information is. The authors draw several examples of how they applied culturomics in just a few pages. Do you have any interesting examples, or research ideas, about how to use these millions of digitized books? Thanks!

di-Tong commented 4 years ago

Like @bjcliang-uchi, I feel that the work demonstrated in this article is more descriptive/exploratory than analytical, and lacks rigor in formulating findings from evidence, just as is mentioned in the 'culturomics' section: "the challenge of culturomics lies in the interpretation of this evidence."

For example, the authors summarize from Figure 5D that "'féminisme' made early inroads in France, but the US proved to be a more fertile environment in the long run." However, the mere change in the frequency of the word "feminism" in the corpus does not necessarily correspond to how fertile the broader environment is. It could be that the more frequent mention of feminism in US books is owing to more criticism of it than in France; then the story about the environment would be entirely different. Maybe a simple sentiment analysis could help us better understand the real situation.
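As a minimal sketch of what that could look like, here is keyword-context sentiment scoring with NLTK's VADER. The sentences are hypothetical; the released n-gram data only go up to 5-grams, so in practice this would need the underlying book text, and "féminisme" would need a French-capable model rather than VADER:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Hypothetical context sentences mentioning the keyword of interest.
sentences = [
    "Feminism opened long-overdue opportunities for women in public life.",
    "Critics dismissed feminism as a dangerous and destabilizing movement.",
]
for s in sentences:
    score = sia.polarity_scores(s)["compound"]  # ranges from -1 (negative) to +1 (positive)
    print(f"{score:+.2f}  {s}")
```

Aggregating such scores by year and country would let one separate "mentioned more" from "discussed more favorably."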

I have several questions/comments: (1) Books are just one part of human information sources and culture, with a specific audience (consider the era when only the upper class was literate, and the current era when cultural consumption happens more on social media). Hence I think we need to be cautious about making grand arguments regarding collective human memory and perceptions based only on book data. (2) How can we better capture the different meanings and usages of the same words/phrases when used by different entities in different contexts and historical periods? (3) Could you give us some examples of how we can utilize this counting method to produce supportive preliminary evidence and then combine it with other methods and designs to establish more solid grounds for certain phenomena and relations?

heathercchen commented 4 years ago

This article covers a huge range of topics, and it seems to me that every section could be further explored in a book of its own. My comments and initial opinions are quite similar to those of @di-Tong: we cannot have a panorama of a certain period or area only by analyzing documented materials. Also, regarding the "censorship" section, I have a few further comments. George Orwell wrote in 1984 that, in a highly repressive regime, people use "fewer" words not only because some are forbidden but also because the meanings of existing expressions are narrowed. In this section, the authors use the surge of some specific words or the disappearance of others to reflect "censorship" or "suppression." But are there any better ways to capture this complicated social phenomenon other than counting frequencies? Or, in other words, can word frequencies be investigated further?

tzkli commented 4 years ago

This article shows a few promising avenues of research using "high through-put data collection and analysis" (p. 5). The authors rely on digitized English-language books for their analysis. I would be interested in how this could be scaled up to be cross-language and cross-cultural. Can we use existing machine translation to homogenize data from different languages for cross-cultural research? What are the promises and challenges of adopting this approach?

YanjieZhou commented 4 years ago

I notice that this article covers many topics, varying from lexicology to censorship, and, restricted by the capacity of a single article, it does not delve deeply into each of them. Hence, I am wondering which topic we can best investigate, and where we are most likely to make breakthroughs, using such a large corpus of around 5 million books.

sanittawan commented 4 years ago

This is a fun and short article to read, but I'm with @bjcliang-uchi on the censorship part. The paper seems to make a causal claim, attributing the fewer mentions of "Marc Chagall" in German books to Nazi censorship. By only counting frequencies, how can we be sure that the cause is censorship? I accept that it is plausible that it is indeed due to Nazi censorship; however, the research method could have been more rigorous, especially when making a causal claim.

Lizfeng commented 4 years ago

The application of culturomics to the detection of censorship and suppression is particularly intriguing. The method of detecting censorship by comparing two different countries' corpora sheds light on how we can trace forgotten history and reveal the truth. Culturomics could complement conventional historical approaches and give us opportunities to reconstruct our past. The research also finds that "we are forgetting our past faster with each passing year and it is accompanied by a more rapid assimilation of the new". I think this argument is a great starting point for research on the information/knowledge explosion of recent years. How technology has changed people's view of the past, and how it has transformed our knowledge structure, would be great topics to look into.

deblnia commented 4 years ago

I have two questions, one about the definition of an n-gram here and another about the theoretical motivations.

  1. N-gram usage frequencies are computed, in this piece, by "dividing the number of instances of the n-gram in a given year by the total number of words in the corpus in that year." The units and use of that seem confusing to me. What's the point of a resulting metric with the unit n-grams per word? Is this still accurate for bigrams and trigrams? Wouldn't it make more sense to have a measure of relative word frequency (i.e., instances of a word / total words)? (See the sketch after this list.)
  2. The underlying data here are really interesting (and it's super cool that the authors have made them so accessible), but the theoretical motivations for data sampling and use seem scattered. For the account of fame, for example, they use Wikipedia as a source of famous people. For censorship, they use the Nazis' list of banned authors. In each case, the secondary source of data could be problematized: Wikipedia has a gender problem, and, well, how do we know that the noted decline in mentions is due to censorship and not to a general cultural shift towards more fascist tastes? Is there a way to normalize the data that is conducive to assigning causality?
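For what it's worth, the per-year normalization described in point 1 amounts to a few lines; a minimal sketch with made-up counts (the divisor is the total number of 1-gram tokens that year, which is why the metric is often read as a "usage share"):

```python
# Toy yearly data: occurrences of an n-gram and total 1-gram tokens in that year.
ngram_counts = {1950: 120, 1960: 340, 1970: 910}            # e.g., counts of "feminism"
total_tokens = {1950: 2_000_000, 1960: 2_500_000, 1970: 3_100_000}

def relative_frequency(counts, totals):
    """Usage share per year: instances of the n-gram / total words that year."""
    return {year: counts[year] / totals[year] for year in counts}

for year, freq in relative_frequency(ngram_counts, total_tokens).items():
    print(f"{year}: {freq:.2e}")
```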

alakira commented 4 years ago

I admit that this kind of large data could contribute to the study of detecting censorship and suppression. However, I still doubt the possibility of learning about culture quantitatively from the dataset alone, since the readership of each book is unknown. Is there any reason to assume that each book's contribution to culture carries the same weight?
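One way to relax that equal-weight assumption, if readership or circulation estimates existed, would be to weight each book's counts by its estimated audience. A minimal sketch with hypothetical numbers (the counts and reader figures are invented for illustration):

```python
from collections import Counter

# Hypothetical per-book word counts and readership estimates.
books = [
    {"counts": Counter({"liberty": 12, "empire": 3}), "readers": 50_000},
    {"counts": Counter({"liberty": 1,  "empire": 9}), "readers": 500},
]

def weighted_frequency(books):
    """Word frequencies where each book contributes in proportion to its readership."""
    weighted, total = Counter(), 0
    for book in books:
        total += sum(book["counts"].values()) * book["readers"]
        for word, c in book["counts"].items():
            weighted[word] += c * book["readers"]
    return {w: v / total for w, v in weighted.items()}

print(weighted_frequency(books))  # widely read books dominate the estimate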

skanthan95 commented 4 years ago

1) On page 180, Michel et al. (2010) explain how they measure fame: they count how often a particular person was mentioned in their corpus in a given year. I wonder whether the number of times a person was mentioned in one year was adjusted for the number of books published that year; I expect that the number of books published increased vastly between 1800 and 1950. Furthermore, I expect that in the early 19th century a person did not necessarily have to be mentioned in a book to be "famous." Many people couldn't afford books at that time, or couldn't even read; nevertheless, they knew about important people of their time, such as the king, the pope, or a local authority, through, e.g., word of mouth. Conversely, there may have been heroes and heroines of the common people (think of Robin Hood) whom we would not be able to capture well from printed (i.e., at the time, mostly elite) sources.

2) The authors chose a corpus of 5 million books from a body of 15 million books based on the quality of optical character recognition and metadata. If optical character recognition was an important selection mechanism, what does that tell us about the corpus? Is it biased towards books with a large, clear typeface, such as children's books, and does it perhaps exclude books in Gothic script? How does that impact the findings?

luisesanmartin commented 4 years ago

The article is clearly interesting and innovative, and it certainly sheds light on the evolution and trends of human culture through the analysis of word usage in the corpus. Nonetheless, I was wondering whether we shouldn't somehow take into account the popularity of each individual text analyzed, perhaps via the number of printed copies or a similar indicator. Also, I was wondering what kinds of computational issues the authors might have run into. It would certainly be interesting to learn about that, and about how they were sorted out.

cindychu commented 4 years ago

This article revealed very interesting facts about human language and culture based on text mining of a huge collection of literature and books. One of its main findings is the dramatic increase in the number of words in human language, and I am curious which parts of speech (POS) drive this growth and how words of different POS change over time. I would guess that verbs are generally more stable and increase more slowly, compared with nouns.
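A rough way to probe this would be to tag the word types that first appear in each period and tally their parts of speech. A minimal sketch using NLTK's off-the-shelf tagger (the word list is invented, tagging isolated words out of context is noisy, and the resource name may differ in newer NLTK versions):

```python
from collections import Counter
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # newer NLTK may call this *_eng

# Hypothetical word types that first appear in the corpus in a given period.
new_types_1990s = ["email", "blog", "google", "download", "rebrand", "multitask"]

def pos_breakdown(word_types):
    """Tally coarse POS tags for a list of isolated word types."""
    tags = [tag for _, tag in nltk.pos_tag(word_types)]
    coarse = ["NOUN" if t.startswith("NN") else "VERB" if t.startswith("VB") else t
              for t in tags]
    return Counter(coarse)

print(pos_breakdown(new_types_1990s))
```

Comparing these tallies across periods would show whether vocabulary growth is driven mostly by new nouns, as the comment above conjectures.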

luxin-tian commented 4 years ago

This short article presents a series of fascinating visualizations from a giant text-analysis project and draws conclusions about the evolution of cultural trends. However, as there is very limited analysis detailing the inference underlying the conclusions beyond intuitive visualizations of the data, I personally doubt the rigor of some assertions, especially those made in a somewhat causal-inference style. For example:

> [The Tiananmen] Square incidents both led to elevated discussion in English texts (scale shown on the right). Response to the 1989 incident is largely absent in Chinese texts (blue, scale shown on the left), suggesting government censorship.

There is no evidence presented supporting the view that it was the 1989 Tiananmen incident that led to elevated discussion in English texts. Even though such an inference may seem quite obvious, this kind of quantitative measure of cultural trends is exposed to the risk of being exaggerated or underestimated.

sunying2018 commented 4 years ago

I am very impressed by the diversity of fields investigated in this article. But this kind of diversity, in fact, limits its capacity for detailed analysis. For most of the conclusions the article mentions, we need further evidence to support the causal inference. Some simultaneous phenomena may be mere coincidence, or the relationships among them may weaken once we consider more factors, which calls for more rigorous analysis to eliminate the influence of those other factors.

vahuja92 commented 4 years ago

The wealth of knowledge that Google Books provides for natural language processing is quite incredible. I was curious whether Google chooses which books it wants scanned, and if so, whether this could cause selection bias.

It looks like Google Books was mired in a decade-long lawsuit about "orphan books," books that are out of print but still under copyright. Google eventually won this lawsuit, but it undoubtedly impacted which books Google scanned and added to the corpus. It's fair to guess that books out of print are no longer as culturally relevant today, so they would perhaps tell us something unique about the time they were published. This points to the fact that there is probably some selection bias in the corpus, and thus in the study. I would be interested in hearing more about the impact of selection on the findings.

arun-131293 commented 4 years ago

Although the authors talk about n-grams, most of their analysis is based on 1-grams, which makes the work seem very exploratory. This is understandable given the size of the text and the large number of threads the authors follow; nevertheless, they also make broad assertions for which a 1-gram analysis is not sufficient (and perhaps even an n-gram analysis that accounts for the context in which the words are used would still not be sufficient). Comparing the frequency of "men" and "women" in published texts to infer that women are winning some battle of the sexes is insufficient methodology, since, to begin with, it fails to account for the context in which those words are used.

In addition, to support the conclusions the authors reach, it seems they would also need a more serious, in-depth qualitative perspective. Yes, it is true that the frequency of certain blacklisted authors' names went down in Germany, but is that because of direct censorship or self-censorship (where publishers and authors simply stop talking about certain people)? If the latter played a role, then we can't really call it Nazi censorship, as censorship is a related but distinct phenomenon from self-censorship.

Additionally, the distribution of the categories/genres of books in various eras (which is not considered in the analysis) might explain some of the variance in the results. For instance, "People are getting more famous than ever before, but are forgotten more rapidly than ever" is a statement that assumes the category/genre distribution of books is comparable in every era. This is highly unlikely given the transformative changes in readership over the last few centuries following the industrial revolution, which in turn means the subject matter of books has changed. It's possible that the inverse of the authors' conclusion is true (the people who were likely to be discussed in books were people whose work turned out to be discussed over a long period of time by the populace).

ziwnchen commented 4 years ago

I find the concept of modeling "culturomics" fascinating. Since this article is mainly built on a corpus of digitized books, I am interested in extending the idea to texts that could not be published. Intuitively, getting published is itself a challenge for many texts containing radical thoughts, minority interests, or cult subcultures. Also, being published and preserved until today means a book is a "survivor" in history. However, for cultural studies, texts that were not published, kept, or interpreted might be of great importance as well. They might be the starting points of some historical events, or the underrepresented folk culture of an old, elite society. Is there any way to "predict" what we have lost in history by modeling the available landscape of "culturomics"?

yaoxishi commented 4 years ago

This article is amazing; it introduces a fairly reliable new method to study the evolution of human thought and how humans have understood the world over time. I am curious whether, in addition to counting single words or phrases, there is any way to detect more of the structure of a sentence in order to study how grammar changes. Also, it would be very interesting to connect the patterns or "keywords" in the books to real-world events, to see how people's attitudes towards certain events, or the public's interests, change over time, which could inform education, policy making, etc.

VivianQian19 commented 4 years ago

Michel et al.’s article explores interesting cultural trends through analyzing a large corpus of digitized texts. Their corpus includes not only English but also other languages such as French, German, Russian, and Chinese. While stemming English texts seems to work fine, stemming seems extremely difficult for other languages, such as German, where one compound word might contain what English would express in a whole phrase, or Chinese, where relationships between words are not detectable by adding or removing suffixes. My question is: how is stemming done in languages other than English, such as German and Chinese?
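For what it's worth, off-the-shelf tools treat these cases differently: German has rule-based stemmers (though splitting compounds is a separate, harder problem), while Chinese is usually segmented into words rather than stemmed. A minimal sketch, assuming NLTK and the third-party jieba package are installed:

```python
from nltk.stem.snowball import SnowballStemmer
import jieba  # third-party Chinese word segmentation package

german = SnowballStemmer("german")
# Suffix stripping works, but long compounds are left unsplit.
print([german.stem(w) for w in ["Bücher", "Bücherregale", "gelesen"]])

# Chinese has no inflectional suffixes; the analogous step is word segmentation.
print(jieba.lcut("数字化图书的大规模分析"))  # "large-scale analysis of digitized books"
```

How the paper itself handled non-English morphology is a good question; the sketch only shows what standard libraries offer today.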

chun-hu commented 4 years ago

I'm quite impressed by the scale of the analysis and the methods the article used. However, I imagine that cultural patterns can also be reflected in measures other than word counts. What kinds of measurements could we use to examine this topic?

meowtiann commented 4 years ago

> I am quite impressed by the overall approaches but I am not convinced by the censorship part.
>
> • It seems that they did not control for any temporal, cultural, or socio-economic factors when calculating the suppression index. The index is computed by dividing a word's frequency during the target period (1933-1945) by its mean frequency in the periods before and after the target period (the pre-period being 1925-1933). But the target period is also a time of war: the attention of the media, even without any form of censorship, would be expected to shift from arts and literature to military and political content. The measured level of suppression is therefore not accurate.
> • They also detect censorship by comparing word frequency across countries: if the frequency of "Tiananmen" increases in the U.S. but decreases in China, it suggests censorship. However, what we know as the ground truth is that the word "Tiananmen," as a symbol of the Chinese government, is used to describe not just the two incidents but many other things, including a series of national military parades. Therefore, the accuracy of detecting censorship using the methods they suggest seems quite low.
>
> I think, in general, that despite the amazing size of the text corpus, most of the methods they applied are exploratory rather than analytical (i.e., mostly just counting words). This is understandable given how much computing power is needed to do any analysis on such a tremendous dataset, but these exploratory methods do not, as @clk16 says, "follow a coherent theoretical thread." I am wondering whether there are any cases where more rigorous social science implications have been derived from a similarly sizable dataset.

I think your perspective is valid, but the method in this paper is easy to apply. This paper merely looks at the frequencies of famous people's names, which is applicable to any period of time in any country. The approach you suggest would be much more useful if we wanted to understand, for example, WW2 and censorship during WW2.
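To make the quoted procedure concrete, here is a minimal sketch of a suppression index computed from yearly usage frequencies (the frequencies and the baseline year ranges below are illustrative assumptions, not the paper's exact choices):

```python
def mean_frequency(freq_by_year, start, end):
    """Average yearly relative frequency over [start, end]."""
    years = [y for y in freq_by_year if start <= y <= end]
    return sum(freq_by_year[y] for y in years) / len(years)

def suppression_index(freq_by_year, target=(1933, 1945),
                      before=(1925, 1932), after=(1946, 1953)):
    """Frequency during the target period divided by the mean of the surrounding
    baseline periods; values well below 1 suggest suppression."""
    baseline = (mean_frequency(freq_by_year, *before) +
                mean_frequency(freq_by_year, *after)) / 2
    return mean_frequency(freq_by_year, *target) / baseline

# Hypothetical yearly frequencies for a single author's name in the German corpus.
freqs = {y: (1e-7 if 1933 <= y <= 1945 else 8e-7) for y in range(1925, 1954)}
print(round(suppression_index(freqs), 3))  # well below 1 -> candidate for suppression
```

The wartime-attention objection quoted above amounts to saying this baseline is confounded; controlling for genre shares or overall topic shifts per year would require adding covariates rather than this simple ratio.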

meowtiann commented 4 years ago

Sorry, let me rephrase this. It's about the scope of a study and a paper. This paper presents the method as a whole; it is not specifically dedicated to censorship detection for a particular period of time. The method given can be used even without a profound understanding of an event and its historical and cultural background. But I agree that for someone focusing on one event, this is not enough; it is only a bigger framework that needs to be adjusted and customized for Tiananmen or WW2.

cytwill commented 4 years ago

I feel that the huge database generated from digitized text can provide some clues about cultural change from a linguistic angle, but I think more explanation of how these changes happened needs to be investigated. Otherwise, we just see some phenomena (some quite intuitive) but cannot make further theoretical or practical contributions to the study of cultural transition.

Also, there is a point I do not really agree with: that we should remove obsolete words from dictionaries. I think that is a kind of disrespect to past culture, and people would still need such references to old words when they analyze past contexts.

kdaej commented 4 years ago

I found this article very interesting in that it tries to make inferences about culture using a wide range of books in different languages. However, I think this study does not fully account for what might be driving these cultural changes. I wonder what additional information they would have gained by examining the writers and publishers of the books in their corpus. Who are these writers and who are these books written for? Can answers to these questions give more information about cultural evolution?

jsmono commented 4 years ago

The authors approached the problem of censorship in a very innovative way. One thing I'm a bit confused about is how they recognized people who were suppressed. Why did their method work? Was the calculation linked back to where the words came from? They explained it with a purely mathematical approach, and without much background in statistics and probability it was a bit hard for me to follow.