jamesallenevans opened 4 years ago
The main argument in this paper is that word embeddings (and thus high-dimensional vector space models) built from certain texts can capture cultural aspects (meanings, identities, associations, etc.). This claim motivates high-dimensional theorizing of these aspects of the world. However, what can be gleaned from this paper that can be applied to the studies of culture that currently go on in Anthropology departments?
Anthropologist Clifford Geertz argues in his influential “The Interpretation of Cultures” that anthropology should proceed with the ethnographer immersing himself in a specific socio-cultural context to study a specific ritual or routine (funeral rites in Java, colonial mediation of tribal affairs in Algiers, etc.) with an eye towards gleaning something universal about that culture using subjective interpretation (i.e., expanding outwards). This paper seems to go in the opposite direction, building embeddings from a large dataset using opaque methods and then trying to understand the subtleties of the words used.
Do word associations, and the so-called subtle differences between their associations in different cultural contexts, studied using opaquely constructed word embeddings (opaque both in how they are constructed and in how little room they leave for interpretation), leave room for the kind of nuanced interpretive method advocated by Geertz? How do we solve the problem of "underdetermination" of data in Quine's sense of the word, without an embedded interpreter who (however subjective his interpretations) fills the gap between data and theory during the building of the semantic space (and not after the fact), when we build a semantic space based only on text?
How do we hope to understand the “symbolic systems” of meaning through which people make meaning out of their lives, using vector space models, especially the ones used in this paper where a single word has one embedding regardless of the context of use? For instance, the word “play” in “He wanted to play her like a fiddle,” “We watched a play last weekend,” and “Children play at parks” has the same embedding in each case, as does “race” in “She won the race” and “race baiting during the Bush era,” even though the words mean very different things in each case.
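To make that limitation concrete, here is a minimal sketch (with made-up toy vectors) of what a static embedding actually is: a single word-to-vector lookup, so every occurrence of a word gets the same vector no matter the sentence.

```python
# Toy illustration with hypothetical 3-d vectors: a static embedding is
# just a word -> vector dictionary, so "play" maps to one vector in
# every sentence it appears in.
embedding = {
    "play": [0.2, -0.1, 0.7],
    "race": [0.5, 0.3, -0.4],
}

def embed(sentence):
    # Every occurrence of a token maps to the same stored vector.
    return [embedding[w] for w in sentence.lower().split() if w in embedding]

v1 = embed("Children play at parks")
v2 = embed("We watched a play last weekend")
assert v1[0] == v2[0]  # identical vector despite different senses of "play"
```

Contextual models (e.g., BERT-style encoders) instead produce a different vector for each occurrence, which is one response to exactly this concern.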
This paper is really illuminating and shows the great potential of applying word embedding techniques to social science problems. I notice that this paper involves historical comparison between various embeddings. For example, figures 5, 6, and 7 all reveal historical tendencies based on word embeddings of the corpus for the 1900s, 1910s, 1920s, etc., and figure 9 is a validation of this comparison. I am a little curious about the comparability between embeddings of different historical periods: the development of the publishing industry, the rise and fall of different countries in the English-speaking world, and the number of published books might all have influenced the form and stability of the embeddings. Does figure 9 aim to solve this problem? What other methods should we use to rule out all the possible noise?
The findings of this paper "at once align and contend with dominant theories of culture in a number of significant ways" (p. 914), a defining characteristic of good research in computational social science according to Prof. Evans. My question relates to the results that the authors acknowledge are not so easily interpretable. These results are usually inconsistent with both sociological theories and our intuitions. But I'm wondering whether leaving some results uninterpreted would affect the validity of the interpretation of the other results. In regression analysis in the social sciences, focusing on results that are consistent with theoretical expectations while leaving out counter-intuitive ones would usually raise eyebrows. Do we also need to address this problem when interpreting the results of word embedding and other text analysis models?
The paper's findings are refreshing and novel in many respects, and I have two (unrelated) questions about it.
Word embeddings are fascinating.
@arun-131293 was kind of getting at this, but I had a similar (but probably more naive) question. This is undoubtedly an elegant application of word embeddings, and while the interpretation of the dimensions is carefully grounded in established theory (as it needs to be in any kind of PCA-like process -- Bourdieu especially felt relevant to me), there seems to be some boundary work required to articulate the use of this to people who study culture on a micro-level.
Is the benefit of this method merely scale (e.g. that we can load millions of texts, texts that would take years to read and parse in this way, into a word embedding model for ready conclusions)? That seems to run into the noted limitation of sampling, and what texts you load. There's also a weird chicken and egg situation, because you need that theoretical work cited in the introduction to interpret the dimensions of the word embedding model. So which comes first?
Tl;dr: How would the authors explain the use of this approach to ethnographic students of culture?
While the authors address issues regarding bias in the corpus used (books digitized by Google are not a random sample of all texts), and the paper further notes that a large corpus is needed for word embeddings to produce useful models, I am wondering whether word embeddings could be a tool to compare bias between cultures. Specifically, while I know there is no ground truth of culture, could we train word embedding models on different corpora from different cultures and compare the same words on the same dimensions to understand the differences in biases between them?
On page 936, the authors list out some of the pairs of words used to create dimensions. The primary method they use is antonym pairs. Are there ways to create dimensions for a concept without needing antonyms?
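The antonym-pair method described above can be sketched in a few lines. In this toy version (all vectors are hypothetical, not from the paper's trained models), a cultural dimension is the average of the vector differences across antonym pairs, and a word's cultural position is its cosine projection onto that dimension:

```python
# Minimal sketch of building a cultural dimension from antonym pairs
# and projecting a word onto it. All vectors below are made-up toy
# values for illustration only.
import math

def sub(a, b): return [x - y for x, y in zip(a, b)]
def norm(a): return math.sqrt(sum(x * x for x in a))
def cos(a, b): return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

vec = {  # hypothetical embeddings
    "rich": [0.9, 0.1, 0.2], "poor": [-0.8, 0.0, 0.3],
    "affluence": [0.8, 0.2, 0.1], "poverty": [-0.9, 0.1, 0.2],
    "opera": [0.5, 0.4, 0.1],
}

pairs = [("rich", "poor"), ("affluence", "poverty")]
diffs = [sub(vec[a], vec[b]) for a, b in pairs]
# Average the pair differences into a single "affluence" dimension.
dimension = [sum(c) / len(diffs) for c in zip(*diffs)]

# Positive projection = closer to the "rich" pole on this toy dimension.
position = cos(vec["opera"], dimension)
```

Without antonyms, one alternative sometimes used is to average the vectors of seed words for each pole of a concept and take the difference of the two pole centroids, which relaxes the requirement of strict one-to-one antonym pairs.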
Word2vec has become a well-known algorithm for word embedding. It would be interesting to know whether the unintentional reaffirmation of existing biases can be treated as a form of overfitting, and whether there are existing practices for modifying the algorithm that can reduce these biases.
I enjoy this paper a lot! My questions are:
This paper was really interesting and demonstrated the power of word embeddings in revealing semantic relations along the dimensions of gender, race, and class. Especially for race, there are nuances that are not captured by the black-white dimension, such as mixed race, Asian, or Latino identities. I wonder whether we would then need an (n-1)-dimensional space to map a categorical variable with n levels?
Also, the word2vec models used in this paper are from sources where the relatively standard English is likely to dominate. Do these models work well even with social media, or text messaging data that tend to have many more short forms and include emojis?
Appendix part H compares the meaning of class in general discourse with the meaning of class in sociological discourse. For the coding homework this week, I analyzed how people speak about ‘work.’ I found that some words that sociologists use to discuss work, e.g., ‘salary’ and ‘solidarity,’ did not appear in my dataset. It was easy to replace the former, e.g., with ‘dollar’ and ‘money,’ and hence to find the corresponding words in the general discourse. However, I still struggle to find words that the interviewees use to invoke ‘solidarity.’ How can one deal with such a situation?
This paper's analysis of the cultural dimensions of class using word embeddings is really fascinating. I'm very interested in the dimension reduction techniques mentioned in the paper. On page 910: "From efficiency considerations, SVD placed strict upper limits on the number of documents and lower limits on the size of semantic contexts they could factorize". The upper limits here are quite straightforward to understand, but I am wondering why it also sets lower limits on the size of semantic contexts?
I think the paper's method of analyzing culture is fascinating, especially the part that discusses why certain related words contradict intuitions. The authors point out that "multiple dimensions of the class identified in sociological theory comprise a complex yet stable semantic structure," which led them to build a complex structure for the analysis. But is there a peak in accuracy in relation to the number of dimensions? How can we know what number suits the structure best? Is there a method, like the silhouette diagram we learned last week, to determine the best number?
This paper introduces a really exciting application of word embedding in revealing social changes. I have a few questions in regards to the word embedding technique:
As the paper explains the methods of measuring distances between words in an embedding space, the authors suggest that "...[cosine similarity] is preferred to the Euclidean (straight-line) distance due to properties of high-dimensional spaces that violate intuitions formed in two or three dimensions". Can anyone help explain this more explicitly? As far as I understand, as long as "we normalize all word vectors such that they lie on the surface of a hypersphere of the same dimensionality as the space", cosine similarity is monotonically related to Euclidean distance, since all the words reside on the "surface of the hypersphere".
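The commenter's algebra checks out: for unit vectors u and v, ||u - v||^2 = 2 - 2*cos(u, v), so after normalization the two measures induce the same ranking (the "violated intuitions" bite mainly when vectors are unnormalized and magnitude tracks word frequency). A quick numeric check of the identity:

```python
# For unit vectors: squared Euclidean distance = 2 - 2 * cosine similarity,
# so Euclidean distance is a monotone function of cosine similarity on
# the hypersphere.
import math
import random

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

random.seed(0)
u = unit([random.gauss(0, 1) for _ in range(50)])
v = unit([random.gauss(0, 1) for _ in range(50)])

cos_sim = sum(a * b for a, b in zip(u, v))
euclid_sq = sum((a - b) ** 2 for a, b in zip(u, v))
assert abs(euclid_sq - (2 - 2 * cos_sim)) < 1e-9
```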
Why does the SVD method used earlier for word embedding give the desired result? Can one interpret the result of singular value decomposition in the same way as for the word2vec method? If so, why can SVD give a space in which any two close vectors represent two words with similar semantic meaning?
Fascinating application of word embeddings to sociological puzzles! I wonder whether it is possible to build dimensions in a non-binary fashion (beyond owner-worker), since contemporary class/stratification theories hardly conceptualize a society with only two classes, especially in terms of occupation. Maybe by subtracting the lowest class's vector from each of the other classes' vectors?
I wonder how we should take semantic change into consideration in these diachronic analyses. And if the meanings of words changed over time, how should we distinguish semantic changes from social changes?
This is a very useful paper and a very fascinating topic! Regarding the use of words with cultural associations in embedding models, how often are the lists of such words updated? Since new words emerge and die out much more quickly on social media in the digital age, are there ways to translate that into compiling such word lists for research?
Word embedding, unlike raw word co-occurrence, generates a continuous ‘distribution,’ in both direction and distance between words, based on their ‘context.’ However, I am always wondering how to extract relational information from the ‘context’ properly. How are distances computed from the ‘context’ of words?
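One way to see how "context" turns into geometry is the simplest count-based version: tally how often each word co-occurs with others inside a sliding window, treat each word's count profile as its vector, and compare vectors by cosine. (word2vec instead *learns* dense vectors by predicting context words, but the intuition is the same: words with similar contexts end up close.) A toy sketch:

```python
# Count-based sketch: words sharing contexts get similar vectors.
import math
from collections import defaultdict

tokens = "the cat sat on the mat the dog sat on the rug".split()
window = 2

# Co-occurrence counts within +/- `window` positions.
counts = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            counts[w][tokens[j]] += 1

vocab = sorted(set(tokens))

def vec(w):
    return [counts[w][c] for c in vocab]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "cat" and "dog" share contexts ("the", "sat", "on"), so they are close.
similarity = cos(vec("cat"), vec("dog"))
```

In practice the raw counts are reweighted (e.g., by PMI) and factorized (e.g., by SVD, as in the paper's discussion of earlier methods) before distances are taken.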
As @arun-131293 has pointed out, this paper's central argument is that word embeddings built from certain texts can capture certain cultural aspects, such as identities and associations - including the gendered valence of a word:
"…Similarly, the researcher can determine if “jazz” is more masculine or feminine than “opera” by projecting these words onto the dimension corresponding to gender in the same space."
I'm thinking of Boroditsky et al.'s 2003 publication, and wondering how you might control for the effects of grammatical gender on object (or concept) perception across languages (e.g., speakers of a language that genders "key" as masculine might implicitly view the object as more masculine than speakers of a language that genders it as feminine) -- and whether you think this concern is valid, given that this is a point of contention in the linguistics world.
Using the word embedding method to detect high-level social changes from large-scale text is fascinating! I am wondering: beyond capturing social change, which is at the level of phenomena, is there any way to apply word embedding or other text analysis techniques to capture the mechanisms behind these phenomena, or to offer further interpretations?
This paper is really interesting and provides a unique way to understand how the cultural meaning of words has changed throughout time and across cultures. I wonder how/whether word embedding techniques can assist with understanding the cause of such changes. As a few other posts point out, anthropological cultural research often dives deeply into one cultural phenomenon to understand it and broader cultural phenomena as well. Is it possible to go deeper and understand why certain language changes have occurred using word embedding techniques?
This paper shows how powerful the word embedding approach is for capturing embedded culture that would otherwise be difficult to discern by hand. Since the results seem to depend heavily on corpus pre-processing, how would the same analysis change with different subsets of the data, different weights on each data source, or different lemmatization of each word?
This paper is inspiring in that it shows us an interesting union of social science and machine learning via word embedding. By looking at how bias changes throughout a certain historical context, this method is able to track changes over time. But I also wonder: in comparisons across time, how do we determine the "baseline" word embedding?
There are a lot of great questions and comments here in terms of the paper's theoretical construct and operationalization of notions of gender, race, and social class. My question is actually less about theory and more about implementation. I'm trying to wrap my head around the processing of the Google NGrams dataset, and the millions of books published over 100 years.
What was the processing pipeline like for this project considering the immense size of the data? In class we deal with collections that are a few thousand observations, but for truly large-scale analysis done in the article, what sort of data engineering is necessary to query, analyze, and store information? Was data moved to a SQL database, was analysis done in batches, and if so, how was batch processing set up?
As I begin to work with the NOW Davies corpora, which is pretty large, it would be helpful to learn more!
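One common pattern for corpora too large for memory (a hedged sketch of a generic approach, not the authors' actual pipeline, which the paper does not detail) is to stream files line by line with a generator and feed the model fixed-size batches, so only the current batch is ever held in RAM:

```python
# Generator-based streaming: nothing but the current batch lives in memory.
from itertools import islice

def stream_tokens(paths):
    """Yield one tokenized document/sentence at a time from disk."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

def batches(iterator, size):
    """Group any iterator into lists of at most `size` items."""
    it = iter(iterator)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Usage (hypothetical file names and model object):
# for batch in batches(stream_tokens(["ngrams_1900s.txt"]), 10_000):
#     model.train(batch)  # e.g., incremental word2vec updates
```

Libraries like gensim support exactly this kind of iterable-corpus training, so a SQL database is not strictly required; whether the authors used one, the paper does not say.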
The chicken-and-egg paradox is really interesting. But I guess qualitative preliminary/pilot studies can come first, or daily experiences and thoughts, not necessarily well-constructed theories. Still, I think the better the theory, the better the starting point we have for word embedding.
I think biases in the collection of corpora cannot be avoided, because a large corpus requires combining many texts, which inevitably involves collectors' subjective decisions about which material is proper. But as far as I am concerned, this fact still does not disqualify the corpus's value for uncovering what the related culture is about.
I think this paper resembles “Semantics derived automatically from language corpora contain human-like biases.” Both use word embedding techniques to explore cultural biases in people's use of words. Regarding these biases, I share a similar notion with @heathercchen: I am wondering whether there is any other evidence to support the evolution of the biases. For example, if the words close to African-American on the race dimension become gradually less attached to "working class" on the affluence dimension, we might be able to compare the income levels of African-Americans or other socioeconomic indices to see whether people's language preferences are ahead of, consistent with, or lag behind the real evolution of different social groups.
I find this paper rather interesting and inspiring. One question I'm wondering about: when choosing and building cultural dimensions, how could we know that those dimensions are necessary and sufficient for measuring cultural concepts and cultural dynamics?
This study can generalize its results only to a relatively “elite literary public” rather than the general public. I am curious how the results might differ if the data were produced by the general public. I understand that it is difficult to retrieve such data from the 1900s. However, it is at least possible to obtain texts from the 2000s and compare public data with elite data to examine how similar or different they are. This could be useful for making inferences about whether any “classes” are missing or underrepresented in the elite data.
The idea of using word embeddings to explain social phenomena in this paper is fascinating. My question is similar to that of @laurenjli. How does one generate a list of word pairs and how can you be sure that the list is extensive/exhaustive?
I’m a bit confused about the various validation approaches used in the article. If I understand correctly, the MTurk surveys are used for general validation of the word embedding approach. For contemporary validation, the authors use the Google Ngram corpus from 2000 to 2012, but isn’t Google Ngrams used for training the word embedding models? And since the 1958 social psychology study is used to validate the word embedding model trained on the 1950-1959 Google Ngram corpus, how are the other decades of the corpus validated, given that the Google Ngram corpus ranges from 1900 to 1999? And why do we need to differentiate between the decades before and after 1999 (where different cultural dimensions are constructed for validation)?
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84(5):905-949.