jamesallenevans opened 4 years ago
The main argument in this paper is that word embeddings (and thus high-dimensional vector space models) built from certain texts can capture cultural aspects (meanings, identities, associations, etc.). This claim motivates high-dimensional theorizing of these aspects of the world. However, what can be gleaned from this paper that can be applied to the studies of culture that currently go on in Anthropology departments?
Anthropologist Clifford Geertz argues in his influential “The Interpretation of Cultures” that anthropology should proceed with the ethnographer immersing himself in a specific socio-cultural context to study a specific ritual or routine (funeral rites in Java, colonial mediation of tribal affairs in Algiers, etc.) with an eye towards gleaning something universal about that culture using subjective interpretation (i.e., expanding outwards). This paper seems to go in the opposite direction, building embeddings from a large dataset using opaque methods and then trying to understand the subtleties of the words used.
Do word associations, and the so-called subtle differences between their associations in different cultural contexts, studied using opaquely constructed word embeddings (opaque both in how they are constructed and in how little room they leave for interpretation), leave room for the kind of nuanced interpretive method advocated by Geertz? How do we solve the problem of "underdetermination" of data in Quine's sense of the word, without an embedded interpreter who (however subjective his interpretations) fills the gap between data and theory during the building of the semantic space (and not after the fact), when we build a semantic space based only on text?
How do we hope to understand the “symbolic systems” of meaning through which people make meaning out of their lives, using vector space models, especially the ones used in this paper where a single word has one embedding regardless of the context of use? For instance, the word “play” in “He wanted to play her like a fiddle,” “We watched a play last weekend,” and “Children play at parks” has the same embedding in each case, as does “race” in “She won the race” and “race baiting during the Bush era,” even though the words mean very different things in each case.
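To make that limitation concrete, here is a minimal sketch (with made-up toy vectors) of what a static embedding actually is: a single word-to-vector lookup, so every occurrence of a word gets the same vector no matter the sentence.

```python
# Toy illustration with hypothetical 3-d vectors: a static embedding is
# just a word -> vector dictionary, so "play" maps to one vector in
# every sentence it appears in.
embedding = {
    "play": [0.2, -0.1, 0.7],
    "race": [0.5, 0.3, -0.4],
}

def embed(sentence):
    # Every occurrence of a token maps to the same stored vector.
    return [embedding[w] for w in sentence.lower().split() if w in embedding]

v1 = embed("Children play at parks")
v2 = embed("We watched a play last weekend")
assert v1[0] == v2[0]  # identical vector despite different senses of "play"
```

Contextual models (e.g., BERT-style encoders) instead produce a different vector for each occurrence, which is one response to exactly this concern.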
This paper is really illuminating and shows the great potential of applying word embedding techniques to social science problems. I notice that this paper involves historical comparison between various embeddings. For example, figures 5, 6, and 7 all reveal historical tendencies based on word embeddings of the corpus for the 1900s, 1910s, 1920s, etc., and figure 9 is a validation of this comparison. I am a little curious about the comparability between embeddings of different historical periods: the development of the publishing industry, the rise and fall of different countries in the English-speaking world, and the number of published books might all have influenced the form and stability of the embeddings. Does figure 9 aim to solve this problem? What other methods should we use to rule out all the possible noise?
The findings of this paper "at once align and contend with dominant theories of culture in a number of significant ways" (p. 914), a defining characteristic of good research in computational social science according to Prof. Evans. My question relates to the results that the authors acknowledge are not so easily interpretable. These results are usually inconsistent with both sociological theories and our intuitions. But I'm wondering whether leaving some results uninterpreted would affect the validity of the interpretation of the other results. In regression analysis in the social sciences, focusing on results that are consistent with theoretical expectations while leaving out counter-intuitive ones would usually raise eyebrows. Do we also need to address this problem when interpreting the results of word embedding and other text analysis models?
The paper's findings are refreshing and novel in many respects, and I have two (unrelated) questions about it.
Word embeddings are fascinating.
@arun-131293 was kind of getting at this, but I had a similar (but probably more naive) question. This is undoubtedly an elegant application of word embeddings, and while the interpretation of the dimensions is carefully grounded in established theory (as it needs to be in any kind of PCA-like process -- Bourdieu especially felt relevant to me), there seems to be some boundary work required to articulate the use of this to people who study culture on a micro-level.
Is the benefit of this method merely scale (e.g. that we can load millions of texts, texts that would take years to read and parse in this way, into a word embedding model for ready conclusions)? That seems to run into the noted limitation of sampling, and what texts you load. There's also a weird chicken and egg situation, because you need that theoretical work cited in the introduction to interpret the dimensions of the word embedding model. So which comes first?
Tl;dr: How would the authors explain the use of this approach to ethnographic students of culture?
While the authors address issues regarding bias in the corpus used (books digitized by Google are not a random sample of all texts), and the paper further notes that a large corpus is needed for word embeddings to produce useful models, I am wondering whether word embeddings could be a tool to compare bias between cultures. Specifically, while I know there is no ground truth of culture, could we train word embedding models on different corpora from different cultures and compare the same words on the same dimensions to understand the differences in biases between them?
On page 936, the authors list out some of the pairs of words used to create dimensions. The primary method they use is antonym pairs. Are there ways to create dimensions for a concept without needing antonyms?
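The antonym-pair method described above can be sketched in a few lines. In this toy version (all vectors are hypothetical, not from the paper's trained models), a cultural dimension is the average of the vector differences across antonym pairs, and a word's cultural position is its cosine projection onto that dimension:

```python
# Minimal sketch of building a cultural dimension from antonym pairs
# and projecting a word onto it. All vectors below are made-up toy
# values for illustration only.
import math

def sub(a, b): return [x - y for x, y in zip(a, b)]
def norm(a): return math.sqrt(sum(x * x for x in a))
def cos(a, b): return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

vec = {  # hypothetical embeddings
    "rich": [0.9, 0.1, 0.2], "poor": [-0.8, 0.0, 0.3],
    "affluence": [0.8, 0.2, 0.1], "poverty": [-0.9, 0.1, 0.2],
    "opera": [0.5, 0.4, 0.1],
}

pairs = [("rich", "poor"), ("affluence", "poverty")]
diffs = [sub(vec[a], vec[b]) for a, b in pairs]
# Average the pair differences into a single "affluence" dimension.
dimension = [sum(c) / len(diffs) for c in zip(*diffs)]

# Positive projection = closer to the "rich" pole on this toy dimension.
position = cos(vec["opera"], dimension)
```

Without antonyms, one alternative sometimes used is to average the vectors of seed words for each pole of a concept and take the difference of the two pole centroids, which relaxes the requirement of strict one-to-one antonym pairs.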
Word2vec has become a well-known algorithm for word embedding. It would be interesting to know whether the unintentional reaffirmation of existing biases can be treated as a form of overfitting, and whether there are existing practices for modifying the algorithm that can reduce these biases.
I enjoy this paper a lot! My questions are:
This paper was really interesting and demonstrated the power of word embeddings in revealing semantic relations along the dimensions of gender, race, and class. Especially for race, there are nuances that are not captured by the black-white dimension, such as mixed race, Asian, or Latino identities. I wonder whether we would then need an (n-1)-dimensional space to map a categorical variable with n levels?
Also, the word2vec models used in this paper are from sources where the relatively standard English is likely to dominate. Do these models work well even with social media, or text messaging data that tend to have many more short forms and include emojis?
Appendix part H compares the meaning of class in general discourse with the meaning of class in sociological discourse. For the coding homework this week, I analyzed how people speak about ‘work.’ I found that some words that sociologists use to discuss work, e.g., ‘salary’ and ‘solidarity,’ did not appear in my dataset. It was easy to replace the former, e.g., with ‘dollar’ and ‘money,’ and hence to find the corresponding words in the general discourse. However, I still struggle to find words that the interviewees use to invoke ‘solidarity.’ How can one deal with such a situation?
This paper's analysis of the cultural dimensions of class using word embeddings is really fascinating. I'm very interested in the dimension reduction techniques mentioned in the paper. On page 910: "From efficiency considerations, SVD placed strict upper limits on the number of documents and lower limits on the size of semantic contexts they could factorize". The upper limits here are quite straightforward to understand, but I am wondering why it also sets lower limits on the size of semantic contexts?
I think the paper's method of analyzing culture is fascinating, especially the part that discusses why certain related words contradict intuitions. The authors point out that "multiple dimensions of the class identified in sociological theory comprise a complex yet stable semantic structure," which led them to build a complex structure for the analysis. But is there a peak in accuracy in relation to the number of dimensions? How can we know what number suits the structure best? Is there a method, like the silhouette diagram we learned last week, to determine the best number?
This paper introduces a really exciting application of word embedding in revealing social changes. I have a few questions in regards to the word embedding technique:
As the paper explains the methods of measuring distances between words in an embedding space, the authors suggest that "...[cosine similarity] is preferred to the Euclidean (straight-line) distance due to properties of high-dimensional spaces that violate intuitions formed in two or three dimensions". Can anyone help explain this more explicitly? As far as I understand, as long as "we normalize all word vectors such that they lie on the surface of a hypersphere of the same dimensionality as the space", cosine similarity is monotonically related to Euclidean distance, since all the words reside on the "surface of the hypersphere".
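The commenter's algebra checks out: for unit vectors u and v, ||u - v||^2 = 2 - 2*cos(u, v), so after normalization the two measures induce the same ranking (the "violated intuitions" bite mainly when vectors are unnormalized and magnitude tracks word frequency). A quick numeric check of the identity:

```python
# For unit vectors: squared Euclidean distance = 2 - 2 * cosine similarity,
# so Euclidean distance is a monotone function of cosine similarity on
# the hypersphere.
import math
import random

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

random.seed(0)
u = unit([random.gauss(0, 1) for _ in range(50)])
v = unit([random.gauss(0, 1) for _ in range(50)])

cos_sim = sum(a * b for a, b in zip(u, v))
euclid_sq = sum((a - b) ** 2 for a, b in zip(u, v))
assert abs(euclid_sq - (2 - 2 * cos_sim)) < 1e-9
```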
Why does the SVD method used earlier for word embedding give the desired result? Can one interpret the result of singular value decomposition in the same way as for the word2vec method? If so, why can SVD give a space in which any two close vectors represent two words with similar semantic meaning?
Fascinating application of word embeddings to sociological puzzles! I wonder whether it is possible to build dimensions in a non-binary fashion (beyond owner-worker), since contemporary class/stratification theories hardly conceptualize a society with only two classes, especially in terms of occupation. Maybe by subtracting the lowest class's vector from each of the other classes' vectors?
I wonder how we should take semantic change into consideration in these diachronic analyses. And if the meanings of words changed over time, how should we distinguish semantic changes from social changes?
This is a very useful paper and a very fascinating topic! Regarding the use of words with cultural associations in embedding models, how often are the lists of such words updated? Since new words emerge and die out much more quickly on social media in the digital age, are there ways to translate that into compiling such word lists for research?
Word embedding, unlike raw word co-occurrence, generates a continuous ‘distribution,’ in both direction and distance between words, based on their ‘context.’ However, I am always wondering how to extract relational information from the ‘context’ properly. How are distances computed from the ‘context’ of words?
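One way to see how "context" turns into geometry is the simplest count-based version: tally how often each word co-occurs with others inside a sliding window, treat each word's count profile as its vector, and compare vectors by cosine. (word2vec instead *learns* dense vectors by predicting context words, but the intuition is the same: words with similar contexts end up close.) A toy sketch:

```python
# Count-based sketch: words sharing contexts get similar vectors.
import math
from collections import defaultdict

tokens = "the cat sat on the mat the dog sat on the rug".split()
window = 2

# Co-occurrence counts within +/- `window` positions.
counts = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            counts[w][tokens[j]] += 1

vocab = sorted(set(tokens))

def vec(w):
    return [counts[w][c] for c in vocab]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "cat" and "dog" share contexts ("the", "sat", "on"), so they are close.
similarity = cos(vec("cat"), vec("dog"))
```

In practice the raw counts are reweighted (e.g., by PMI) and factorized (e.g., by SVD, as in the paper's discussion of earlier methods) before distances are taken.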
As @arun-131293 has pointed out, this paper's central argument is that word embeddings built from certain texts can capture certain cultural aspects, such as identities and associations - including the gendered valence of a word:
"…Similarly, the researcher can determine if “jazz” is more masculine or feminine than “opera” by projecting these words onto the dimension corresponding to gender in the same space."
I'm thinking of Boroditsky et al.'s 2003 publication, and wondering how you might control for the effects of grammatical gender on object (or concept) perception across languages (e.g., speakers of a language that genders "key" as masculine might implicitly view the object as more masculine than speakers of a language that genders it as feminine) -- and whether you think this concern is valid, given that this is a point of contention in the linguistics world.
Using the word embedding method to detect high-level social changes from large-scale text is fascinating! I am wondering: beyond capturing social change, which is at the level of phenomena, is there any way to apply word embedding or other text analysis techniques to capture the mechanisms behind these phenomena, or to offer further interpretations?
This paper is really interesting and provides a unique way to understand how the cultural meaning of words has changed throughout time and across cultures. I wonder how/whether word embedding techniques can assist with understanding the cause of such changes. As a few other posts point out, anthropological cultural research often dives deeply into one cultural phenomenon to understand it and broader cultural phenomena as well. Is it possible to go deeper and understand why certain language changes have occurred using word embedding techniques?
This paper shows how powerful the word embedding approach is for capturing embedded culture that would otherwise be difficult to discern by hand. Since the results seem to depend heavily on corpus pre-processing, how would the same analysis change with different subsets of the data, different weights on each data source, or different lemmatization of each word?
This paper is inspiring in that it shows us an interesting union of social science and machine learning via word embedding. By looking at how bias changes throughout a certain historical context, this method is able to track changes over time. But I also wonder: in comparisons across time, how do we determine the "baseline" word embedding?
There are a lot of great questions and comments here in terms of the paper's theoretical construct and operationalization of notions of gender, race, and social class. My question is actually less about theory and more about implementation. I'm trying to wrap my head around the processing of the Google NGrams dataset, and the millions of books published over 100 years.
What was the processing pipeline like for this project considering the immense size of the data? In class we deal with collections that are a few thousand observations, but for truly large-scale analysis done in the article, what sort of data engineering is necessary to query, analyze, and store information? Was data moved to a SQL database, was analysis done in batches, and if so, how was batch processing set up?
As I begin to work with the NOW Davies corpora, which is pretty large, it would be helpful to learn more!
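One common pattern for corpora too large for memory (a hedged sketch of a generic approach, not the authors' actual pipeline, which the paper does not detail) is to stream files line by line with a generator and feed the model fixed-size batches, so only the current batch is ever held in RAM:

```python
# Generator-based streaming: nothing but the current batch lives in memory.
from itertools import islice

def stream_tokens(paths):
    """Yield one tokenized document/sentence at a time from disk."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

def batches(iterator, size):
    """Group any iterator into lists of at most `size` items."""
    it = iter(iterator)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Usage (hypothetical file names and model object):
# for batch in batches(stream_tokens(["ngrams_1900s.txt"]), 10_000):
#     model.train(batch)  # e.g., incremental word2vec updates
```

Libraries like gensim support exactly this kind of iterable-corpus training, so a SQL database is not strictly required; whether the authors used one, the paper does not say.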
The chicken-and-egg paradox is really interesting. But I guess qualitative preliminary/pilot studies can come first, or daily experiences and thoughts, not necessarily well-constructed theories. Still, I think the better the theory, the better the starting point we have for word embedding.
I think biases in the collection of corpora cannot be avoided, because a large corpus requires combining many texts, which inevitably involves collectors' subjective decisions about which material is proper. But as far as I am concerned, this fact still does not disqualify the corpus's value for uncovering what the related culture is about.
I think this paper resembles “Semantics derived automatically from language corpora contain human-like biases.” Both use word embedding techniques to explore cultural biases in people's use of words. Regarding these biases, I share a similar notion with @heathercchen: I am wondering whether there is any other evidence to support the evolution of the biases. For example, if the words close to African-American on the race dimension become gradually less attached to "working class" on the affluence dimension, we might be able to compare the income levels of African-Americans or other socioeconomic indices to see whether people's language preferences are ahead of, consistent with, or lag behind the real evolution of different social groups.
I find this paper rather interesting and inspiring. One question I'm wondering about: when choosing and building cultural dimensions, how could we know that those dimensions are necessary and sufficient for measuring cultural concepts and cultural dynamics?
This study can generalize its results only to a relatively “elite literary public” rather than the general public. I am curious how the results might differ if the data were produced by the general public. I understand that it is difficult to retrieve such data from the 1900s. However, it is at least possible to obtain texts from the 2000s and compare public data with elite data to examine how similar or different they are. This could be useful for making inferences about whether any “classes” are missing or underrepresented in the elite data.
The idea of using word embeddings to explain social phenomena in this paper is fascinating. My question is similar to that of @laurenjli. How does one generate a list of word pairs and how can you be sure that the list is extensive/exhaustive?
I’m a bit confused about the various validation approaches used in the article. If I understand correctly, the MTurk surveys are used for general validation of the word embedding approach. For contemporary validation, the authors use the Google Ngram corpus from 2000 to 2012, but isn’t Google Ngrams used for training the word embedding models? And since the 1958 social psychology study is used to validate the word embedding model trained on the 1950-1959 Google Ngram corpus, how are the other decades of the corpus validated, given that the Google Ngram corpus ranges from 1900 to 1999? And why do we need to differentiate between the decades before and after 1999 (where different cultural dimensions are constructed for validation)?
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84(5):905-949.