4. Word Embeddings to Explore Meaning Spaces- orienting

lkcao commented 8 months ago

Post questions here for this week's orienting readings:

Kozlowski, Austin, Matt Taddy, and James Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84(5):905-949.

XiaotongCui commented 7 months ago

This research is very interesting! It also resonates with last Thursday's workshop, where the professor from Berkeley seemed to use a similar method. My question is how large a dataset must be for this research method, because the text mentions: 'word embeddings must be trained on very large corpora if the output vector space is to capture subtle and complex associations of interest to culture analysts.' In practice, how do we rigorously determine that a dataset is sufficiently large? If a corpus is very large, it becomes challenging to find other datasets of comparable size for robustness checks that would show our results are not merely a random outcome.

sborislo commented 7 months ago

I think Kozlowski and (our own) Evans provide a strong case for the use of word embedding analyses to support the investigation of cultural phenomena. Although there are limitations, I have no qualms with them. However, I feel inclined to ask: How accessible are word embedding analyses for most researchers? If one is exploring truly unknown or surprising relationships, how can that person know the analyses were done correctly? (From doing the coding assignments, I can easily see a researcher performing the analyses outlined by the authors, stumbling upon interesting relationships, only to discover those relationships were found due to suboptimal use of the necessary methods)

Twilight233333 commented 7 months ago

I am curious about how the authors define each dimension. In the paper, the authors classify some words, such as opera and jazz, but don't the associations of these words also change over time? For example, most people can now afford opera. Would this weaken the credibility of the authors' classification, and how could the problem be solved?

chanteriam commented 7 months ago

In reading the article, though I understand how word embeddings are created, particularly in defining their relationship to other words similar to analogies, I am still confused on how cultural dimensions and contexts are identified in the vector space, particularly this math:

$$\frac{1}{|P|}\sum_{p \in P}(p_1 - p_2)$$

The article explains that $p_1$ and $p_2$ are all relevant antonym pairs in set P. Is set P the set of words that characterize our cultural dimension (so, with affluence: rich, poor, tennis, steak, etc.)? Additionally, would p be an antonym vector pair like (rich, poor) and $p_1$ and $p_2$ the vectors for rich and poor?
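A minimal sketch of one reading of that formula, in case it helps: P is taken to be the list of antonym pairs (not the words being scored), each p is one pair such as (rich, poor), and $p_1$ and $p_2$ are the two word vectors in that pair. The 3-d vectors below are invented for illustration, the helper names (`normalize`, `dimension`, `projection`) are hypothetical, and normalizing each vector before differencing is an assumption rather than something confirmed in this thread.

```python
import math

# Toy 3-d "embeddings"; the numbers are invented purely for illustration.
vectors = {
    "rich":      [0.9, 0.1, 0.3],
    "poor":      [-0.8, 0.2, 0.3],
    "affluent":  [0.7, 0.0, 0.5],
    "destitute": [-0.9, 0.1, 0.4],
    "tennis":    [0.4, 0.8, 0.1],
    "boxing":    [-0.3, 0.9, 0.1],
}

# P: the set of antonym pairs. Each p is a pair (p_1, p_2).
pairs = [("rich", "poor"), ("affluent", "destitute")]


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dimension(pairs, vectors):
    """Average the differences p_1 - p_2 over every pair in P."""
    size = len(vectors[pairs[0][0]])
    total = [0.0] * size
    for first, second in pairs:
        a, b = normalize(vectors[first]), normalize(vectors[second])
        for i in range(size):
            total[i] += a[i] - b[i]
    return [t / len(pairs) for t in total]


def projection(word, dim, vectors):
    """Cosine similarity between a word vector and the dimension vector."""
    w, d = normalize(vectors[word]), normalize(dim)
    return sum(x * y for x, y in zip(w, d))


affluence = dimension(pairs, vectors)
tennis_score = projection("tennis", affluence, vectors)
boxing_score = projection("boxing", affluence, vectors)
print(tennis_score, boxing_score)
```

On this reading, words like tennis or steak are not members of P; they are the words that get projected onto the dimension once it has been built from the pairs.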

QIXIN-LIN commented 7 months ago

In today's digital era, a significant amount of information and text is found on the internet, especially text related to emerging subcultures. The thriving nature of subcultures and the growing volume of data they generate provide a unique opportunity for research. Can we effectively use word embeddings as they are to analyze trends within specific subcultures? Or do we need to tailor our approach to account for the distinctive nature of subcultural language? Subcultures often have unique linguistic expressions, slang, and connotations that might not be widespread or might even have different meanings in general discourse.

So, what should we be aware of if we want to analyze trends within a specific sub-culture using Word Embeddings? Do we need to make adjustments to our methodologies to accommodate the unique linguistic characteristics of subcultures?

bucketteOfIvy commented 7 months ago

This paper studies relations along multiple theoretically known dimensions (e.g. affluence, gender, education, morality, etc) to better understand conceptions of class over time. The approach taken in this paper is powerful, particularly when handling known categories that influence culture.

However, at times there are unknown, unlabeled, or otherwise slippery categories that we might be interested in studying. As an example, "homosexuality" as an identity and solitary category is relatively new in Europe/America. Medieval Europe lumped it into the much broader category of "sodomy," which essentially included any sin relating to sex. Nonetheless, it is unlikely that there were not individuals who (in today's parlance) would identify as "gay" in medieval Europe, and it's also unlikely that those living in medieval Europe would not have ideas about same-sex attraction.

All in all, this makes the "same-sex versus opposite-sex attraction" dimension an interesting one, albeit one that might be impossible to depict through antonym-word pairs. Are there ways to unpack slippery dimensions like this one when using word embeddings? Or, in terms of the example, are there methods for separating the "same-sex vs opposite-sex attraction" dimension from the overarching dimension of "sodomy vs moral sexual conduct" given a sufficient corpus of texts from medieval Europe?

ana-yurt commented 7 months ago

I think this paper is a fascinating example of how large-scale textual analysis engages with prominent theoretical traditions. I find it convincing that the methodology captures meaningful cultural categories rather than simple biases. I am also impressed by the ability of high-dimensional space to capture nuanced, intersectional cultural associations and their variation over the century. I have some questions from my experimentation with word embeddings. 1) Without the resources for surveys, how do we choose and validate the 'antonym pairs' to operationalize cultural constructs such as class or morality, especially since there are many potential choices of words? 2) As mentioned in the paper, word embeddings tend to flatten out heterogeneity within a culture. Considering that some of us may be using corpora that are not representative samples of an entire culture, what insights can we gain from those "skewed samples" of cultural niches?

alejandrosarria0296 commented 7 months ago

The paper does a fantastic job of showing how operationalizing many of the social elements associated with class as binaries (gender as woman-man, race as white-black) can lead to interesting insights into the signifiers of class in those contexts. I'm curious about the possibilities of word embeddings if the dimensionality of the variables were expanded to cover a more complex view of certain constructs. Two contexts that come to mind where this may be relevant are the analysis of race in colonial South America, where race was conceived as a 3D space between Spanish, indigenous, and black; and signifiers of party identity in multiparty contexts such as most of the European Union and Latin America. How could word embeddings be used in these contexts? Is it the right method, or is the "binarity" of the variables a condition for its application?

muhua-h commented 7 months ago

Given our frequent discussions about novelty in lectures, I'm curious about Dr. Evans' perspective on the novelty of this study during its design and execution.

Upon reading the paper, I found myself oscillating between feelings of amazement and recognition of the obvious. My observations include:

The integration of embeddings with the conceptual understanding of social class was intriguing. However, the methodology of analyzing word distances using word2vec, introduced in 2013, felt familiar, considering this study was conducted five years later.
The use of time series analysis to observe temporal changes was enlightening. Nevertheless, some findings merely reinforced existing notions regarding the association of social class with factors like race, gender, and occupation.

As a researcher who both conducts and evaluates studies, Dr. Evans, how do you balance the use of ‘familiar’ methodologies with the generation of new insights in cultural dynamics, especially in the context of social class? Additionally, could you share any surprises or challenges you faced in situating this study within the expansive field of computational sociology and cultural analysis?

Marugannwg commented 7 months ago

I wonder if embeddings can be equally helpful when working on a corpus that is not as large as Google Ngram. If I'm understanding correctly, the sheer size of the project averaged out the noise in individual texts and revealed the consensus-level cultural biases that are prominently expressed across all texts. I have several basic questions to raise:

It is intuitive that "tennis" is closer to "rich" while "boxing" is closer to "poor". If an embedding from another corpus does not replicate this baseline, how should we interpret the difference?
If a corpus contains text from people with opposite opinions on a certain topic, will the embedding erase the topic they are actually debating and reveal only the points on which they agree?

yueqil2 commented 7 months ago

In the paper, the authors demonstrate a "statistically significant, positive association between human-rated associations and embedding projection on all dimension" (Figure 4). It seems that the word embedding model performs better for obvious features, preferably a spectrum with valence. Does this mean the model lacks utility in finding invisible, deep, counterintuitive relationships? Does it mean the researcher is supposed to have a solid theory or hypothesis before employing this model?

yuzhouw313 commented 7 months ago

Based on the technical definition of word embedding proposed in this research as "words sharing similar contexts within the text will be positioned nearby in the space" (p.910) and my previous experience using word embeddings, I have found that both synonyms and antonyms often appear close together in a word space because they are likely to appear in similar contexts ("This person is very elegant in manner" versus "this person is very crude in manner"). This makes me wonder whether the word embedding technique might introduce noise and potentially counter the findings on the semantic evolution of social class presented in this research, since not only will “doctor” and “lawyer” be positioned close to one another but so will "elegant" and "crude". Furthermore, how does the use of cultural dimensions solve this issue?
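A toy illustration of how both things can be true at once, offered as a sketch rather than as the authors' procedure: two vectors can be near neighbors overall (high cosine similarity) and still land on opposite sides of a dimension built from an antonym pair. All four vectors below are invented, and the "cultivation" dimension is a hypothetical stand-in for the paper's dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented 4-d vectors. The first three coordinates stand in for shared
# context ("describes a person's manner"); the last one carries the contrast.
elegant = [0.6, 0.5, 0.6, 0.3]
crude = [0.6, 0.5, 0.6, -0.3]

# Hypothetical dimension built from a separate antonym pair.
refined = [0.5, 0.6, 0.5, 0.4]
vulgar = [0.5, 0.6, 0.5, -0.4]
cultivation = [a - b for a, b in zip(refined, vulgar)]

similarity = cosine(elegant, crude)            # high: shared contexts
elegant_score = cosine(elegant, cultivation)   # positive side
crude_score = cosine(crude, cultivation)       # negative side
print(similarity, elegant_score, crude_score)
```

Subtracting the two poles cancels the coordinates the antonyms share, so the dimension keeps only the direction along which they differ. Whether real embeddings separate a given pair this cleanly is an empirical question.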

cty20010831 commented 7 months ago

I am very impressed by how word2vec can be applied to study the cultural meanings of class over time. One question I have relates to the methodology used to extract "meanings" from text. Specifically, why does standard semantic network analysis, which considers topological information alone, have trouble distinguishing between close and distant concepts as corpora grow larger? And are there any countermeasures (such as modified or more advanced versions of semantic network analysis) that can handle large corpora?

volt-1 commented 7 months ago

The finding that the association between education and affluence has grown stronger, while its association with cultural taste has weakened, suggests a significant shift in societal class perceptions. How can we interpret the changing role of education in the context of social class, and what impact does this have on the structure of modern society?

runlinw0525 commented 7 months ago

This study's use of word embeddings to unpack the cultural dimensions of social class opens up intriguing possibilities for analyzing other complex societal issues. For example, how might we apply word embeddings to explore changes in societal attitudes towards environmental issues? Could we analyze the shift in language surrounding terms like "global warming" or "sustainability" over time? Or, consider the realm of politics: could we use word embeddings to trace how political discourse has evolved, particularly around terms like "liberal," "conservative," or "populism"?

yunfeiavawang commented 7 months ago

This paper is impressive in linking social class and expression mode with the concept of affluence. It is enlightening to show that dimensions of word embedding vector spaces correspond closely to "cultural dimensions". I came up with the idea that we can use word embeddings to analyze the textual data from mainstream culture and subculture. Based on the comparison of the two corpora, we can see how subculture was generated from the mainstream culture and keeps evolving as a branch of it or even an opposite of it.

Audacity88 commented 7 months ago

The authors detail their use of five thesauri to create pairs of antonyms which cover both contemporary and historical usage, e.g. "swanky-basic" and "flush-skint". They also note that "A word’s nearest neighbors [in vector space] are often either its synonyms or syntactic variants." So, it seems it would be possible to create a thesaurus from word2vec by simply listing words nearby in vector space as synonyms (after eliminating syntactic variants). However, I wonder if the same is true for antonyms. Are the words farthest apart in the many-dimensional space (or on the surface of the hypersphere) generally antonyms? Or do antonyms only emerge when meaning is projected onto a single dimension, such as, in this case, class?

joylin0209 commented 7 months ago

This is a very interesting paper. Given the application of word embedding models to the empirical study of social class, I am curious whether some class-related issues may not be effectively captured or explained. Is the model able to provide comprehensive and accurate insights into the intertwined factors, or is it limited by a specific cultural background or context? Have some class-related issues been ignored because of the simplification of the model structure? How can we establish the scope and limitations of the model?

Vindmn1234 commented 7 months ago

I would be curious to ask the authors the following questions about their research: How do you account for the evolution of language and meaning when constructing cultural dimensions, given that words like "rich" and "poor" may change in connotation over time? Also, since embeddings like word2vec create a single embedding for each word regardless of the surrounding context, how do you ensure that the vectors used to represent cultural dimensions are robust across different contexts, given the context-dependent nature of word meanings? Lastly, could this methodology be extended to analyze other semantic categories, such as emotions or moral values, and their representation in cultural discourse?

ethanjkoz commented 7 months ago

I had a question regarding the section on surveys of cultural association. Perhaps this is naive, but I was curious why the range for such questions was 0-100. How much more information do you gain from this than from some other Likert scale like 1-7? Are respondents not more prone to choosing even numbers or numbers divisible by 5? Is there a meaningful difference between rating something as 95 vs 96? Furthermore, I agree with the authors' note that Google N-gram data is not culturally representative of the US population, only of the literary elite. How does Google NGrams decide which publications are added to the database? And are there any ways to study underrepresented populations using an approach similar to the one presented in this paper?

Brian-W00 commented 7 months ago

How do word embedding models demonstrate the stability of cultural dimensions over the 20th century, despite the continuous shift in class markers due to economic transformations?

anzhichen1999 commented 7 months ago

How can we apply the orthogonal projection of word vectors onto predefined cultural dimensions, such as gender or class, within word embedding spaces to quantitatively analyze and trace the changes in cultural biases and stereotypes in English language usage, especially when dealing with COHA corpus?
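One possible shape for such an analysis, as a sketch under stated assumptions: train one embedding per decade, rebuild the cultural dimension inside each decade's space from the same antonym pair, and project the target word onto it, which gives a time series of scores. Because each projection is computed within a single decade's space, this sketch does not align the spaces with one another. The `models` dictionary, its invented 2-d vectors, and the `trace` helper are hypothetical; real decade models would come from training on a corpus such as COHA.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(word_vec, dim_vec):
    """Scalar projection of a unit word vector onto a unit dimension vector,
    which equals their cosine similarity."""
    return sum(a * b for a, b in zip(unit(word_vec), unit(dim_vec)))


# Hypothetical per-decade models: decade -> {word: vector}.
# The numbers are invented and are not results from any corpus.
models = {
    1900: {"rich": [0.9, 0.1], "poor": [-0.8, 0.2], "college": [0.1, 0.9]},
    1950: {"rich": [0.8, 0.2], "poor": [-0.8, 0.2], "college": [0.4, 0.8]},
    1990: {"rich": [0.8, 0.1], "poor": [-0.7, 0.3], "college": [0.7, 0.6]},
}


def trace(word, positive, negative, models):
    """Project `word` onto the positive-minus-negative dimension, per decade."""
    series = {}
    for decade, vecs in sorted(models.items()):
        dim = [a - b for a, b in zip(unit(vecs[positive]), unit(vecs[negative]))]
        series[decade] = project(vecs[word], dim)
    return series


affluence_trend = trace("college", "rich", "poor", models)
for decade, score in affluence_trend.items():
    print(decade, round(score, 3))
```

In practice one would average over many antonym pairs rather than a single pair, and check how sensitive the trend is to the choice of pairs.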

erikaz1 commented 7 months ago

Kozlowski et al. assert that the dimensions of embedding models encode meaningful interplay between and within cultural categories, "rather than simply biases, distortions, or deficits in the semantic system". How do we fundamentally decide these dimensions of embedding models? Do we reproduce cultural categories by making particular choices? Are these dimensions true beyond or more salient within academia vs. in other contexts?

HamsterradYC commented 7 months ago

While word embeddings can quantitatively map relationships between words, the interpretation of what these relationships signify culturally is subjective. How do we understand the risk of bias in how these dimensions are understood or labeled, and how can we avoid it?

beilrz commented 7 months ago

I think this is a very interesting study. However, as I understand it, word embeddings merely express the meaning of a word in relation to the meanings of other words in the same period, and the meanings of these reference words may not be stable. For example, if a word is related to "gay", that may have meant happy in the past but homosexual in the present, although I believe this issue is less of a concern if we use multiple reference words.

chenyt16 commented 7 months ago

In Figure 9, we can see that the correlation of common words' projections for the seven cultural dimensions of class decreases in the 20th century. I wonder if there are any potential explanations for this, since employment, education, status, and other dimensions are still critical to social class nowadays. Can word embedding models help us explore the underlying mechanism? In other words, is the word embedding model more helpful in testing predictions or hypotheses than in discovering "surprising facts"?

YucanLei commented 7 months ago

The paper does an amazing job of analyzing the complexities of social categories, cultural dimensions, social interactions, and institutional practices. However, we know these factors are not only intertwined but also influence and shape each other, which does not seem to be extensively addressed in the paper. Also, how should some of the paper's findings be applied in practice and in social work?

Dededon commented 7 months ago

This is very interesting research that combines sociological question design with word embedding methods that are well established in the NLP literature. The researchers use well-defined dictionaries of antonym pairs to identify the dimensions of class difference. I'm only a little skeptical about the race dimension of white and black in the survey. Adding pairs of antonyms together could create a really weird dimension, but at least the results seem interpretable in this case.

Caojie2001 commented 7 months ago

It's an interesting article that applies word embedding methods to quantitatively analyze sociocultural structure. Drawing on my own experience with word embedding methods, how should we treat patterns discovered without an academic context?

michplunkett commented 7 months ago

Added algorithmic complexity can produce more sensitive and informative models, but it may also diminish the researcher's understanding of how the model is generated and what distortions it is likely to produce.

I appreciated the content of the paper and realize that it's created explicitly for the eyes of academics in its current form, but am left wondering how can you convey the content of studies like this without requiring the reader to have a thorough understanding of ML's technical jargon. Even after several quarters of navigating papers using linear algebraic and data science vocabulary, I am left a little more overwhelmed by the phrasing of the paper than I'd care to admit. How can researchers effectively ride the line of truthfully conveying their research findings while at the same time making their papers/work accessible to greater than 2-3% of the general population?

Carolineyx commented 7 months ago

This provides a fresh perspective on the analysis of culture. I'm curious about how this method can capture the evolving meanings of the same words or the emergence of new phrases and word combinations that may deviate significantly from their original meanings but become highly popular in certain years. Could this introduce noise into the analysis, or would it be manageable by incorporating time as an additional variable?

JessicaCaishanghai commented 7 months ago

The paragraph discusses the utilization of word embedding models as a valuable tool for studying culture, using a historical analysis of shared understandings of social class as an empirical case. How do word embedding models effectively capture and represent the dynamic shifts in markers of social class over a century, as discussed in the study?

floriatea commented 6 months ago

Really helpful and comprehensive analysis of cultural insights on word embeddings! Despite significant economic transformations over the twentieth century, the study finds that the basic cultural dimensions of class remained remarkably stable. What does this reveal about the relationship between cultural perceptions of class and actual economic conditions, and how might this influence our understanding of social mobility and class dynamics? Also, the paper notes a notable exception in the association of education with affluence, which became more tightly linked over time. How does this shift affect the societal value placed on education, and what implications does it have for understanding the evolving landscape of social stratification? Does that suggest we should lower the cost of education and make it more affordable for everyone?

Vindmn1234 commented 6 months ago

How do the findings from your analysis of class through word embeddings contribute to our broader understanding of how economic and social transformations influence language and cultural perceptions over time?
