lkcao opened 6 months ago
Two Intuitions: (1*) Games developed by larger companies will have reviews that cite those companies more frequently, and discuss them in a more negative light. (2+) Different genres of games will have reviews that differ in their focus on technical vs. content-based aspects of the games. Technical details will more often be negative while content details will more often be positive, meaning content-focused games (i.e., those focused on broad and original content as opposed to better performance) will be favored in the Steam market (as measured by reviewed quality).
Dataset: For all intuitions, I use this dataset scraped from Steam, which also includes the code for getting player counts. More games can be scraped in this manner, though the number of reviews scraped per game should probably be reduced substantially when doing so. The dataset contains all reviews for any game on Steam that is specified (or spidered, though I haven't done that yet). And, as stated, the other code retrieves player counts for any specified games.
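In case it helps others reuse this, here is a minimal sketch of how review pages can be requested from Steam's (unofficial) appreviews endpoint; the endpoint and parameter names reflect my understanding of that API and should be checked against the actual scraper code:

```python
# Sketch: building paginated requests against Steam's appreviews endpoint.
# Endpoint and parameter names are assumptions about Steam's unofficial
# review API; verify against the scraper code linked above.
from urllib.parse import urlencode

BASE = "https://store.steampowered.com/appreviews"

def review_url(appid, cursor="*", num_per_page=100):
    """Build one page request; pass the `cursor` value from the previous
    response's JSON to fetch the next page."""
    params = {
        "json": 1,
        "filter": "recent",
        "num_per_page": num_per_page,
        "cursor": cursor,  # "*" requests the first page
    }
    return f"{BASE}/{appid}?{urlencode(params)}"
```

Looping until the returned `cursor` repeats (or `reviews` comes back empty) retrieves every review for an app, and is also the natural place to cap the review count per game.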
Intuitions:
Countries with similar positions in certain areas (or on all issues) can be identified from the transcripts of statements made by representatives at the United Nations, the content of draft resolutions, and the voting results.*
Countries that have similar ideas on an issue are more likely to have similar cultural backgrounds, national strengths, etc. Small island states, for example, are concerned about climate change.+
Dataset:
The UN Digital Library can be found there.
Intuition 1*: Fanfiction stories with negative themes or conflicts garner more attention and engagement compared to those with more traditional, fairy-tale-like plots. This preference has become more pronounced in recent years.
Intuition 2+: Fan-fiction narratives are increasingly adopting similar plots and themes - there is a noticeable trend of decreasing innovation and novelty in fan-fiction. This could be attributed to writers being influenced by other fan-fiction works, leading them to replicate popular themes and plots. Alternatively, it may reflect a growing preference within the community for familiar themes, possibly due to a reluctance to experiment with innovative ideas that might not be as well-received.
This trend, if confirmed, would not only be surprising but also significant, as it challenges the commonly held perception of fan-fiction (participatory culture) as a highly creative and diverse domain.
Intuitions:
The textual content of dating-app profiles (e.g., self-introduction, hobbies, future goals) is associated with demographic information (e.g., age, gender, religion). + The topics of that textual content are predictive of users' demographic information. * If we treat demographic information as labels, we could fine-tune a pre-trained LLM on the dating-profile generation task, although evaluating such a fine-tuned model can be tricky.
Dataset: The dataset: https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles/data.
Possible intuitions:
Congressional legislation regarding abortion: link
Intuitions: (1) * Abortion-related news coverage is often framed according to the outlet's political or ideological perspective. Conservative outlets might emphasize aspects like the rights of the unborn or religious perspectives, while liberal outlets may focus more on women's rights and personal autonomy. (2) The way abortion is covered can vary based on the geographical location and cultural context of the media outlet (e.g., religion, political preference). (3) Coverage might vary based on the political climate, with heightened attention during key legislative debates or elections.
The dataset and script I used to acquire them can both be found here.
The database is small right now because I faced some challenges in scraping text from dynamic web pages. For example, if I want to scrape more than 10 news pieces, I need to click the "load more" button on the search results page, but I don't know how to do that...
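For the "load more" problem, one common approach is to drive a real browser with Selenium and click the button in a loop. A sketch (the button locator and page structure are assumptions to adapt to the actual search page; the selenium import sits inside the function so the small helper still runs without it):

```python
# Sketch: expanding a dynamic search page by repeatedly clicking "load more".
# Button text/locator are placeholders; adapt to the real page.
import math

def clicks_needed(total_results, per_page=10):
    """How many "load more" clicks reveal all results, assuming the first
    page already shows `per_page` items."""
    return max(0, math.ceil(total_results / per_page) - 1)

def load_all_results(url, button_text="Load more", max_clicks=50, pause=1.5):
    """Open a dynamic search page, click its load-more button until it
    disappears, then return the fully expanded HTML."""
    import time
    from selenium import webdriver  # pip install selenium
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes a chromedriver on PATH
    driver.get(url)
    for _ in range(max_clicks):
        try:
            button = driver.find_element(
                By.XPATH, f"//button[contains(., '{button_text}')]")
        except Exception:
            break  # no button left: everything is loaded
        button.click()
        time.sleep(pause)  # give the new items time to render
    html = driver.page_source
    driver.quit()
    return html
```

The returned HTML can then be parsed exactly as a static page would be.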
Intuitions:
The comments dataset can be found here
Intuitions: 1.+ Trans users of 4chan will have a large emphasis on mental aspects of womanhood (i.e. "thinking" like a woman), with many of the described aspects being describable in terms of habitus. 2.* For a forum nominally for lgbt people in general, /lgbt/ will have a disproportionate amount of discussion about and from trans people, particularly trans women.
Dataset: 4chan posts data scraped periodically from board.4chan.org/lgbt/. Script here. I really want to move this to Midway so I can automate collection (instead of just collecting posts whenever I'm at my computer and think about it).
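If cron is available on the login node, automating the collection is a one-line crontab entry (paths and the six-hour interval here are placeholders; if Midway disallows cron, a Slurm job that resubmits itself accomplishes the same):

```shell
# Run the scraper every 6 hours and append output to a log
# (install with `crontab -e`; paths are placeholders)
0 */6 * * * python3 /home/$USER/scrape_lgbt.py >> /home/$USER/scrape.log 2>&1
```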
Greater Negative Sentiment in Chinese Version: There might be a more pronounced negative sentiment towards the US government in the Chinese version compared to the foreign version. This could be due to differing editorial policies and audience targeting. *
Neutral or Positive Sentiment in Foreign Version: The foreign version of People's Daily might exhibit a more neutral or even positive sentiment towards the US government, potentially as a strategy to present a more balanced view to an international audience.+
Dataset: People's Daily Chinese version and foreign version. Chinese: https://github.com/prnake/CialloCorpus Foreign: https://github.com/702036240/Spider-People-s-daily
Users with diverse cultural backgrounds will display distinct language patterns and word choices in their profiles. This could be reflected in the use of specific phrases, idioms, or cultural references. (*)
The age of users might significantly influence the type of slang and language style used in their profiles. Younger users might employ more contemporary slang and internet jargon.
There will be notable differences in the expression of hobbies and interests based on geographic location. For example, users from coastal areas might mention more water-related activities. (+)
Dataset Description:
The dataset is from dating-app user profiles, specifically their self-introductions or bios. These texts are rich sources of personal information, including religious beliefs, lifestyle, interests, hobbies, and more.
*1. In terms of perceived cultural dimensions like violence, morality, and trustworthiness, Hui ethnic groups sit in between Han and Uyghur in the Chinese-language corpora. +2. Certain areas of representation may deviate from conventional cultural association due to the persistence of political language
Data: https://drive.google.com/file/d/1abO2GPDHMmXw6tSrx3f5eXg6lZLtK0Yp/view?usp=sharing Tencent Baseline: https://ai.tencent.com/ailab/nlp/en/download.html
Intuitions:
*+Firstly, I believe that the sentiment of adoptees' posts in adoptee-oriented subreddits will be more negative than that of adoptees in more general adoption subreddits. I also theorize that posts expressing negative sentiment towards adoption will garner a higher voting score in adoptee-oriented subreddits than those in a more general adoption subreddit. One portion of the data is available online as a JSON file (https://the-eye.eu/redarcs/; search for r/adoption in the search bar). The other portion I can send via Dropbox or some other file-sharing service if necessary.
Patterns to observe
Dataset
I intend to use the Semantic Scholar API to scrape psychology papers from the past 10 years. Here is the sample code to extract and clean the data. This is the link to the National Science Foundation site for extracting funding information.
Intuitions: 1*. Since my corpus was retrieved from feminist professional social media communities, I think the attitude towards male community members over time could evolve in a specific pattern. For example, when the group "Women in Social Science" began to exclude all the male members, the attitude toward males could be aggressive, and thus influence other communities in the same niche to reflect on gender issues in group management. Co-evolution or ignition of aggressive affect could happen.
Dataset: https://drive.google.com/drive/folders/1uy3XpwctnhN_IwSbyDNkT9mv7ohtq0sK
I've been surprised by the results of topic modeling the dataset of Reddit posts I am using. Since the dataset consists of two categories of posts, depressed and non-depressed, I thought 2 clusters (left) would definitely be optimal. However, 3 (right) is actually much better according to the metrics:
For 2 clusters: Homogeneity: 0.118, Completeness: 0.137, V-measure: 0.127, Adjusted Rand Score: 0.123
For 3 clusters: Homogeneity: 0.381, Completeness: 0.256, V-measure: 0.306, Adjusted Rand Score: 0.358
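For anyone wanting to sanity-check numbers like these, homogeneity, completeness, and V-measure are simple entropy ratios. A pure-Python sketch with toy labels (this mirrors what sklearn's `homogeneity_completeness_v_measure` computes):

```python
# Entropy-based clustering metrics (pure-Python sketch;
# sklearn.metrics.homogeneity_completeness_v_measure does the same).
from collections import Counter
from math import log

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _conditional_entropy(labels, given):
    """H(labels | given): entropy of labels within each group, weighted
    by group size."""
    n = len(labels)
    groups = {}
    for lab, g in zip(labels, given):
        groups.setdefault(g, []).append(lab)
    return sum(len(v) / n * _entropy(v) for v in groups.values())

def v_measures(truth, pred):
    """Return (homogeneity, completeness, V-measure) for a clustering."""
    h_c, h_k = _entropy(truth), _entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1 - _conditional_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1 - _conditional_entropy(pred, truth) / h_k
    v = 0.0 if homogeneity + completeness == 0 else \
        2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v
```

A perfect clustering scores 1.0 on all three; splitting every true class evenly across clusters drives homogeneity to 0.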
It seems like the really depressed people are the purple ones in both charts. 3 clusters being better seems to imply that the difference between the red and green groups is significant -- maybe as significant as that between either "normal" group and the depressed people. But who are the red and green groups? The green group cares (or talks) about friends a lot. A couple of their top words are "boyfriend" and "hang". Sounds like young people! Meanwhile, the reds talk about family, home, and work, and among their top words are "husband" and "old", so they are definitely older.
My question is, what will happen when you apply the same clusters to a (relatively) normalized data set like COCA? My hypotheses: *The purple group will represent a small fraction of the overall population, but one that has grown over time. If you go back in time, the purple group will fade away, indicating that widespread depression is a modern phenomenon. Meanwhile, the red group will get larger, since older people represent more conservative attitudes. +However, there will be a constant emergence of a "new" red group as we go back in time. The older generation will become young compared to their forebears.
candor corpus: pairs that talk about family or politics are more likely to have either notably more or notably less flowing conversations - i am thinking more animated, with fewer filler words and pauses + pairs that talk about neutral topics have more stilted conversation *
booth studies: conversations that are marked as 'dialogue' will have more filler words * conversations marked as 'debate' will have more individual topics and more personal topics +
candor corpus is available for download on request from here: https://www.science.org/doi/10.1126/sciadv.adf3197 still collecting the other data from the booth studies - i am not sure if the professors would give me permission to share, will check for next week!
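A quick sketch of the filler-word measure behind these intuitions (the filler list is a guess and would need tuning against the actual CANDOR transcripts):

```python
# Sketch: filler-word rate per transcript. The filler list is a guess;
# tune it against the real transcripts.
import re

FILLERS = {"um", "uh", "like", "so", "well", "hmm", "erm"}

def filler_rate(transcript):
    """Fraction of tokens in a transcript that are filler words."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    return sum(t in FILLERS for t in tokens) / len(tokens)
```

Comparing this rate across topic-labeled conversation pairs would directly test the "more animated, fewer fillers" hunch.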
Here are some content patterns I am considering for my project on tracking perspectives and discourse on “CRT”.
I may be able to map a cultural shift in the fundamental use of the phrase CRT during the period of 2020-2022 over various types of media. To begin, I may associate the term with other adjectives over time. I highly expect to see that the phrase “CRT” holds a specific, unique connotation in the popular-news media over a particular time period (not surprising).
Patterns could be identified in how certain individuals, groups, and public intellectuals have a disproportionate impact on framing narratives and influencing public perception. It may be interesting to explore how long these entities hold influence over a portion of society and whether/when shifts in perceptions occur.
I am still using the S2ORC dataset (a database of millions of journal articles across disciplines; I will be focusing on social science articles) (https://github.com/allenai/s2orc) and the NOW dataset (a continuously updating collection of news articles with billions of words): https://www.english-corpora.org/now/.
Intuition: Games that are stitched together from different genres are more likely to receive reviews from different perspectives, and perhaps more extreme ones.
Some games, depending on specific cultural norms or the particular way the game is delivered, may receive surprising review keywords (e.g., Pokemon-like games being described as rediscovering slavery).
Dataset: For all intuitions, I use this dataset scraped from Steam, which also includes the code for getting player counts. More games can be scraped in this manner, though the number of reviews scraped per game should probably be reduced substantially when doing so. The dataset contains all reviews for any game on Steam that is specified (or spidered, though I haven't done that yet). And, as stated, the other code retrieves player counts for any specified games.
Possible intuitions:
Congressional legislation regarding abortion: Link
My biggest intuition is that my corpus is not suitable for clustering: even though the algorithm tries its best to form clusters, the scores are low and the clusters carry no intuitive information.
Words that changed the most from the cleaned telehealth corpus from 2017 to 2023:
Words like 'pacifica', 'persistence', 'analytics', 'evangelist', 'telecardiology', 'mantra', 'enduser', 'wise', and 'devoted' are among those that have shown significant semantic shifts. These words likely represent emerging concepts, technologies, or trends that have gained prominence or evolved in meaning during the period studied. For instance, 'analytics' and 'telecardiology' might indicate a growing focus on data analysis and remote healthcare, respectively. 'Evangelist' in a modern context often refers to someone who promotes a particular technology or innovation, which could suggest an evolving role in the tech or business sectors. Data is from purchased NOW data from 2017-2023
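The shift scores behind a list like this reduce to cosine distance between a word's vectors in the two time slices, assuming the 2017 and 2023 embedding spaces have already been aligned (e.g., via orthogonal Procrustes). A sketch with toy vectors:

```python
# Sketch: ranking words by semantic shift = 1 - cosine(vec_2017, vec_2023).
# Assumes the two embedding spaces were already aligned (e.g., orthogonal
# Procrustes); vectors below would come from the two time-sliced models.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def shift_ranking(vecs_old, vecs_new):
    """Words shared by both models, sorted by shift score, largest first."""
    shared = vecs_old.keys() & vecs_new.keys()
    scores = {w: 1 - cosine(vecs_old[w], vecs_new[w]) for w in shared}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Words like 'telecardiology' would be expected near the top of such a ranking, while stable vocabulary sits near zero.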
Intuitions:
Sentiment Analysis: I expect to observe varying sentiment patterns in the comments, reflecting readers' emotions and opinions towards the articles' topics and contents.
+Topic Engagement Over Time: The most surprising finding would be a significant shift in engagement levels across different topics over time, indicating evolving reader interests and societal trends.
Description of Dataset:
I will be exploring these intuitions using a dataset containing comments made on articles published by the New York Times. The dataset covers the periods of January to May 2017 and January to April 2018. It consists of two CSV files: one containing information about the articles and another containing information about the comments made on those articles.
(a) Link to the data: New York Times Articles and Comments Dataset
Intuitions: *The prevalence of burnout language will be higher in tweets from user accounts that frequently post about work and productivity.
+ Posts containing expressions of burnout will receive more engagement (likes, retweets) than typical tweets, indicating a communal resonance with the sentiment of burnout. This would be surprising as it suggests a collective acknowledgment and shared experience of burnout.
To explore these intuitions I will download several subreddits and create a large corpus of posts filtered by hashtags or keywords associated with work, productivity, and burnout.
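A minimal sketch of that keyword filter (the keyword list is illustrative, not final):

```python
# Sketch: filtering a dump of subreddit posts down to a burnout corpus.
# Keyword list is illustrative and would be expanded for the real corpus.
KEYWORDS = {"burnout", "burned out", "exhausted", "overworked", "productivity"}

def filter_posts(posts, keywords=KEYWORDS):
    """Keep posts whose text mentions any keyword (case-insensitive)."""
    kept = []
    for post in posts:
        text = post.lower()
        if any(k in text for k in keywords):
            kept.append(post)
    return kept
```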
1 *. There was a decrease in users' sentiment after COVID-19 in different Reddit communities
Narrative Complexity and Psychological Richness: My primary intuition is that stories with higher levels of narrative complexity and psychological richness—characterized by unexpected encounters, emotional depth, and detailed descriptions—correlate with a greater sense of relationship fulfillment and resilience. This intuition is grounded in the notion that complex narratives may foster a stronger, more nuanced bond between partners, potentially influencing the longevity of their relationship.
Initial Meeting Context and Marital Stability: A secondary, yet intriguing, intuition is that the context of a couple's initial meeting (e.g., through unexpected circumstances vs. traditional settings) might impact marital stability.
To explore these intuitions, I will build an embedding model on the dataset comprising 206 'how they met' stories collected from The New York Times wedding announcements, published between 2006 and 2010. This dataset, enriched with human and machine ratings of narrative interestingness and psychological richness, offers a fertile ground for examining the nuances of romantic inception narratives.
Due to the sensitive nature of the personal stories and the proprietary constraints of the data collection method, the dataset cannot be made publicly available.
Post your response to our challenge questions.
First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction--to condition your surprise. Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).