3. Clustering & Topic Modeling to Discover Higher-Order Patterns of Meaning -Challenge

lkcao commented 10 months ago

Post your response to our challenge questions.

First, write down three intuitions you have about broad content patterns you will discover in your data. Place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Second, describe the dataset(s) on which you will build an unsupervised model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, or (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise unsupervised strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

sborislo commented 10 months ago

Three Intuitions: (1) When parsing for three clusters of similar reviews, the three clusters will be identifiable as (i) technical/mechanical details, (ii) content/story/experiential analysis, and (iii) joke reviews of some sort. (2+) Even if (1) is unevidenced, the truly observed clusters will differ in their prominence across game categories and/or popularities (number of total players). (3*) Less serious/informative reviews will be made for cheaper games.

Dataset: For all intuitions, I will use this dataset scraped from Steam, which also includes the code for getting player counts. More games can be scraped in this manner, though the number of reviews scraped should probably be reduced substantially when doing so. The dataset contains all reviews for any game on Steam that is specified (or spidered, though I haven't done that yet). As stated, the other code gets player counts for any specified games.

cty20010831 commented 10 months ago

Patterns to observe:

Overall, quantitative methods (e.g., machine learning and deep learning) have been increasingly mentioned and used in psychology papers over the past 10 years.
There is increasing interdisciplinary collaboration between psychologists and scholars from physical science fields (e.g., data science and computer science).
Some subfields of psychology (e.g., cognitive psychology) are more likely to show the previous two trends than other subfields such as positive psychology and developmental psychology.

Dataset: I intend to use Semantic Scholar API to scrape psychology papers over the past 10 years. Here is the sample code to extract and clean data.

chanteriam commented 10 months ago

Possible intuitions:

Over time, rights and considerations for a person seeking abortion or reproductive health care will be supplanted by the rights and considerations of their fetus.
In SCOTUS decisions, there will be less desire to make a definitive decision regarding abortion, whereas more opinionated and morality-based abortion arguments will be made in local/state legislation.*
Exceptions made for abortion will over time reveal the biological attributes the law values/lawmakers value (and don't value) in human beings, such as allowing abortions for detected developmental disabilities (eugenics-esque).+

Congressional legislation regarding abortion link

Twilight233333 commented 10 months ago

Intuitions:

Some software now uses algorithmic recommendation mechanisms, leading to an increasing focus on specific topics*

Algorithmic mechanisms repeatedly filter comments, causing people to become more and more convinced of their own opinions+

The algorithm mechanism will automatically generate reply suggestions, resulting in homogenization of people's comments and replies

Cluster analysis can be used to study the categories of followers that users have on social platforms such as Twitter

Caojie2001 commented 10 months ago

Intuitions:

The popularity of certain topics in newspapers may have a predictable temporal pattern that is embedded into existing orders of political agenda.
The topical similarity between newspapers published by local governments and those published by central governments is shaped by particular political events rather than staying constant.
For local newspapers, articles related to certain locations may have different sentiment patterns, considering the relationship between the location and the government that publishes the newspaper.

The dataset for analysis can be obtained from the websites of newspapers such as Xin Min. Here is an example of data scraping.

yuzhouw313 commented 10 months ago

Intuitions:

1. The comments are likely to evolve from early curiosity about the virus's origins and travel bans, with the key term "China," to later discussions of protective measures/policies and rising infection counts, with the key term "U.S."
2. Comments may initially include topics seeking COVID-19 facts, later shifting towards misinformation and conspiracy theories regarding masks, vaccines, and official data.
3. Topics regarding social distancing might shift from criticism and anger toward quarantine to adaptation and new working modes (working from home).

The comments dataset can be found here

Audacity88 commented 10 months ago

Expected patterns:

*Depressed people talk more about themselves, and less about their friends and social groups. A classifier will be able to identify depression or lack of purpose in Reddit posts based on this and/or more subtle linguistic clues, and classify posters as depressed or non-depressed.
If the classifier is able to provide a continuous value for "purposiveness", rather than a binary prediction of depressed/non-depressed, this will show a spectrum of purposiveness, with some people having it to extreme degrees (depression/absolute fulfillment?) but most falling in the middle.
+Applying this same classifier to a corpus of writing over a large span of time (COCA, Google Books) will show a decrease in average purposiveness over the past 1-2 centuries.

Data set: For the classifier, Reddit posts from depressed and non-depressed users are provided by a previous study. If more data is needed, can also use the Pushshift archives. For the historical comparison, Google Ngrams and the COCA.

QIXIN-LIN commented 10 months ago

Intuition 1*: In the realm of fanfiction, newer works tend to be less engaged with, possibly due to a general trend towards shorter attention spans. This could manifest in shorter fanfics receiving more kudos and hits in recent times compared to longer ones. This intuition is significant because it speaks to changing reader preferences and behaviors in online literature communities. If true, it could indicate a broader shift in content consumption patterns.

Intuition 2: Fanfiction stories with negative themes or conflicts garner more attention and engagement compared to those with more traditional, fairy tale-like plots. This could be a surprising insight to the research community. It challenges the common perception that audiences prefer more positive or escapist narratives, especially in fan-created works.

Intuition 3: The amount and nature of comments received by fanfiction authors can positively influence their writing frequency and ability to complete works, even in a non-profit setting. This intuition focuses on the social aspect of fanfiction communities. It suggests that community feedback is a significant motivator and support mechanism for content creators.

AO3 Dataset

donatellafelice commented 10 months ago

Three Intuitions: (1) Transcripts of the experiments show there are linguistic patterns that are more effective in rebuilding trust (experiments already run by Booth) (2+) People presenting to a public forum also use specific language when they feel they are distrusted by their audience (3*) Historical data will show that public presentations in situations where there is distrust (speaker distrusted by audience) are influenced heavily by popular culture (closer to movie/TV dialogue than real speaking)

Dataset: I am currently waiting for data on these studies that have already been run to be shared. After I receive the data, I will put together a historical corpus to compare it to. I propose to use public presentations from the CDC during COVID and also shareholder meetings of publicly traded companies after major scandals etc., and compare them to the CANDOR corpus for real speaking (https://www.science.org/doi/10.1126/sciadv.adf3197) as well as the TV and movie databases we have.

anzhichen1999 commented 10 months ago

Greater Negative Sentiment in Chinese Version: There might be a more pronounced negative sentiment towards the US government in the Chinese version compared to the foreign version. This could be due to differing editorial policies and audience targeting. *

Variation in Sentiment Over Time: Significant fluctuations in sentiment towards the US government across different decades, potentially reflecting the changing political and economic relations between China and the United States.

Neutral or Positive Sentiment in Foreign Version: The foreign version of People's Daily might exhibit a more neutral or even positive sentiment towards the US government, potentially as a strategy to present a more balanced view to an international audience.+

Dataset: People's Daily, Chinese version and foreign version. Chinese: https://github.com/prnake/CialloCorpus Foreign: https://github.com/702036240/Spider-People-s-daily

Vindmn1234 commented 10 months ago

Intuitions:

Prevalence of Technology-Related Skills (*): A significant portion of job listings will emphasize technology-related skills, reflecting the growing demand for tech proficiency in various industries.
Rise in Remote Work Opportunities: If true, the notable presence of remote work opportunities in job listings would be a major revelation, indicating a substantial shift in work culture, particularly important for the research community focusing on labor market trends.
Diversity and Inclusion Initiatives: An increasing number of companies will mention their commitment to diversity and inclusion within their job listings, reflecting a broader societal shift towards these values.

Dataset: I have two separate datasets of LinkedIn job listings in the U.S. One was obtained directly from Kaggle and records 30000 job listings from 2019. The other is being collected manually by web scraping and records job listings from December 2023 and January 2024. This one is still ongoing, and I only have 3000+ job listings so far. The scraping process is quite time-consuming because I have to add a random sleep time to each iteration when using selenium in order to avoid being blocked by the site. Check out the data available here
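The random sleep between iterations mentioned above could look roughly like the sketch below. This is only an illustration, not the poster's actual script: the function names `random_delay` and `scrape_all`, the 2 to 6 second bounds, and the `fetch` callback are all assumptions, and the real page loading would be done by selenium inside `fetch`.

```python
import random
import time


def random_delay(min_s=2.0, max_s=6.0, rng=random):
    """Return a delay in seconds drawn uniformly from [min_s, max_s]."""
    if min_s < 0 or max_s < min_s:
        raise ValueError("require 0 <= min_s <= max_s")
    return rng.uniform(min_s, max_s)


def scrape_all(urls, fetch, min_s=2.0, max_s=6.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` is a hypothetical callback (e.g. a wrapper around a selenium
    driver); `sleep` is injectable so the loop can be tested without waiting.
    """
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:  # no need to wait after the last page
            sleep(random_delay(min_s, max_s))
    return results
```

Passing a different `sleep` function (for example, one that only records the delays) makes it possible to check the pacing logic without actually pausing.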

bucketteOfIvy commented 10 months ago

Intuitions:

1.+ Trans users of 4chan will have a large emphasis on mental aspects of womanhood (i.e. "thinking" like a woman), with many of the described aspects being describable in terms of habitus.
2. /lgbt/ will have an ongoing conflict between incel-esque views, transphobic views, and more "main line" views about gender from transgender people.
3.* For a forum nominally for lgbt people in general, /lgbt/ will have a disproportionate amount of discussion about and from trans people, particularly trans women.

Dataset: 4chan post data scraped periodically from board.4chan.org/lgbt/. WIP script here; I plan to flesh this out so I can periodically call it from either one of the Midway clusters or from a Google Colab notebook, allowing for a better sample of data.

ethanjkoz commented 10 months ago

1.*+ The sentiment towards adoption in r/Adopted and r/Adoptees will be significantly different. r/Adopted will be more positive or neutral, while r/Adoptees will be more negative.
2. Users who express negative sentiment in r/Adopted will have more positive post scores (upvotes) than users who express the same in r/Adoption
3. Posts with extremely negative or positive sentiment will garner more comments and user interaction.

Source: For now, I have been using an archived version of r/Adoption (from https://the-eye.eu/redarcs/), and plan to scrape r/Adopted myself but

Marugannwg commented 10 months ago

My research questions are about the social phenomena around the consumption of "Waifu games" -- those free-to-play, primarily mobile games that originated from Japanese anime/manga aesthetics; the key feature is that the entire development of game content revolves around selling the characters portrayed (often through gacha pulls)

Intuitions:

*Despite the global audience of the game, it's likely there would be an over-representation of Asian/Japanese cultural elements, manifested in character settings, plot themes, moral values, etc.
There might be reinforced (gender) stereotypes around characters to emphasize the conflicts/drama in a smooth, widely accepted narration.
+To target niche appeals, some popular/main characters (for sale) may be given traits unlike (or subverting) the traditional hero archetypes observed in other types of games or media.

Sample data: Full game scripts from Arknights

ana-yurt commented 10 months ago

Expected patterns:

*1. Zhihu posts discussing different ethnic groups will frame their discussion around distinct sets of topics
2. Over the past several decades, cultural representations of Sino Muslims (Hui) saw a shift from seeing them as belligerent to recasting them as well-integrated model minorities.
+3. Zhihu users engage in a disproportionate amount of discussion on the gender dynamics between Han and non-Han groups.

Source: Data is scraped based on topics. Due to the speed of collection (Xinjiang data took 3 nights), I have yet to scrape the Sino Muslim topic. A link to a portion of the scraped Zhihu data (Xinjiang) is here: https://github.com/ana-yurt/Content-Analysis-Textual-Data/blob/main/zhihu_answer_content_xinjiang_3.csv

ddlxdd commented 10 months ago

The most prevalent mood descriptors will be polar opposites, such as "high" and "low," reflecting the bipolar nature of the disorder.
The language used in the forum will have a significant amount of emotional and psychological terminology, indicating a community that is knowledgeable about their condition and the science behind it.
The data may reveal an underlying pattern of specific triggers or events that precede a change in mood states (from "high" to "low" or vice versa). If this pattern is strong, it could be of great importance to the research community.

Here is the link to the forum where I am planning to scrape the data: psych forum

chenyt16 commented 10 months ago

Three intuitions: (1) Media outlets often frame abortion-related news according to their political or ideological perspectives. Conservative outlets might emphasize aspects like the rights of the unborn or religious perspectives, while liberal outlets may focus more on women's rights and personal autonomy. (2) The way abortion is covered can vary based on the geographical location and cultural context of the media outlet (e.g., religion, political preference). (3) Coverage might vary based on the political climate, with heightened attention during key legislative debates or elections.

I tried to use Davies' News on the Web (NOW) corpus (https://www.english-corpora.org/now/), but it didn't cite the source of each piece of news very well. So I will probably scrape the news myself, and I need some more time to get it prepared.

volt-1 commented 10 months ago

Intuitions:

Users' religious beliefs might influence their language usage and content in their profiles. For example, religious individuals might use fewer swear words or slang. (*)
Users with strong religious beliefs might more frequently mention family, moral values, or spiritual beliefs in their profiles. (+)
Users' level of education and occupation might correlate with how they express their religious beliefs, such as highly educated users preferring specific vocabularies to implicitly express their faith.

Dataset Description: The dataset we plan to use consists of a collection of dating-app user profiles, specifically their self-introductions or bio (essays). These text data may contain clues about users' religious beliefs, lifestyle, interests, hobbies, and other personal information.

muhua-h commented 10 months ago

Intuitions:

The textual content of dating app profiles (e.g., self-intro, hobbies, future goals) is associated with users' demographic information (e.g., age, gender, religion). +
The topics of that textual content are predictive of users' demographic information. *
If we treat demographic information as the labels, we should be able to fine-tune a pre-trained LLM on the dating profile generation task, although evaluating such a fine-tuned model can be tricky.

Dataset (the same as @volt-1): https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles/data.

runlinw0525 commented 10 months ago

Intuitions:

1. The overall attitude towards AI, as derived from a collection of course syllabi from the selected U.S. public university, is expected to be somewhat supportive. This assumption is based on the university's adoption of the GPT model and its development of a customized GPT.
2. The attitudes towards AI vary among departments. For example, the English Literature department may ban the usage of AI tools to help students avoid plagiarism while disciplines such as economics, which are more abstract, may be more generous with AI tools, treating them as powerful interpreters for various technical terms.
3. Some syllabi may lack an emphasis on AI usage; therefore, there may be a need to address AI usage in course syllabi across all universities.

Dataset: the course syllabi archive from the University of Michigan (https://webapps.lsa.umich.edu/syllabi/Default.aspx). The website is interactive and requires log-in. I scraped it using RSelenium, ChromeDriver, and a mouse simulation package in R, and the data is now sitting in a CSV file, ready for basic cleaning and further text analysis.

beilrz commented 10 months ago

Dataset: news headlines scraped from popular US news media. I am still in the process of cleaning the HTML files, and I expect the text data to be available sometime next week.

Three Intuitions:

1*. News media with different political leanings cover the same news topics at a given time.
2+. News media with different political leanings place different emphasis on news topics.
3. The political leaning of news media is mainly manifested through praise of their own political stance.

erikaz1 commented 10 months ago

The papers in my CRT corpus encourage discussion and conversation (currently driven by a handful of papers involving education). The broad content patterns occurring within my data may involve different ways to express how to perceive, experience, and learn through a new lens.

*Online & Media Discourse Dynamics differ from the Academic: Patterns might emerge in the tone, sentiment, and framing of discussions about CRT on different digital platforms, reflecting shifts in public opinion and in response to a cultural shift in the fundamental use of the phrase CRT.
Impact of Influencers and Thought Leaders: Patterns could be identified in how certain individuals or groups have a disproportionate impact on framing narratives and influencing public perception.
Geographical Patterns in Online CRT Discourse: Examining geographical markers in online discourse may uncover variations in regional perspectives, policy implications, or socio-historical attitudes toward the theory.

I will be using the S2ORC dataset (database of millions of journal articles across disciplines) and the NOW dataset (continuously updating collection of news articles spanning many decades with billions of words). https://github.com/allenai/s2orc, https://www.english-corpora.org/now/.

HamsterradYC commented 10 months ago

Content Patterns in Social Media Posts on Self-Discipline

1. Increased Personal Reflection: I expect to see a notable increase in personal reflection and introspection following posts about high self-discipline. This could manifest as more posts discussing personal goals, challenges, and achievements.

2+. Variation in Engagement Levels: Posts related to high self-discipline might receive varying levels of engagement (likes, comments, shares) depending on the tone and content. Positive and motivational posts may receive higher engagement than those perceived as overly strict or harsh.

3*+. Shift in Topics Post Self-Discipline: There could be a shift in the topics discussed after expressions of self-discipline. Users might start discussing related themes like productivity, mental health, or physical fitness. A further question is whether negative emotions such as burnout break out after users have posted about these topics for a long time.

Dataset: Posts related to self-discipline captured from Weibo, together with user attributes and social network structure. The data fetch code is still being written. Weibo API: https://open.weibo.com/wiki/%E5%BE%AE%E5%8D%9AAPI

naivetoad commented 10 months ago

There might be significant differences in the average award amounts granted to different divisions. This could reflect the varying funding needs or priorities across different scientific or academic fields.*
Another possible pattern could be a correlation between the number of citations and the award amounts. This would imply that researchers with higher citation counts tend to receive larger grants.
There might be interesting temporal trends in research funding, such as increases or decreases in funding over time for certain divisions or overall. This could be a significant surprise and provide insights into the shifting priorities in research funding.+

On the U.S. National Science Foundation website, files of historical awards by year can be downloaded. I downloaded all files from 2018, collected names, emails and academic divisions of award winners, effective and expiration dates, award amount, and abstracts, and saved them in a csv file. Based on the names and emails, I found Google Scholar urls of these award winners, collected their publication titles, research interests, h-index, total citations, and citations by year, and saved them in another csv file. However, due to limited time, I only processed 100 award winners and gathered 42 complete cases after dropping those with missing emails or absent from Google Scholar. My teammates and I will collect and process more data from different years in the following weeks.

Links: raw data from NSF files; updated data with Google Scholar info; scripts for collecting and cleaning the data
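The step of dropping award winners with missing emails or no Google Scholar profile can be sketched as a complete-case filter. This is a minimal illustration, not the linked cleaning script: the function name `complete_cases` and the field names `email` and `scholar_url` are assumptions, and the records in the usage note are made-up placeholders rather than real NSF data.

```python
def complete_cases(records, required):
    """Keep only records in which every required field is present and non-empty.

    `records` is a list of dicts (one per award winner); `required` lists the
    hypothetical field names that must be filled for a case to count as complete.
    """
    return [
        r for r in records
        if all(r.get(field) not in (None, "") for field in required)
    ]
```

For example, `complete_cases(rows, ["email", "scholar_url"])` would keep a row such as `{"name": "A", "email": "a@example.edu", "scholar_url": "https://example.org/a"}` and drop one whose `email` is empty or whose `scholar_url` key is absent.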

XiaotongCui commented 10 months ago

First intuition: Men's ideal-type descriptions place more demands on characteristics such as being family-oriented, whereas women's do not (*).

Second intuition: Most people prefer to describe an external scenario or express the desire to do something together in ideal type descriptions.

Third intuition: The differences in ideal types between men and women should be greater than within the group (+).

Data: Sourced from the Kaggle platform.

michplunkett commented 10 months ago

Three Intuitions: (1*) I expect to see similar verbiage between the literary texts and legislation. (2+) I expect to see excerpts from the literary texts either indirectly referenced or quoted through the use of synonymous text. (3) I expect to see tonal similarities between the literary texts and the enacted legislation.

Dataset: The Protocols of the Elders of Zion, The Turner Diaries, and The Bell Curve, and all approved/enacted congressional legislation relating to law enforcement officers. The texts were acquired through various open source text repositories and the legislation text was obtained through Congress's search page and API.

yunfeiavawang commented 10 months ago

Intuitions:

1*. Active user overlap positively predicts the similarity tendency of a dyad of communities.
2. Community rule isomorphism positively predicts the similarity tendency of a dyad of communities.
3+. The similarity tendency mostly happens in the dimension of sentiment rather than topic distribution.

Dataset: The top 10 feminist professional groups on the Chinese social media platform Douban. One example is the group "Women in Academia" (https://www.douban.com/group/705363/discussion?start=0&type=new).

floriatea commented 9 months ago

Some countries like Kenya, Malaysia, and Zambia have more documents related to Technology, while Tanzania and India have more on Abortion, State Law, and Women's Health. This could reflect different healthcare priorities or regulatory environments. *
The US and Canada have topics emphasizing "dollar," "company," "stock," "quarter," and "revenue," which implies a focus on the financial performance and investment aspects of companies involved in telehealth.
Some topics appear consistently across all countries, such as generic topics with terms like "health," "people," and "say," suggesting that certain aspects of telehealth are universally discussed or have global relevance.

Data is from NOW corpus https://www.english-corpora.org/now/ from 2017 to 2023.

joylin0209 commented 9 months ago

Social Network Dynamics: I anticipate observing clusters or communities within the network, indicating groups of individuals with similar interests or connections.
Geographic Influence: There might be a correlation between geographic proximity and connections within the network, reflecting real-world social interactions.
+Unexpected Behavioral Shifts: The most surprising finding would be identifying sudden shifts in network behavior or connectivity, suggesting external events or interventions impacting the network dynamics.

Description of Dataset:

I will be exploring these intuitions using a dataset containing social network data from Reddit, a popular online platform. This dataset includes information such as user profiles, connections between users, posts, comments, and likes. The data spans several years and covers users from diverse geographic locations and demographics.

(a) Link to the data: Reddit

Brian-W00 commented 8 months ago

1*. There was a decrease in users' sentiment after COVID-19 in different Reddit communities
2. The sentiment in supportive communities is slightly higher than in other groups
3. The sentiment of comments has no obvious relation to the post's sentiment

Data source: Reddit

UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter

3. Clustering & Topic Modeling to Discover Higher-Order Patterns of Meaning -Challenge #42