Open lkcao opened 10 months ago
Three Intuitions: (1) When parsing for three clusters of similar reviews, the three clusters will be identifiable as (i) technical/mechanical details, (ii) content/story/experiential analysis, and (iii) joke reviews of some sort. (2+) Even if (1) is not borne out, the clusters actually observed will differ in their prominence across game categories and/or popularity levels (total number of players). (3*) Less serious/informative reviews will be made for cheaper games.
Dataset: For all intuitions, I will use this dataset scraped from Steam, which also includes the code for getting player counts. More games can be scraped in this manner, and the number of reviews scraped should probably be reduced substantially when doing so. The dataset contains all reviews for any game on Steam that is specified (or spidered, though I haven't done that yet). And, as stated, the other code gets player counts for any specified games.
Patterns to observe:
Dataset: I intend to use the Semantic Scholar API to scrape psychology papers from the past 10 years. Here is the sample code to extract and clean data.
Possible intuitions:
Congressional legislation regarding abortion link
Intuitions:
Some software now uses algorithmic recommendation mechanisms, leading to an increasing focus on specific topics*
Algorithmic mechanisms filter comments, causing people to become more and more convinced of their own opinions+
Algorithmic mechanisms automatically generate reply suggestions, resulting in homogenization of people's comments and replies
Cluster analysis can be used to study the categories of users' followers on social platforms such as Twitter
Intuitions:
The popularity of certain topics in newspapers may have a predictable temporal pattern that is embedded into existing orders of political agenda.
The topic similarity between newspapers published by local governments and those published by the central government is driven by specific political events rather than being constant over time.
For local newspapers, articles related to certain locations may have different sentiment patterns, depending on the relationship between the location and the government that publishes the newspaper.
The dataset for analysis can be obtained from newspaper websites such as Xin Min. Here is an example of data scraping.
Intuitions:
The comments dataset can be found here
Expected patterns:
Dataset: For the classifier, Reddit posts from depressed and non-depressed users are provided by a previous study. If more data is needed, I can also use the Pushshift archives. For the historical comparison, I will use Google Ngrams and COCA.
Intuition 1*: In the realm of fanfiction, newer works tend to be less engaged with, possibly due to a general trend towards shorter attention spans. This could manifest in shorter fanfics receiving more kudos and hits in recent times compared to longer ones. This intuition is significant because it speaks to changing reader preferences and behaviors in online literature communities. If true, it could indicate a broader shift in content consumption patterns.
Intuition 2: Fanfiction stories with negative themes or conflicts garner more attention and engagement compared to those with more traditional, fairy tale-like plots. This could be a surprising insight to the research community. It challenges the common perception that audiences prefer more positive or escapist narratives, especially in fan-created works.
Intuition 3: The amount and nature of comments received by fanfiction authors can positively influence their writing frequency and ability to complete works, even in a non-profit setting. This intuition focuses on the social aspect of fanfiction communities. It suggests that community feedback is a significant motivator and support mechanism for content creators.
Three Intuitions: (1) Transcripts of the experiments show there are linguistic patterns that are more effective in rebuilding trust (experiments already run by Booth). (2+) People presenting to a public forum also use specific language when they feel they are distrusted by their audience. (3*) Historical data will show that public presentations in situations where there is distrust (speaker distrusted by audience) are influenced heavily by popular culture (closer to movie/TV dialogue than real speaking).
Dataset: I am currently waiting for the data from these studies, which have already been run, to be shared. After I receive the data, I will put together a historical corpus to compare it to. I propose to use public presentations from the CDC during COVID and shareholder meetings of publicly traded companies after major scandals, and to compare them to the CANDOR corpus for real speaking (https://www.science.org/doi/10.1126/sciadv.adf3197) as well as the TV and movie databases we have.
Greater Negative Sentiment in Chinese Version: There might be a more pronounced negative sentiment towards the US government in the Chinese version compared to the foreign version. This could be due to differing editorial policies and audience targeting.*
Variation in Sentiment Over Time: There will be significant fluctuations in sentiment towards the US government across different decades, potentially reflecting the changing political and economic relations between China and the United States.
Neutral or Positive Sentiment in Foreign Version: The foreign version of People's Daily might exhibit a more neutral or even positive sentiment towards the US government, potentially as a strategy to present a more balanced view to an international audience.+
Dataset: People's Daily, Chinese version and foreign version. Chinese: https://github.com/prnake/CialloCorpus Foreign: https://github.com/702036240/Spider-People-s-daily
Intuitions: Prevalence of Technology-Related Skills (*): a significant portion of job listings will emphasize technology-related skills, reflecting the growing demand for tech proficiency in various industries. Rise in Remote Work Opportunities: If true, the notable presence of remote work opportunities in job listings would be a major revelation, indicating a substantial shift in work culture, particularly important for the research community focusing on labor market trends. Diversity and Inclusion Initiatives: an increasing number of companies will mention their commitment to diversity and inclusion within their job listings, reflecting a broader societal shift towards these values.
Dataset: I have two separate datasets of LinkedIn job listings in the U.S. The first was obtained directly from Kaggle and records 30,000 job listings from 2019. The second I am collecting myself by web scraping; it covers job listings from December 2023 and January 2024. Collection is still ongoing and I have only 3,000+ listings so far. The scraping process is quite time-consuming because I have to add a random sleep time to each iteration when using Selenium in order to avoid triggering the site's anti-scraping mechanism. Check out the data available here
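The random sleep between iterations mentioned above could look roughly like the following. This is only a minimal sketch, not the poster's actual script: the function names `random_delay` and `scrape_with_pauses`, the delay bounds, and the `fetch` callable are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Iterable, List


def random_delay(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Draw a pause length (in seconds) uniformly between min_s and max_s."""
    if min_s < 0 or max_s < min_s:
        raise ValueError("need 0 <= min_s <= max_s")
    return random.uniform(min_s, max_s)


def scrape_with_pauses(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_s: float = 2.0,
    max_s: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing a random time between requests.

    `fetch` is a hypothetical callable supplied by the caller; `sleep` is
    injectable so the loop can be exercised without actually waiting.
    """
    url_list = list(urls)
    results = []
    for i, url in enumerate(url_list):
        results.append(fetch(url))
        # No pause is needed after the final request.
        if i < len(url_list) - 1:
            sleep(random_delay(min_s, max_s))
    return results
```

With Selenium, `fetch` might wrap `driver.get(url)` followed by reading `driver.page_source`. The bounds of 2 to 6 seconds are placeholders, and the right values depend on the site.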
Intuitions: 1.+ Trans users of 4chan will have a large emphasis on mental aspects of womanhood (i.e. "thinking" like a woman), with many of the described aspects being describable in terms of habitus.
Dataset: 4chan post data scraped periodically from board.4chan.org/lgbt/. WIP script here; I plan to flesh this out so I can call it periodically from either one of the Midway clusters or a Google Colab notebook, allowing for a better sample of data.
1.*+ The sentiment towards adoption in r/Adopted and r/Adoptees will be significantly different. r/Adopted will be more positive or neutral, while r/Adoptees will be more negative.
My research questions are about the social phenomena around the consumption of "Waifu games" -- free-to-play, primarily mobile games that originated from Japanese anime/manga aesthetics; their key feature is that the entire development of game content revolves around selling the characters portrayed (often through gacha pulls)
Intuitions:
Expected patterns: *1. Zhihu posts discussing different ethnic groups will frame their discussion around distinct sets of topics
Here is the link to the forum where I am planning to scrape the data: psych forum
Three intuitions: (1) Abortion-related news is often framed according to the political or ideological perspective of the media outlet. Conservative outlets might emphasize aspects like the rights of the unborn or religious perspectives, while liberal outlets may focus more on women's rights and personal autonomy. (2) The way abortion is covered can vary based on the geographical location and cultural context of the media outlet (e.g., religion, political preference). (3) Coverage might vary based on the political climate, with heightened attention during key legislative debates or elections.
I tried to use Davies's News on the Web (NOW) corpus (https://www.english-corpora.org/now/), but it didn't cite the source of each piece of news very well. So I will probably scrape the news myself, and I need some more time to get it prepared.
Intuitions:
Dataset Description: The dataset we plan to use consists of a collection of dating-app user profiles, specifically their self-introductions or bio (essays). These text data may contain clues about users' religious beliefs, lifestyle, interests, hobbies, and other personal information.
Intuitions:
Dataset (the same as @volt-1): https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles/data.
Intuitions:
Dataset: the course syllabi archive from the University of Michigan, and the website itself is interactive and requires log-in (https://webapps.lsa.umich.edu/syllabi/Default.aspx). I scraped it using RSelenium, Chrome driver, and the Mouth Simulation package in R, and the data is now sitting in a CSV file, ready for basic cleaning and further text analysis.
Dataset: news headlines scraped from popular US news media. I am still in the process of cleaning the HTML files, and I expect the text data to be available sometime next week.
Three Intuitions: 1*. News media with different political leanings cover the same news topics at a given time. 2+. News media with different political leanings place different emphasis on those news topics.
The papers in my CRT corpus encourage discussion and conversation (currently driven by a handful of papers involving education). The broad content patterns occurring within my data may involve different ways to express how to perceive, experience, and learn through a new lens.
I will be using the S2ORC dataset (database of millions of journal articles across disciplines) and the NOW dataset (continuously updating collection of news articles spanning many decades with billions of words). https://github.com/allenai/s2orc, https://www.english-corpora.org/now/.
Content Patterns in Social Media Posts on Self-Discipline
1. Increased Personal Reflection: I expect to see a notable increase in personal reflection and introspection following posts about high self-discipline. This could manifest as more posts discussing personal goals, challenges, and achievements.
2+. Variation in Engagement Levels: Posts related to high self-discipline might receive varying levels of engagement (likes, comments, shares) depending on the tone and content. Positive and motivational posts may receive higher engagement than those perceived as overly strict or harsh.
3*+. Shift in Topics Post Self-Discipline: There could be a shift in the topics discussed after expressions of self-discipline. Users might start discussing related themes like productivity, mental health, or physical fitness. I will also look at whether there is an outburst of negative emotions, such as burnout, after users have posted about these topics for a long time.
Dataset: Posts related to self-discipline captured from Weibo, together with user attributes and social network structure. The data-fetching code is still being written. Weibo API: https://open.weibo.com/wiki/%E5%BE%AE%E5%8D%9AAPI
On the U.S. National Science Foundation website, files of historical awards by year can be downloaded. I downloaded all files from 2018, collected the names, emails and academic divisions of award winners, effective and expiration dates, award amounts, and abstracts, and saved them in a CSV file. Based on the names and emails, I found the Google Scholar URLs of these award winners, collected their publication titles, research interests, h-index, total citations, and citations by year, and saved them in another CSV file. However, due to limited time, I only processed 100 award winners and gathered 42 complete cases after dropping those with missing emails or absent from Google Scholar. My teammates and I will collect and process more data from different years in the following weeks.
raw data from NSF files; updated data with Google Scholar info; scripts collecting and cleaning the data
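The complete-case step described above (dropping award winners with a missing email or no Google Scholar match) might be sketched as follows. This is not the linked script: the field names `email` and `scholar_url` and the toy records are hypothetical stand-ins for whatever columns the CSV files actually use.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[str]]


def complete_cases(
    awardees: List[Record],
    required: Sequence[str] = ("email", "scholar_url"),
) -> List[Record]:
    """Keep only records where every required field is present and non-blank."""
    return [
        row
        for row in awardees
        if all((row.get(field) or "").strip() for field in required)
    ]


# Hypothetical toy records, for illustration only.
awardees = [
    {"name": "A", "email": "a@example.edu", "scholar_url": "https://example.org/a"},
    {"name": "B", "email": "", "scholar_url": "https://example.org/b"},  # missing email
    {"name": "C", "email": "c@example.edu", "scholar_url": None},  # not found on Scholar
]

kept = complete_cases(awardees)
print(f"{len(kept)} of {len(awardees)} records are complete")
```

Treating blank strings and `None` the same way is a design choice here; a real pipeline may also want to log which field caused each drop.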
First intuition: Men's ideal-type descriptions place more demands on characteristics like being family-oriented, whereas women's do not (*).
Second intuition: Most people prefer to describe an external scenario or express the desire to do something together in ideal type descriptions.
Third intuition: The differences in ideal types between men and women should be greater than the differences within each group (+).
Data: Sourced from the Kaggle platform.
Three Intuitions: (1*) I expect to see similar verbiage between the literary texts and legislation. (2+) I expect to see excerpts from the literary texts either indirectly referenced or quoted through the use of synonymous text. (3) I expect to see tonal similarities between the literary texts and the enacted legislation.
Dataset: The Protocols of the Elders of Zion, The Turner Diaries, and The Bell Curve, plus all approved/enacted congressional legislation relating to law enforcement officers. The texts were acquired through various open source text repositories, and the legislation text was obtained through Congress's search page and API.
Intuitions: 1*. Active user overlap positively predicts the similarity tendency of a dyad of communities.
Dataset: The top 10 feminist professional groups on the Chinese social media platform Douban. One example is the group "Women in Academia" (https://www.douban.com/group/705363/discussion?start=0&type=new).
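One possible way to quantify the "active user overlap" of a dyad of communities is the Jaccard index of their active-user sets. The sketch below assumes that operationalization, which the post does not specify; the group names and user IDs are made-up toy data.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of users active in both groups among users active in either."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def dyad_overlaps(active_users: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Compute the Jaccard overlap for every unordered pair of groups."""
    return {
        (g1, g2): jaccard(active_users[g1], active_users[g2])
        for g1, g2 in combinations(sorted(active_users), 2)
    }


# Hypothetical toy data: group name -> set of active user IDs.
active_users = {
    "group_a": {"u1", "u2", "u3"},
    "group_b": {"u2", "u3", "u4"},
    "group_c": {"u5"},
}

overlaps = dyad_overlaps(active_users)
for dyad, score in overlaps.items():
    print(dyad, round(score, 2))
```

Each dyad's overlap score could then be compared against whatever similarity measure is chosen for the groups' content.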
Data is from the NOW corpus (https://www.english-corpora.org/now/), covering 2017 to 2023.
Social Network Dynamics: I anticipate observing clusters or communities within the network, indicating groups of individuals with similar interests or connections.
Geographic Influence: There might be a correlation between geographic proximity and connections within the network, reflecting real-world social interactions.
+Unexpected Behavioral Shifts: The most surprising finding would be identifying sudden shifts in network behavior or connectivity, suggesting external events or interventions impacting the network dynamics.
Description of Dataset:
I will be exploring these intuitions using a dataset containing social network data from Reddit, a popular online platform. This dataset includes information such as user profiles, connections between users, posts, comments, and likes. The data spans several years and covers users from diverse geographic locations and demographics.
(a) Link to the data: reddit
1*. There was a decrease in users' sentiment after COVID-19 in different Reddit communities
Post your response to our challenge questions.
First, write down three intuitions you have about broad content patterns you will discover in your data. Place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Second, describe the dataset(s) on which you will build an unsupervised model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, or (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise unsupervised strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).