lkcao opened 6 months ago
Two Intuitions: (1*) Games developed by larger companies will have reviews that cite those companies more frequently, and discuss them in a more negative light. (2+) Different genres of games will have reviews that differ in their focus on technical vs. content-based aspects of the games. Technical details will more often be negative while content details will more often be positive, meaning content-focused games (i.e., those focused on broad and original content as opposed to better performance) will be favored in the Steam market (as measured by reviewed quality).
Dataset: For all intuitions, I use this dataset scraped from Steam, which also includes the code for getting player counts. More games can be scraped in this manner, though the number of reviews scraped per game should probably be reduced substantially when doing so. The dataset contains all reviews for any game on Steam that is specified (or spidered, though I haven't done that yet). And, as stated, the other code retrieves player counts for any specified games.
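In case it helps others reuse this, here is a minimal sketch of how review pages can be requested from Steam's (unofficial) appreviews endpoint; the endpoint and parameter names reflect my understanding of that API and should be checked against the actual scraper code:

```python
# Sketch: building paginated requests against Steam's appreviews endpoint.
# Endpoint and parameter names are assumptions about Steam's unofficial
# review API; verify against the scraper code linked above.
from urllib.parse import urlencode

BASE = "https://store.steampowered.com/appreviews"

def review_url(appid, cursor="*", num_per_page=100):
    """Build one page request; pass the `cursor` value from the previous
    response's JSON to fetch the next page."""
    params = {
        "json": 1,
        "filter": "recent",
        "num_per_page": num_per_page,
        "cursor": cursor,  # "*" requests the first page
    }
    return f"{BASE}/{appid}?{urlencode(params)}"
```

Looping until the returned `cursor` repeats (or `reviews` comes back empty) retrieves every review for an app, and is also the natural place to cap the review count per game.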
Intuitions:
Countries with similar positions in certain areas (or on all issues) can be identified from the transcripts of statements made by representatives at the United Nations, the content of draft resolutions, and the voting results.*
Countries that have similar ideas on an issue are more likely to have similar cultural backgrounds, national strengths, etc. Small island states, for example, are concerned about climate change.+
Dataset:
The UN Digital Library can be found there.
Intuition 1*: Fanfiction stories with negative themes or conflicts garner more attention and engagement compared to those with more traditional, fairy-tale-like plots. This preference has become more pronounced in recent years.
Intuition 2+: Fan-fiction narratives are increasingly adopting similar plots and themes - there is a noticeable trend of decreasing innovation and novelty in fan-fiction. This could be attributed to writers being influenced by other fan-fiction works, leading them to replicate popular themes and plots. Alternatively, it may reflect a growing preference within the community for familiar themes, possibly due to a reluctance to experiment with innovative ideas that might not be as well-received.
This trend, if confirmed, would not only be surprising but also significant, as it challenges the commonly held perception of fan-fiction (participatory culture) as a highly creative and diverse domain.
Intuitions:
The textual content of dating-app profiles (e.g., self-introduction, hobbies, future goals) is associated with demographic information (e.g., age, gender, religion). + The topics of that textual content are predictive of users' demographic information. * If we treat demographic information as labels, we could fine-tune a pre-trained LLM on the dating-profile generation task, although evaluating such a fine-tuned model can be tricky.
Dataset: The dataset: https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles/data.
Possible intuitions:
Congressional legislation regarding abortion: link
Intuitions: (1) * Abortion-related news coverage is often framed according to the outlet's political or ideological perspective. Conservative outlets might emphasize aspects like the rights of the unborn or religious perspectives, while liberal outlets may focus more on women's rights and personal autonomy. (2) The way abortion is covered can vary based on the geographical location and cultural context of the media outlet (e.g., religion, political preference). (3) Coverage might vary based on the political climate, with heightened attention during key legislative debates or elections.
The dataset and script I used to acquire them can both be found here.
The database is small right now because I faced some challenges in scraping text from dynamic web pages. For example, if I want to scrape more than 10 news pieces, I need to click the "load more" button on the search results page, but I don't know how to do that...
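For the "load more" problem, one common approach is to drive a real browser with Selenium and click the button in a loop. A sketch (the button locator and page structure are assumptions to adapt to the actual search page; the selenium import sits inside the function so the small helper still runs without it):

```python
# Sketch: expanding a dynamic search page by repeatedly clicking "load more".
# Button text/locator are placeholders; adapt to the real page.
import math

def clicks_needed(total_results, per_page=10):
    """How many "load more" clicks reveal all results, assuming the first
    page already shows `per_page` items."""
    return max(0, math.ceil(total_results / per_page) - 1)

def load_all_results(url, button_text="Load more", max_clicks=50, pause=1.5):
    """Open a dynamic search page, click its load-more button until it
    disappears, then return the fully expanded HTML."""
    import time
    from selenium import webdriver  # pip install selenium
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes a chromedriver on PATH
    driver.get(url)
    for _ in range(max_clicks):
        try:
            button = driver.find_element(
                By.XPATH, f"//button[contains(., '{button_text}')]")
        except Exception:
            break  # no button left: everything is loaded
        button.click()
        time.sleep(pause)  # give the new items time to render
    html = driver.page_source
    driver.quit()
    return html
```

The returned HTML can then be parsed exactly as a static page would be.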
Intuitions:
The comments dataset can be found here
Intuitions: 1.+ Trans users of 4chan will have a large emphasis on mental aspects of womanhood (i.e. "thinking" like a woman), with many of the described aspects being describable in terms of habitus. 2.* For a forum nominally for lgbt people in general, /lgbt/ will have a disproportionate amount of discussion about and from trans people, particularly trans women.
Dataset: 4chan posts data scraped periodically from board.4chan.org/lgbt/. Script here. I really want to move this to Midway so I can automate collection (instead of just collecting posts whenever I'm at my computer and think about it).
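If cron is available on the login node, automating the collection is a one-line crontab entry (paths and the six-hour interval here are placeholders; if Midway disallows cron, a Slurm job that resubmits itself accomplishes the same):

```shell
# Run the scraper every 6 hours and append output to a log
# (install with `crontab -e`; paths are placeholders)
0 */6 * * * python3 /home/$USER/scrape_lgbt.py >> /home/$USER/scrape.log 2>&1
```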
Greater Negative Sentiment in Chinese Version: There might be a more pronounced negative sentiment towards the US government in the Chinese version compared to the foreign version. This could be due to differing editorial policies and audience targeting. *
Neutral or Positive Sentiment in Foreign Version: The foreign version of People's Daily might exhibit a more neutral or even positive sentiment towards the US government, potentially as a strategy to present a more balanced view to an international audience.+
Dataset: People's Daily Chinese version and foreign version. Chinese: https://github.com/prnake/CialloCorpus Foreign: https://github.com/702036240/Spider-People-s-daily
Users with diverse cultural backgrounds will display distinct language patterns and word choices in their profiles. This could be reflected in the use of specific phrases, idioms, or cultural references. (*)
The age of users might significantly influence the type of slang and language style used in their profiles. Younger users might employ more contemporary slang and internet jargon.
There will be notable differences in the expression of hobbies and interests based on geographic location. For example, users from coastal areas might mention more water-related activities. (+)
Dataset Description:
The dataset is from dating-app user profiles, specifically their self-introductions or bios. These texts are rich sources of personal information, including religious beliefs, lifestyle, interests, hobbies, and more.
*1. In terms of perceived cultural dimensions like violence, morality, and trustworthiness, Hui ethnic groups sit in between Han and Uyghur in the Chinese-language corpora. +2. Certain areas of representation may deviate from conventional cultural association due to the persistence of political language
Data: https://drive.google.com/file/d/1abO2GPDHMmXw6tSrx3f5eXg6lZLtK0Yp/view?usp=sharing Tencent Baseline: https://ai.tencent.com/ailab/nlp/en/download.html
Intuitions:
*+Firstly, I believe that the sentiment of adoptees' posts in adoptee-oriented subreddits will be more negative than that of adoptees in more general adoption subreddits. I also theorize that posts expressing negative sentiment towards adoption will garner a higher voting score in adoptee-oriented subreddits than those in a more general adoption subreddit. One portion of the data is available online as a JSON file (https://the-eye.eu/redarcs/; search for r/adoption in the search bar). The other portion I can send via Dropbox or some other file-sharing service if necessary.
Patterns to observe
Dataset
I intend to use the Semantic Scholar API to scrape psychology papers from the past 10 years. Here is the sample code to extract and clean the data. This is the link to the National Science Foundation site for extracting funding information.
Intuitions: 1*. Since my corpus was retrieved from feminist professional social media communities, I think the attitude towards male community members over time could evolve in a specific pattern. For example, when the group "Women in Social Science" began to exclude all the male members, the attitude toward males could be aggressive, and thus influence other communities in the same niche to reflect on gender issues in group management. Co-evolution or ignition of aggressive affect could happen.
Dataset: https://drive.google.com/drive/folders/1uy3XpwctnhN_IwSbyDNkT9mv7ohtq0sK
I've been surprised by the results of topic modeling the dataset of Reddit posts I am using. Since the dataset consists of two categories of posts, depressed and non-depressed, I thought 2 clusters (left) would definitely be optimal. However, 3 (right) is actually much better according to the metrics:
For 2 clusters: Homogeneity: 0.118, Completeness: 0.137, V-measure: 0.127, Adjusted Rand Score: 0.123
For 3 clusters: Homogeneity: 0.381, Completeness: 0.256, V-measure: 0.306, Adjusted Rand Score: 0.358
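For anyone wanting to sanity-check numbers like these, homogeneity, completeness, and V-measure are simple entropy ratios. A pure-Python sketch with toy labels (this mirrors what sklearn's `homogeneity_completeness_v_measure` computes):

```python
# Entropy-based clustering metrics (pure-Python sketch;
# sklearn.metrics.homogeneity_completeness_v_measure does the same).
from collections import Counter
from math import log

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _conditional_entropy(labels, given):
    """H(labels | given): entropy of labels within each group, weighted
    by group size."""
    n = len(labels)
    groups = {}
    for lab, g in zip(labels, given):
        groups.setdefault(g, []).append(lab)
    return sum(len(v) / n * _entropy(v) for v in groups.values())

def v_measures(truth, pred):
    """Return (homogeneity, completeness, V-measure) for a clustering."""
    h_c, h_k = _entropy(truth), _entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1 - _conditional_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1 - _conditional_entropy(pred, truth) / h_k
    v = 0.0 if homogeneity + completeness == 0 else \
        2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v
```

A perfect clustering scores 1.0 on all three; splitting every true class evenly across clusters drives homogeneity to 0.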
It seems like the really depressed people are the purple ones in both charts. 3 clusters being better seems to imply that the difference between the red and green groups is significant -- maybe as significant as that between either "normal" group and the depressed people. But who are the red and green groups? The green group cares (or talks) about friends a lot. A couple of their top words are "boyfriend" and "hang". Sounds like young people! Meanwhile, the reds talk about family, home, and work, and among their top words are "husband" and "old", so they are definitely older.
My question is, what will happen when you apply the same clusters to a (relatively) normalized data set like COCA? My hypotheses: *The purple group will represent a small fraction of the overall population, but one that has grown over time. If you go back in time, the purple group will fade away, indicating that widespread depression is a modern phenomenon. Meanwhile, the red group will get larger, since older people represent more conservative attitudes. +However, there will be a constant emergence of a "new" red group as we go back in time. The older generation will become young compared to their forebears.
candor corpus: pairs that talk about family or politics are more likely to have either notably more or notably less flowing conversations - i am thinking more animated, with fewer filler words and pauses + pairs that talk about neutral topics have more stilted conversation *
booth studies: conversations that are marked as 'dialogue' will have more filler words * conversations marked as 'debate' will have more individual topics and more personal topics +
candor corpus is available for download on request from here: https://www.science.org/doi/10.1126/sciadv.adf3197 still collecting the other data from the booth studies - i am not sure if the professors would give me permission to share, will check for next week!
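A quick sketch of the filler-word measure behind these intuitions (the filler list is a guess and would need tuning against the actual CANDOR transcripts):

```python
# Sketch: filler-word rate per transcript. The filler list is a guess;
# tune it against the real transcripts.
import re

FILLERS = {"um", "uh", "like", "so", "well", "hmm", "erm"}

def filler_rate(transcript):
    """Fraction of tokens in a transcript that are filler words."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    return sum(t in FILLERS for t in tokens) / len(tokens)
```

Comparing this rate across topic-labeled conversation pairs would directly test the "more animated, fewer fillers" hunch.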
Here are some content patterns I am considering for my project on tracking perspectives and discourse on “CRT”.
I may be able to map a cultural shift in the fundamental use of the phrase CRT during the period of 2020-2022 over various types of media. To begin, I may associate the term with other adjectives over time. I highly expect to see that the phrase “CRT” holds a specific, unique connotation in the popular-news media over a particular time period (not surprising).
Patterns could be identified in how certain individuals, groups, and public intellectuals have a disproportionate impact on framing narratives and influencing public perception. It may be interesting to explore how long these entities hold influence over a portion of society and whether/when shifts in perceptions occur.
I am still using the S2ORC dataset (a database of millions of journal articles across disciplines; I will be focusing on social science articles) (https://github.com/allenai/s2orc) and the NOW dataset (a continuously updating collection of news articles with billions of words): https://www.english-corpora.org/now/.
Intuition: Games that are stitched together from different genres are more likely to receive reviews from different perspectives, and perhaps more extreme ones.
Some games, depending on specific cultural norms or the particular way the game is delivered, may receive surprising review keywords (e.g., Pokemon-like games being described as rediscovering slavery).
Dataset: For all intuitions, I use this dataset scraped from Steam, which also includes the code for getting player counts. More games can be scraped in this manner, though the number of reviews scraped per game should probably be reduced substantially when doing so. The dataset contains all reviews for any game on Steam that is specified (or spidered, though I haven't done that yet). And, as stated, the other code retrieves player counts for any specified games.
Possible intuitions:
Congressional legislation regarding abortion: Link
My biggest intuition is that my corpus is not suitable for clustering: even though the algorithm tries its best to form clusters, the scores are low and the clusters carry no intuitive information.
Words that changed the most from the cleaned telehealth corpus from 2017 to 2023:
Words like 'pacifica', 'persistence', 'analytics', 'evangelist', 'telecardiology', 'mantra', 'enduser', 'wise', and 'devoted' are among those that have shown significant semantic shifts. These words likely represent emerging concepts, technologies, or trends that have gained prominence or evolved in meaning during the period studied. For instance, 'analytics' and 'telecardiology' might indicate a growing focus on data analysis and remote healthcare, respectively. 'Evangelist' in a modern context often refers to someone who promotes a particular technology or innovation, which could suggest an evolving role in the tech or business sectors. Data is from purchased NOW data from 2017-2023
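The shift scores behind a list like this reduce to cosine distance between a word's vectors in the two time slices, assuming the 2017 and 2023 embedding spaces have already been aligned (e.g., via orthogonal Procrustes). A sketch with toy vectors:

```python
# Sketch: ranking words by semantic shift = 1 - cosine(vec_2017, vec_2023).
# Assumes the two embedding spaces were already aligned (e.g., orthogonal
# Procrustes); vectors below would come from the two time-sliced models.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def shift_ranking(vecs_old, vecs_new):
    """Words shared by both models, sorted by shift score, largest first."""
    shared = vecs_old.keys() & vecs_new.keys()
    scores = {w: 1 - cosine(vecs_old[w], vecs_new[w]) for w in shared}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Words like 'telecardiology' would be expected near the top of such a ranking, while stable vocabulary sits near zero.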
Intuitions:
Sentiment Analysis: I expect to observe varying sentiment patterns in the comments, reflecting readers' emotions and opinions towards the articles' topics and contents.
+Topic Engagement Over Time: The most surprising finding would be a significant shift in engagement levels across different topics over time, indicating evolving reader interests and societal trends.
Description of Dataset:
I will be exploring these intuitions using a dataset containing comments made on articles published by the New York Times. The dataset covers the periods of January to May 2017 and January to April 2018. It consists of two CSV files: one containing information about the articles and another containing information about the comments made on those articles.
(a) Link to the data: New York Times Articles and Comments Dataset
Intuitions: *The prevalence of burnout language will be higher in tweets from user accounts that frequently post about work and productivity.
+ Posts containing expressions of burnout will receive more engagement (likes, retweets) than typical tweets, indicating a communal resonance with the sentiment of burnout. This would be surprising as it suggests a collective acknowledgment and shared experience of burnout.
To explore these intuitions I will download several subreddits and create a large corpus of posts filtered by hashtags or keywords associated with work, productivity, and burnout.
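A minimal sketch of that keyword filter (the keyword list is illustrative, not final):

```python
# Sketch: filtering a dump of subreddit posts down to a burnout corpus.
# Keyword list is illustrative and would be expanded for the real corpus.
KEYWORDS = {"burnout", "burned out", "exhausted", "overworked", "productivity"}

def filter_posts(posts, keywords=KEYWORDS):
    """Keep posts whose text mentions any keyword (case-insensitive)."""
    kept = []
    for post in posts:
        text = post.lower()
        if any(k in text for k in keywords):
            kept.append(post)
    return kept
```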
1 *. There was a decrease in users' sentiment after COVID-19 in different Reddit communities
Narrative Complexity and Psychological Richness: My primary intuition is that stories with higher levels of narrative complexity and psychological richness—characterized by unexpected encounters, emotional depth, and detailed descriptions—correlate with a greater sense of relationship fulfillment and resilience. This intuition is grounded in the notion that complex narratives may foster a stronger, more nuanced bond between partners, potentially influencing the longevity of their relationship.
Initial Meeting Context and Marital Stability: A secondary, yet intriguing, intuition is that the context of a couple's initial meeting (e.g., through unexpected circumstances vs. traditional settings) might impact marital stability.
To explore these intuitions, I will build an embedding model on the dataset comprising 206 'how they met' stories collected from The New York Times wedding announcements, published between 2006 and 2010. This dataset, enriched with human and machine ratings of narrative interestingness and psychological richness, offers a fertile ground for examining the nuances of romantic inception narratives.
Due to the sensitive nature of the personal stories and the proprietary constraints of the data collection method, the dataset cannot be made publicly available.
Post your response to our challenge questions.
First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction--to condition your surprise. Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).