UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter


5. Machine Learning to Classify and Relate Meanings - Challenge #28


lkcao commented 6 months ago

Post your response to our challenge questions.

First, pose a research question you would like to answer (in one, artfully worded sentence ending with a question mark). This could be the same question you posed for any prior week's assignment, or a new one that improves on or updates it.

Second, identify a prediction that will enable you to answer your question, or validate (prove the value/relevance of) your answer. This prediction could simply be the question itself (e.g., How do I predict stock price from published company information?) or it could support or validate the answer to that question (e.g., How do I predict whether the sentiment of a given sentence is positive or negative, certain or uncertain, written by author A or author B, resonant with U.S. Republicans or Democrats, about environmental position X or Y, etc.).

Finally, describe the datasets on which you will (a) train your prediction model, (b) test that model, and (c) generalize that model to new, unclassed or valued cases. Parenthetically note whether this data could be made available to the class this week for evaluation (not required... but if you offer it, you might get some free work done!). Please do NOT spend time/space explaining the precise model or analytical strategy you will use to generate, evaluate, and utilize your prediction. (Then upvote the 5 most interesting, relevant, and challenging challenge responses from others.)

sborislo commented 5 months ago

Research Question: Can a videogame have its genre identified purely based on its players' reviews?

Prediction: A videogame's genre can be predicted (among a list of possible genres) based on a representative set of reviews for that game.

Datasets (the data is not very clean as of now, so best not to share it): (a) I would train the prediction model on scraped Steam (an online videogame retailer) reviews (of twenty games per genre). (b) I would test the model on scraped Steam reviews (of two games per genre, the games naturally being different from those in the training set). (c) I would introduce five Steam games (and their scraped reviews) from genre(s) not trained on, for the model to make predictions about.

[For my reference later (sorry for the extra detail!): I would measure the similarity between the genre and other genres (and/or intuit it), and then see which genre the model assigns the unclassed game to (assigning probabilities to each possibility). An error term could then be calculated].
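A minimal sketch of what such a classifier could look like (scikit-learn; the toy reviews and genre labels below are invented stand-ins for the scraped Steam data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the scraped corpora; the real data would pool each game's
# reviews under its genre label.
train_reviews = ["tight gunplay and brutal bosses", "built a huge automated farm",
                 "relaxing tile puzzles for hours", "headshots and weapon loadouts",
                 "crop rotation is surprisingly deep", "clever logic grids"]
train_genres = ["shooter", "simulation", "puzzle"] * 2

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_reviews, train_genres)

# For an unclassed game, inspect the probability assigned to each known genre
probs = clf.predict_proba(["fun puzzles but the gunplay is weak"])[0]
print(dict(zip(clf.classes_, probs.round(2))))
```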

bucketteOfIvy commented 5 months ago

Research Question: How do trans people use 4chan's /lgbt/ board as a forum for identity building?

Prediction: A (semi-?)supervised model that predicts whether a given post is by or about a trans person, given the contents of the post represented in word-embedding space.

Datasets: (a) I can train the model on posts pulled from /lgbt/ from Sunday of sixth week to Sunday of seventh week. Specifically, I can randomly select subsets of these posts to label according to rough criteria, and then use a semi-supervised methodology to scale up the analysis. (b) I can hold out some of the initially labeled data as a test set. In this case, I mostly care about the precision of the model, so labeled positive examples should be fine. (c) I can run the model on separate samples of data; if the precision is decent, it should be able to identify relevant subsets of posts.

(I could show some unlabeled preliminary data to the class next week, but it's both not too clean and potentially a cognitohazard, so I'm not sure that would be wise).
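One hedged sketch of the semi-supervised step (scikit-learn's self-training wrapper over embedded posts; the random vectors below are placeholders for real post embeddings and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # stand-in for post embeddings
y = np.full(200, -1)                  # -1 marks unlabeled posts
y[:40] = rng.integers(0, 2, 40)       # small hand-labeled seed set

# Self-training: confident pseudo-labels get folded into the training set
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y)

# Precision on a held-out labeled slice is the number that matters here
X_test, y_test = rng.normal(size=(30, 50)), rng.integers(0, 2, 30)
print(precision_score(y_test, model.predict(X_test), zero_division=0))
```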

ethanjkoz commented 5 months ago

RQ: Do adoptees in adoptee-centric online spaces express more negative sentiment towards adoption?

PREDICTIONS: Answering this question will involve predicting whether posts are made by adoptees and whether posts are positive or negative in sentiment.

DATA: The dataset comes from two subreddits. I have access to archived data for one of them (r/Adopted) up until 2023, plus data from r/Adopted that I scraped myself. I also supplemented the r/Adoption data with more recent posts that I scraped myself. The one problem with these data is that there are few ground-truth labels to determine whether a poster is an adoptee or not. I will split training and testing across both of these subreddits, and I will generalize my model to unclassed cases from these data. (I could make the data available, as I have spent considerable time organizing it, but it is a somewhat large CSV file (370 MB).)
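A cheap first pass at the sentiment half could use a rule-based scorer such as VADER (the example posts below are invented; a trained classifier could replace this once labels exist):

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
posts = [  # invented stand-ins for scraped r/Adopted and r/Adoption posts
    "Reuniting with my birth mother was the best thing I ever did.",
    "The agency lied to my parents and I'm still angry about it.",
]
for post in posts:
    c = analyzer.polarity_scores(post)["compound"]  # -1 (negative) to +1 (positive)
    label = "negative" if c < -0.05 else "positive" if c > 0.05 else "neutral"
    print(f"{c:+.2f} {label} | {post}")
```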

yuzhouw313 commented 5 months ago

Research Question: Are there discernible differences in the emotional tone (e.g., fear, anger, hope) of comments under conservative, neutral, and liberal news channels during key periods of the COVID-19 pandemic?

Prediction: The emotional tone of user comments will significantly vary across conservative, neutral, and liberal news channels on YouTube, with conservative channels showing higher expressions of anger, liberal channels showing more expressions of hope, and neutral channels displaying a balanced mix during key periods of the COVID-19 pandemic.

Dataset: I chose three YouTube news channels based on political orientation (conservative, neutral, liberal): Fox News, ABC News, and MSNBC. Then I scraped all of their YouTube news video links and randomly chose one per month per channel in 2020, during the COVID-19 pandemic. Finally, using these links, I scraped all comments and replies in the comment sections under these videos.

(My corpus is fully collected, but since it is not very clean I am not sure if it will be useful to share)
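For the emotion-coding step, one off-the-shelf option is a public emotion checkpoint on the Hugging Face hub (the model below covers anger, fear, joy, and a few others; a category like "hope" would require fine-tuning on custom labels, and the comments are invented examples):

```python
from transformers import pipeline

emotion = pipeline("text-classification",
                   model="j-hartmann/emotion-english-distilroberta-base")

comments = ["This lockdown is destroying small businesses!!",
            "Vaccines are coming, we'll get through this together."]
for comment, out in zip(comments, emotion(comments)):
    print(out["label"], round(out["score"], 2), "|", comment)
```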

cty20010831 commented 5 months ago

Research Question: Is there a "trap" of funding in which psychology researchers with funding increasingly lose diversity in the research topics they examine?

Prediction: There might be a change in the diversity (in terms of "distance" measures in vector space/embedding models) of research topics/keywords the funded psychologists study before and after the funding.

Dataset: There are two parts to my dataset. The first is the retrieved list of funded NSF psychology projects. The second is basic personal information (e.g., title, school/institution) and publication-related information (e.g., citation count, h-index, and the titles and abstracts of papers) for the funded authors, to be scraped from Google Scholar by matching author information against the NSF funding list.
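A sketch of the "diversity as spread in embedding space" idea: mean pairwise cosine distance among a researcher's paper titles, before vs. after funding (sentence-transformers; the titles below are invented):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

model = SentenceTransformer("all-MiniLM-L6-v2")

def topic_diversity(titles):
    # Mean pairwise cosine distance; lower = narrower topic portfolio
    emb = model.encode(titles)
    n = len(titles)
    return cosine_distances(emb).sum() / (n * (n - 1))  # mean over off-diagonal pairs

before = ["Working memory and attention", "Moral judgment in children", "Sleep and mood"]
after = ["Working memory and attention", "Attention lapses", "Attention and task switching"]
print(topic_diversity(before), topic_diversity(after))
```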

erikaz1 commented 5 months ago

Primary RQ: To what extent has the history of people/individuals been preserved in our collective memory in a way that is at odds with individual recollection?

Prediction: I will predict whether the types of verbs and proper nouns in FW's journals differ from those in news articles, interviews, and textbooks on FW. (Classification task.)

Datasets: Four separate collections of documents representing the four sources of "recorded history" to compare. As for (a) and (b), I can split my corpus into train and test portions. For (c), it might be interesting to look at a different historical figure and create a completely new corpus, and see if the model still applies (though the exact verbs and nouns will likely be different and need to be re-identified). (Exactly what each of these collections will contain is still in the works. I'm not sure if I will be able to collect all of this data by Friday, but definitely within the next few days.)
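One way to sketch the feature extraction this implies is to pull verbs and proper nouns with spaCy and compare their distributions across sources (the two example sentences are invented):

```python
# python -m spacy download en_core_web_sm
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def verb_propn_profile(text):
    # Count lemmas of verbs and proper nouns in one document
    doc = nlp(text)
    return Counter(t.lemma_.lower() for t in doc if t.pos_ in {"VERB", "PROPN"})

journal = "I walked to the courthouse and argued with the clerk."
textbook = "FW marched on the courthouse and challenged the administration."
print(verb_propn_profile(journal))
print(verb_propn_profile(textbook))
```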

donatellafelice commented 5 months ago

Research Question: Are there linguistic markers associated with debate conversational style?

Prediction: We can predict if a conversation has been assigned to the debate group (vs dialogue or control) in a controlled chat based experiment.

Datasets: One full study transcript is available; I will confirm with the professor before class tomorrow that I can share it with the class. The study matched people who significantly disagreed on a specific topic (abortion, white privilege, gun control) and advised them to either debate, dialogue, or talk (control group).

ana-yurt commented 5 months ago

Research Question: How are Hui and Uyghur Muslims perceived in Chinese-language discourse?

Prediction: Given online posts scraped from under the Uyghur and Hui topic tags, can we predict the topic tag that a specific piece of text belongs to? Furthermore, can we do so without relying on the obvious keywords?

Datasets: Texts scraped from the Zhihu platform. On Zhihu, all questions are tagged with topics by the users. Besides the Uyghur and Hui topics, I compiled 8 topics based on their rates of co-occurrence with the Hui and Uyghur topic tags.
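For the "without the obvious keywords" variant, one hedged sketch is to mask the topic words themselves before vectorizing, so the classifier must lean on contextual cues (English stand-ins below; the real keyword list would need care):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

OBVIOUS = ["uyghur", "hui", "xinjiang"]  # illustrative only

def mask(text):
    # Replace every obvious keyword with a single placeholder token
    return re.sub("|".join(OBVIOUS), "MASKED", text, flags=re.IGNORECASE)

posts = ["Uyghur food in Xinjiang is wonderful",
         "Hui communities run many halal restaurants"]
tags = ["uyghur_tag", "hui_tag"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit([mask(p) for p in posts], tags)
print(clf.predict([mask("traveled through Xinjiang and loved the noodles")]))
```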

runlinw0525 commented 5 months ago

Research Question: How are U.S. public universities adapting their educational policies, particularly within course syllabi, to address “AI” and its associated regulations in an ethical manner, especially in guiding instructors and students?

Prediction: If university-wide guidelines appear to support the usage of generative AI or advocate for its ethical application, then this may be reflected in an overall increased emphasis on AI in course syllabi.

Datasets: A collection of course syllabi published in or after 2023 from one U.S. public university. They have been scraped and turned into a dataframe, but I am still working on labeling the documents.

Marugannwg commented 5 months ago

Research Questions: How do dialogues within video games reflect the perceived moral and personality attributes of characters versus those found in players' commentaries, and what does this reveal about player values and preferences in character-driven narratives?

Relevant prediction task: Use the embeddings of game-character dialogue to predict some aspects of the embeddings of players' commentaries (or the reverse?).

Dataset

michplunkett commented 5 months ago

Research Question: Can UCPD incidents classified as "Information" be reassigned to a more valid category strictly from their descriptions?

Prediction: The description text for a given UCPD incident contains enough signaling information to accurately place it in a more correct and descriptive category.

Dataset:

volt-1 commented 5 months ago

RQ: Can we infer a person's religious beliefs (atheism or theism) from the language patterns and word choices in their dating app user profile?

Prediction: A text classifier to predict whether a dating-app user is atheist or theist based on the content of their profile text.

Datasets: Dating-app profiles containing behavioral details like drinking habits and drug-usage history, annotated with the user's religious belief (atheist or theist) after a data-cleaning process. The expanded 'essay' or 'bio' text could provide additional signals correlated with religious belief. Testing: a held-out portion of the labeled profile dataset.

h-karyn commented 5 months ago

Intuitions:

- The textual content of dating-app profiles (e.g., self-intro, hobbies, future goals) is associated with demographic information (e.g., age, gender, religion).
- The topics of that textual content are predictive of users' demographic information.
- If we treat demographic information as the labels, we could fine-tune a pre-trained LLM on the dating-profile generation task, although the evaluation of such a fine-tuned model can be tricky.

Dataset: https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles/data

QIXIN-ACT commented 5 months ago

Research Question: In the evolving landscape of participatory culture, how has the narrative length and thematic content of fan fiction changed over recent years?

Prediction: It is hypothesized that contemporary fan fiction is becoming shorter in length and increasingly explores themes of darkness and violence compared to earlier works.

Datasets for Model Development and Validation:

Training Dataset: A comprehensive collection of fan fiction texts spanning various genres, time periods, and fandoms, sourced from major fan fiction repositories. This dataset will serve as the foundation for developing a prediction model capable of analyzing narrative length and thematic content. The analysis will focus on identifying shifts in thematic emphasis towards darker and more violent content, as well as changes in the average story length over time. (Availability: The data is meticulously cleaned, yet it contains a significant amount of NSFW content due to the inherent nature of fan fiction, which often explores romantic and adult themes. Given the sensitivity of this content, it is advisable not to share it widely within the class.)

Test Dataset: A distinct subset of the fan fiction database, reserved for model testing purposes. This dataset will be employed to evaluate the model's accuracy in predicting narrative length and thematic trends, ensuring that the model's insights are robust and reliable. (Availability: This dataset is derived from the same sources as the training dataset and is subject to the same considerations regarding content sensitivity.)

Generalization Dataset: New, previously unclassified fan fiction articles and texts from additional fan fiction forums. This dataset will be used to assess the model's applicability to current and emerging fan fiction, verifying its ability to generalize findings to novel content. The goal is to confirm whether the identified trends persist beyond the training and test datasets, offering insights into the ongoing evolution of fan fiction narratives.
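A minimal sketch of the length half of the hypothesis: average word count per publication year (pandas; the rows below are toy stand-ins for the real corpus, and the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({  # toy rows; the real corpus would come from the scraped repositories
    "published": ["2012-05-01", "2012-08-09", "2023-02-11", "2023-07-30"],
    "text": ["long older work " * 500, "another epic " * 400,
             "short modern fic " * 80, "tiny drabble " * 50],
})
df["year"] = pd.to_datetime(df["published"]).dt.year
df["n_words"] = df["text"].str.split().str.len()
print(df.groupby("year")["n_words"].mean())  # is mean length falling over time?
```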

Caojie2001 commented 5 months ago

Research Question: Can future public policy directions be predicted from news articles issued by the Chinese government?

Prediction: News articles about a particular public sector (e.g., healthcare, education) usually appear in a concentrated period of time that coincides with a period when the government is focusing on public policy reforms in that sector.

Dataset: The data can be scraped from the online websites of Chinese newspapers, such as XinMin.

ddlxdd commented 5 months ago

Research Question: "How can we discern patterns and topics prevalent among individuals with bipolar disorder from their daily experiences shared in the 'How's your mood today' thread on the bipolar sub-forum of Psych Forums?"

Prediction for Validation: To answer this question, my prediction model will aim to identify and categorize the prevalent sentiments and topics in the posts. I will be predicting the emotional tone (positive, negative, neutral) and the primary topics (like medication, therapy, daily challenges, etc.) of each post.

Dataset Description:

Training Dataset:

Source: Posts from the "How's your mood today" thread on the bipolar sub-forum of Psych Forums. Content: The text of the posts and their timestamps.

Test Dataset: Purpose: To evaluate the performance of my sentiment analysis and topic modeling. Selection: A subset of the scraped posts, ideally not used in training, to test the accuracy of my model.

Generalization Dataset:

Source: Could be either newer posts from the same thread or similar threads from other mental health forums.
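For the topic half, a standard starting point is LDA over the post text, with sentiment layered on per post afterwards (scikit-learn; the posts below are invented stand-ins):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = ["started a new medication and the side effects are rough",
         "therapy session went well today, feeling hopeful",
         "work was a daily challenge, mood swinging all afternoon",
         "my psychiatrist adjusted the medication dosage again"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    print(f"topic {i}:", [terms[j] for j in comp.argsort()[-4:]])  # top terms
```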

YucanLei commented 5 months ago

Research Question: Can a videogame have its genre identified purely based on its players' reviews?

Prediction: A videogame's genre can be predicted (among a list of possible genres) based on a representative set of reviews for that game.

Datasets (the data is not very clean as of now, so best not to share it): (a) I would train the prediction model on scraped Steam (an online videogame retailer) reviews (of twenty games per genre). (b) I would test the model on scraped Steam reviews (of two games per genre, the games naturally being different from those in the training set). (c) I would introduce five Steam games (and their scraped reviews) from genre(s) not trained on, for the model to make predictions about.

chenyt16 commented 5 months ago

Research Question: Do media outlets often frame abortion-related news according to their political or ideological perspectives?

Prediction: Conservative outlets might emphasize aspects like the rights of the unborn or religious perspectives, while liberal outlets may focus more on women's rights and personal autonomy.

Dataset: The dataset consists of news articles I scraped from BBC News and Fox News. I will scale up to more media platforms, but I haven't done so yet. I can share the dataset, but it's still small.

HamsterradYC commented 5 months ago

Prediction: The prediction will focus on identifying expressions of burnout in posts and on identifying shifts in the frequency and themes of burnout-related discourse on social media platforms following major societal events or other salient moments. This involves predicting the rise in discussions around burnout and identifying the main topics or themes that emerge within this discourse in the aftermath of such events.

Datasets: a) A compilation of social media posts from Reddit and Weibo, dated before and after significant societal events; this dataset will include metadata such as post timestamps, content, and engagement metrics. b) A separate set of posts from similar time frames and platforms, used to validate the predictive model's accuracy in identifying and analyzing burnout discourse. c) Generalization dataset: new, unclassified social media posts from subsequent events or periods not included in the training or testing datasets; this will help evaluate the model's applicability to future societal events and the evolving nature of burnout discourse.

The goal is to leverage natural language processing (NLP) and machine learning techniques to analyze the text data, identifying trends in the discussion of burnout and related themes.

Twilight233333 commented 5 months ago

Research Question: To what extent are people's views on issues influenced by others?

Prediction: If readers see positive comments first under the same tweet (after the order of comments is adjusted), they may be more inclined to respond positively.

Dataset: Several case tweets are used to conduct a mock questionnaire

XiaotongCui commented 5 months ago

Research Question: How can user information on OkCupid, specifically data from 2014 containing gender, age, sexual orientation, and textual self-descriptions, be utilized to predict their religious affiliations using common machine learning models?

Prediction: The prediction involves using machine learning models to analyze the existing dataset and predict the religious affiliations of OkCupid users based on their gender, age, sexual orientation, and textual self-descriptions. The model will be trained to understand patterns and correlations between these features and users' religious affiliations.

Datasets:

Training Dataset (a): The 2014 OkCupid dataset, including user information such as gender, age, sexual orientation, and textual self-descriptions.

Testing Dataset (b): A subset of new profiles generated on OkCupid, consisting of diverse user information, for evaluating the model's performance.
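A hedged sketch of how the demographic columns and essay text could be combined in one model (scikit-learn; the column names are guesses at the OkCupid dump's schema, and the rows are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({  # toy rows in the shape of the real dataset
    "sex": ["m", "f", "m", "f"],
    "age": [25, 31, 42, 28],
    "orientation": ["straight", "gay", "straight", "bisexual"],
    "essay0": ["i love church potlucks", "science and hiking",
               "bible study on sundays", "yoga and brunch"],
    "religion": ["christian", "atheist", "christian", "atheist"],
})

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "orientation"]),
    ("txt", TfidfVectorizer(), "essay0"),
], remainder="passthrough")  # lets `age` pass through numerically

model = make_pipeline(pre, LogisticRegression(max_iter=1000))
model.fit(df.drop(columns="religion"), df["religion"])
```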

floriatea commented 4 months ago

Research Question: Can the evolution of telehealth discourse be mapped through changes in linguistic complexity and sentiment in online articles and social media posts over the last five years?

Prediction: Focus on detecting shifts in linguistic complexity (using metrics like sentence length, lexical diversity) and sentiment (positive, negative, neutral) in telehealth-related discourse over time. By analyzing these linguistic features, we aim to validate the hypothesis that significant events (e.g., technological advancements, health crises) are mirrored in the complexity and sentiment of telehealth discussions, with more complex and possibly more positive discourse emerging in response to positive developments in telehealth technology and adoption.
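A minimal sketch of the two complexity metrics named above, mean sentence length and type-token ratio (pure Python; the example text is invented):

```python
import re

def complexity(text):
    # Return (mean sentence length in tokens, type-token ratio)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    mean_sentence_length = len(tokens) / len(sentences)
    type_token_ratio = len(set(tokens)) / len(tokens)  # lexical diversity
    return mean_sentence_length, type_token_ratio

print(complexity("Telehealth grew fast. Providers adapted. Patients adapted too."))
```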

Datasets:

(Data availability: The training and testing datasets, being subsets of the existing NOW dataset, could potentially be shared. The generalization dataset's preparation would depend on ongoing data collection efforts.)

This approach focuses on linguistic and sentiment evolution as indicators of societal and technological shifts. It stands to offer insights into how public and professional perceptions of telehealth have changed in response to external factors and internal developments within the field.

joylin0209 commented 4 months ago

Research Question: How do engagement patterns differ between posts and comments in r/Fitness, and can we predict whether a given text data belongs to a post or a comment based on its content?

Prediction: I predict that there are distinct linguistic and structural differences between posts and comments in r/Fitness, and by leveraging features such as text length, vocabulary usage, and sentiment, we can build a model to accurately classify whether a given text data belongs to a post or a comment.

Datasets: a) Training Dataset: We will use the provided dataset containing posts and comments from r/Fitness. The dataset includes features such as 'title' and 'comments', along with 'post_id' to establish relationships between posts and comments. By utilizing these features, we can train our prediction model to distinguish between posts and comments accurately.

b) Testing Dataset: For testing our model, we will split the dataset into training and testing sets, ensuring that both contain a representative sample of posts and comments. This split will allow us to evaluate the performance of our model on unseen data and assess its generalization capabilities.

c) Generalization Dataset: Once we have trained and tested our model, we can generalize it to new, unclassified text data from r/Fitness. This dataset will consist of additional posts and comments obtained from the subreddit after the initial dataset collection. By applying our trained model to this new data, we can classify posts and comments accurately and identify any evolving patterns in engagement behavior.

Data Availability: https://www.kaggle.com/datasets/curiel/rfitness-posts-and-comments
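A hedged sketch of the post-vs-comment classifier described above (scikit-learn; toy texts stand in for the Kaggle data, and length or sentiment features could be appended similarly):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["My 12-week cut: full routine, macros, and progress photos inside",
         "Nice work!",
         "Form check please: is my deadlift lockout okay?",
         "Try bracing harder at the bottom."]
labels = ["post", "comment", "post", "comment"]

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.5,
                                          stratify=labels, random_state=0)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # accuracy on the held-out split
```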

Brian-W00 commented 4 months ago

Research Question: Can we predict whether city parks make people happier by looking at what people say online in those places? Prediction: People talking in cities with many parks will be happier than people in cities with few parks. Datasets: We will use (a) location-tagged social media data to train the model, (b) data from different cities to test the model, and (c) new places to see whether the effect of parks on happiness generalizes. This data may be shared for class evaluation if we solve the privacy problems.

Carolineyx commented 3 months ago

Research Question: Can the psychological richness and interestingness of a couple's 'how they met' story, as quantified through computational text analysis, predict the longevity of their marital relationship?

The prediction that will enable us to answer this question—or validate our answer—focuses on the correlation between the quantified metrics of psychological richness and interestingness in 'how they met' narratives and the subsequent marital longevity. Specifically, the prediction posits that couples whose meeting stories are rated high in psychological richness and interestingness are more likely to experience longer-lasting marriages.

  1. The model will be trained on a curated dataset of 206 'how they met' stories from The New York Times wedding announcements, published between 2006 and 2010. This dataset includes narratives rated for interestingness by human raters and analyzed for psychological richness through computational methods. Each entry is labeled with the current marital status (together or not) of the featured couples, serving as the ground truth for training.

  2. The model's effectiveness will be tested on a separate, smaller subset of stories collected from the same source but withheld from the initial training phase. This will ensure the model's predictive accuracy and its ability to handle unseen data.

  3. To generalize the model, I would apply it to a novel dataset of 'how they met' stories collected from other sources or time periods not covered in the training and testing datasets. This step will evaluate the model's robustness and its applicability across diverse narratives and contexts.