5. Classifying Meanings & Documents - challenge

JunsolKim commented 2 years ago

Post your response to our challenge questions.

First, pose a research question you would like to answer (in one, artfully worded sentence ending with a question mark). This could be the same question you posed for any prior week's assignment, or a new one that improves on or updates it. Second, identify a prediction that will enable you to answer your question, or validate (prove the value/relevance of) your answer. This prediction could simply be the question itself (e.g., How do I predict stock price from published company information?) or it could support or validate the answer of that question (e.g., How do I predict whether the sentiment of a given sentence is positive or negative, certain or uncertain, written by author A or author B, resonant with U.S. Republicans or Democrats, about environmental position X or Y, etc). Finally, describe the datasets on which you will (a) train your prediction model, (b) test that model, and (c) generalize that model to new, unclassed or valued cases. Parenthetically note whether this data could be made available to class this week for evaluation (not required...but if you offer it, you might get some free work done!) Please do NOT spend time/space explaining the precise model or analytical strategy you will use to generate, evaluate and utilize your prediction. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

GabeNicholson commented 2 years ago

I am interested in how Covid has been discussed in the media since the start of the pandemic. Specifically, which connotations and sentiments have changed over time, and do word embeddings differ across periods? An intuitive approach would be to compare predicted sentiments and word embeddings for a paragraph from 2020 with those for a more recent one. If a model built on the 2020 text still predicts the recent text well, that suggests not much has changed since early 2020.

The dataset would be the Coronavirus Covid dataset: https://www.corpusdata.org/corona_corpus.asp. There is more than enough data to create multiple train/test splits for a given month.

konratp commented 2 years ago

I am interested in studying how East Germans are represented by members of the federal German parliament, the Bundestag. My question is, do members of parliament, in the way they refer to the German Democratic Republic, accurately represent East Germans, many of whom felt better off during the communist days than today? How do I predict if sentiments towards the GDR in parliamentary speeches have changed over time? The dataset I will be using stems from the Open Discourse project and can be found here.

pranathiiyer commented 2 years ago

I am interested in seeing how the composition of Indian matrimonial ads in newspapers has changed (or not changed) over time. These ads could be used to train a model that identifies whether an ad is biased towards a certain community or certain attributes of appearance, and the model could then be used to predict the composition of other such ads in the future. I have data from newspaper ads across several years, which can easily be split into training and testing data. I could also sample only ads from later years to train the model if there is visible change in these ads (though there does not seem to be). Data can be found in the ads section of these archives.

Jiayu-Kang commented 2 years ago

Research Question/Predictions: how do I predict the score/rating (1-5) of a review based on its text? How do I predict its helpfulness (measured by user voting) based on its text? Dataset: Amazon movie review data available here.

NaiyuJ commented 2 years ago

Are ethnic minorities in China content with state policies and political institutions (such as the government), especially compared to Han Chinese? What are the differences among ethnic groups? What do they care about in their daily lives? My hypothesis is that ethnic minorities in China are generally satisfied with their lives, the government, and other political institutions, even though they are economically disadvantaged, because the Party penetrates ethnic regions with preferential policies that make minorities feel favored in daily life. I will do text analysis on the online discussions of different ethnic groups in their corresponding forums.

Sirius2713 commented 2 years ago

I'm interested in how Trump impacted stock markets via Twitter. How did his tweets mentioning listed companies affect the performance of those companies in the stock market? The prediction task will be to use Trump's tweets to predict a company's stock price after the related tweets are posted.

Dataset: the archive of Trump tweets. There's enough data to be split into train/test sets.

Jasmine97Huang commented 2 years ago

Given a body of lyrics, can I predict the gender of the artist? If there is a gendered difference between the vocabularies used in music lyrics, then the decision boundary would be pretty easy to find. Dataset: Billboard + Spotify, 190,000 songs.

ValAlvernUChic commented 2 years ago

I'm interested in how to predict whether an article about race in Singapore is aligned with the state narrative on race or not. This could likely be done using two datasets, one of the national library's books 'about Singapore' and another of newspaper texts over 10 years.

Hongkai040 commented 2 years ago

I am really interested in how people express their emotions in the comments they leave under a movie, a song, or a piece of news. I wonder: do people tend to leave comments containing stronger emotions (both negative and positive) over time? I came up with a very fast way to check whether this question is worth further exploration: count the usage of "?" and "!" and normalize it so that it can be compared across time.

Data: comments from movie.douban.com. I've got more than 4M comments on my computer.
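
The punctuation count described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `(year, text)` input format, the per-character normalization, and the toy comments are all hypothetical, and the full-width marks are included on the assumption that many Douban comments are written in Chinese.

```python
from collections import defaultdict

# Assumption: count both ASCII and full-width question/exclamation marks.
MARKS = "?!？！"

def punctuation_rate(comment: str) -> float:
    """Share of characters in one comment that are question or exclamation marks."""
    if not comment:
        return 0.0
    marks = sum(comment.count(ch) for ch in MARKS)
    return marks / len(comment)

def mean_rate_by_year(comments):
    """comments: iterable of (year, text) pairs. Returns {year: mean punctuation rate}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, text in comments:
        totals[year] += punctuation_rate(text)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in totals}

# Toy comments, invented purely for illustration.
sample = [
    (2012, "Nice film."),
    (2012, "Good?"),
    (2020, "Wow!!"),
    (2020, "Why?!"),
]
rates = mean_rate_by_year(sample)
```

Normalizing by comment length keeps long comments from dominating; normalizing by the number of comments per year instead would be an equally reasonable choice.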

hshi420 commented 2 years ago

I am interested in patterns of cyberbullying. How can we predict whether a tweet is cyberbullying without necessarily relying on keyword detection? There is a dataset on Kaggle: https://www.kaggle.com/andrewmvd/cyberbullying-classification.

Qiuyu-Li commented 2 years ago

First, the research question: Is there any observable difference between tweets from left-skewed media and right-skewed media? If there is, is it in terms of tone, topic, or something else, and which one dominates? Second, the hypothesis: Yes, left-skewed media do have some distinct features compared with right-skewed media. Third, the data: (a, b) for training and testing: tweets from the official accounts of media commonly considered left- or right-skewed (such as CNN and Fox News), randomly divided into training and testing sets; (c) other media whose skew we are not sure about. Fourth, data availability: I’ve collected some tweets from left- and right-skewed media (i.e. the training and testing dataset), but I haven’t got data for extrapolation.
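
A random division into training and testing sets, as mentioned in (a, b), might look like the sketch below. The function name, the 20% test share, the fixed seed, and the placeholder tweets are assumptions made for illustration only.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, test) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(items)         # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder (label, text) pairs standing in for collected tweets.
tweets = [("left" if i % 2 == 0 else "right", f"tweet {i}") for i in range(10)]
train, test = train_test_split(tweets)
```

With labeled outlets, splitting by account rather than by tweet may also be worth considering, so that the model is not tested on accounts it has already seen.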

chuqingzhao commented 2 years ago

Research Question: how can a firm's financial performance be predicted from its public communication strategies, such as pitch, pivot, and presentation? Hypothesis: Firms that can meet market expectations and pivot are more likely to perform better and succeed. Data: Crunchbase pitches, earnings calls.

isaduan commented 2 years ago

Question: how do pro-democracy revolutions or social movements change society's perception of democracy? Prediction task: train a classifier to predict whether a tweet was published before or after the revolution. Use the inverse of the accuracy score to measure the volume of change. Data could be a sample of tweets from before and after the Arab Spring in Middle Eastern countries that contain keywords like "democracy".
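
One way to picture this prediction task is the small sketch below. The post does not name a model, so the multinomial Naive Bayes with add-one smoothing is a swapped-in assumption, as are the function names and the toy tweets, which are invented and not real data.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (label, text). Returns (word counts per label, doc counts, vocabulary)."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for label, text in docs:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-probability under add-one smoothing."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for token in text.lower().split():
            if token in vocab:  # skip words never seen in training
                score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, docs):
    """Fraction of (label, text) pairs whose label is predicted correctly."""
    return sum(predict_nb(model, text) == label for label, text in docs) / len(docs)

# Invented toy tweets, for illustration only.
train = [
    ("before", "democracy is a distant dream"),
    ("before", "the regime ignores democracy"),
    ("after", "democracy needs free elections now"),
    ("after", "we vote for democracy now"),
]
test = [("before", "the regime is distant"), ("after", "free elections now")]
model = train_nb(train)
held_out_accuracy = accuracy(model, test)
```

The held-out accuracy is the quantity the post proposes to turn into a measure of change; how exactly to transform it is left to the author.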

LuZhang0128 commented 2 years ago

I'm interested in how topics and social networks evolve in online social movements. One hypothesis is that the topics related to #BLM before and after a major event, like the death of George Floyd, are significantly different. The dataset I'm using is a set of randomly sampled tweets with the hashtag #BLM.

sudhamshow commented 2 years ago

Q. Is it possible to attribute several famous riots (John Lewis - Bloody Sunday - Selma, the Jan 6th attack on the Capitol, the 2002 riots in Gujarat, India) to a speaker's call for action? I am currently scraping, transcribing and translating several relevant historical speeches for this purpose.

YileC928 commented 2 years ago

The question I am interested in is how retail investors and professional investors behave differently on social media. I will focus on investors' text posts and may supplement the study with additional information generated by interactive activities (e.g., follows, comments, and likes).
I will be scraping tweets and user profiles (label them as retail vs. professional investors).

kelseywu99 commented 2 years ago

My question is: what characteristics does a player look for when choosing an avatar to play? Extending Hoffner's research on wishful identification between media users and TV characters, I would hypothesize that a player chooses an avatar based on his or her desire to become just like the character, for a more immersive experience during gameplay.

chentian418 commented 2 years ago

I am interested in predicting the direction and dimension of monthly earnings forecast revisions for individual analysts. I will focus on Dow Jones Newswires to extract incremental value-relevant information that helps explain the revisions. The data is from Dow Jones Newswires.

facundosuenzo commented 2 years ago

How do newspapers cover and frame technological materials (platforms, programs, services) over time? How are those technological materials positively or negatively associated with ideas of the "future"? Given how certain technologies were framed, I aim to predict how the press will receive and talk about future technological developments. I'm extracting my sample from the NOW corpora (US - filtering by articles that talk about "technology").

Emily-fyeh commented 2 years ago

I am interested in generalizing and predicting how Taiwanese identity strengthens on social media platforms. Taiwan is not officially recognized as a country by most of the world, so Taiwanese people seem to be more sensitive to foreign recognition and affirmation, such as ranking high on the democracy index or being able to develop de facto diplomatic relationships with other countries. I would like to measure expression on social media to see if Taiwanese identity peaks around these incidents. The most common social media platform in Taiwan is Facebook, while Twitter is commonly used for spreading Taiwanese consciousness to the (English-speaking) world.

ttsujikawa commented 2 years ago

Research question: I'm interested in seeing how cultural background affects the ways we build relationships with people. I would focus on distinct seasons of the Japanese reality show Terrace House to see how people behave differently in a culturally diverse setting. Source: Netflix subtitles.

ZacharyHinds commented 2 years ago

My research question is how the Involuntary Celibate movement adapts its language (especially slang) to establish strong narratives within their posts and comments. For my prediction, how can I predict the contexts of incel slang, such as the words or phrases that tend to be used in conjunction?

My data set is a collection of a few thousand Incel posts/comments from an archived version of the Incel forum Incels.is (can be made available on request).

UChicago-Computational-Content-Analysis / Readings-Responses-2023
