Classifying Meanings & Documents - Challenge

jamesallenevans commented 3 years ago

First, pose a research question you would like to answer (in one artfully worded sentence ending with a question mark). This could be the same question you posed for any prior week's assignment, or a new one that improves on or updates it. Second, identify a prediction that will enable you to answer your question, or validate (prove the value/relevance of) your answer. This prediction could simply be the question itself (e.g., How do I predict stock price from published company information?) or it could support or validate the answer to that question (e.g., How do I predict whether the sentiment of a given sentence is positive or negative, certain or uncertain, written by author A or author B, resonant with U.S. Republicans or Democrats, about environmental position X or Y, etc.?). Finally, describe the datasets on which you will (a) train your prediction model, (b) test that model, and (c) generalize that model to new, unclassed or unvalued cases. Parenthetically note whether this data could be made available to the class this week for evaluation (not required...but if you offer it, you might get some free work done!). Please do NOT spend time/space explaining the precise model or analytical strategy you will use to generate, evaluate and utilize your prediction. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others.)

Bin-ary-Li commented 3 years ago

Question/prediction: How do I predict art auction price from the description of the artwork and the artist's name?

Dataset: Scraped auction entries from Sotheby's website. Leave out 5% of the data for final testing and use the other 95% for modeling. The goal is to be able to generalize to any new entry Sotheby's lists. (Dataset is available, but cleaning needs to be done.)
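A minimal sketch of the 95%/5% holdout described above, using only the standard library. The function name `holdout_split`, the `lot_id` field, and the toy entries are hypothetical placeholders, not part of the actual Sotheby's dataset.

```python
import random


def holdout_split(entries, test_fraction=0.05, seed=0):
    """Shuffle entries and hold out a fraction for final testing."""
    shuffled = list(entries)
    # A fixed seed keeps the held-out set the same across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # Return (modeling set, final test set).
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-ins for scraped auction entries.
entries = [{"lot_id": i} for i in range(200)]
train, test = holdout_split(entries)
print(len(train), len(test))  # 190 10
```

Fixing the seed matters here because the 5% should stay untouched until the final evaluation; reshuffling between runs would leak test entries into modeling.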

chiayunc commented 3 years ago

Question: The same as before: how does legal rhetoric shift within international climate change law? Prediction needed to answer this question: predict which legal instrument a given passage or piece of content is most about.

Dataset: I might need blog posts from NGOs, scientific reports from scientific groups, and news or papers that contain prelabeled content (for example, a corpus corresponding to article X of the UNFCCC). Train on the prelabeled dataset and predict/label the UNFCCC text.

jacyanthis commented 3 years ago

Question: Can we predict whether AI news coverage is focused on AI performance, meaning AI that is economically efficient and productive, or AI ethics, meaning AI that is fair and beneficial to society?

Dataset: News coverage, such as via News on the Web (NOW) or ProQuest. (NOW is available but is large and takes some time to format into a dataframe.)

theoevans1 commented 3 years ago

Question: What kinds of narratives do fans look for in fan works and fanfiction?

Prediction: How can I predict the number of likes (kudos) on a fanfiction story based on factors like tags, included characters, and language used in the story?

Dataset: Stories and metadata from fanfiction sites like archiveofourown.org or fanfiction.net, assembled over a set period of time.

william-wei-zhu commented 3 years ago

Question: Can we predict which character in The Office (US) is speaking from their lines in the script?

Dataset: Here is the complete Office transcript.

k-partha commented 3 years ago

Question: Can we predict a person's career path (choice of industry, graduate school etc.) based on their profile and undergraduate/high school information?

Dataset: LinkedIn profiles spidered/scraped from the web (data not immediately available).

toecn commented 3 years ago

Question: How does populist rhetoric emerge and change in a well-established democracy?

Prediction (label task): Label Donald Trump's tweets according to the following categories:
1) Attack on a group, e.g., immigrants
2) Attack on an individual, e.g., Obama, Clinton, Pelosi
3) Attack on an institution, e.g., the electoral process
4) Other ...
5, 6, 7) Combinations of the above (as Hopkins and King (2010) state, categories should be mutually exclusive, so 1, 2, 3 can be combined into three additional alternatives).
Draw a random sample per year since 2012 to code.
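A rough sketch of the per-year random sample mentioned above, assuming each tweet has already been parsed into a dict with a `year` field. The field names, the sample size, and the toy tweets are hypothetical and do not reflect the actual archive format.

```python
import random
from collections import defaultdict


def sample_per_year(tweets, n_per_year, start_year=2012, seed=0):
    """Draw up to n_per_year random tweets from each year since start_year."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for tweet in tweets:
        if tweet["year"] >= start_year:
            by_year[tweet["year"]].append(tweet)
    sample = []
    for year in sorted(by_year):
        pool = by_year[year]
        # Sample without replacement; take the whole year if it is small.
        sample.extend(rng.sample(pool, min(n_per_year, len(pool))))
    return sample


# Hypothetical toy data: 20 tweets per year for 2010 through 2014.
tweets = [
    {"year": y, "text": f"tweet {i} from {y}"}
    for y in range(2010, 2015)
    for i in range(20)
]
to_code = sample_per_year(tweets, n_per_year=5)
print(len(to_code))  # 15
```

Sampling the same number of tweets per year keeps high-volume years from dominating the hand-coded set.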

Data: http://www.trumptwitterarchive.com/

jinfei1125 commented 3 years ago

Question/Prediction: How can we predict the unemployment rate, people's disposable income, or even financial crises based on online discussion about personal finance?

Dataset: 'Hot' Articles in the Personal Finance subreddit: Data

jcvotava commented 3 years ago

Question: How is the popularity of horror stories related to their themes and subject matter?

Prediction: Reddit short stories on r/nosleep (a popular forum for short, amateur horror stories) will be less popular the more associated they are with fantasy/supernatural themes and language.

Data:
Reddit page: https://www.reddit.com/r/nosleep/
Python Reddit API wrapper: https://praw.readthedocs.io/en/v2.1.21/
Scrape a portion of the subreddit's stories and use random sampling to create training and test divisions.

(Not my final project but an interesting question)

romanticmonkey commented 3 years ago

Question: Do professional movie reviews differ in discourse focus from non-professional (layperson) movie reviews?

Prediction: The perspectives of movie critics show more focus on the film design, while the general audience cares more about the "enjoyability" of the film.

Dataset:

Rotten Tomatoes critics reviews (https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset)
Amazon Movie & TV reviews (need to filter TV) (https://nijianmo.github.io/amazon/index.html)

dtanoglidis commented 3 years ago

Question: What are the salient, location-specific characteristics in the way people describe their stay in different cities across the world, as coded in Airbnb reviews?

Prediction: Given an Airbnb review, we can predict the city/country it comes from (after removing, of course, location names etc. from the review).

Dataset: Inside Airbnb (http://insideairbnb.com/)

MOTOKU666 commented 3 years ago

Question: How do I predict the type of an immigration policy (accommodating, restrictive, or neutral)? That is, how do I find the characteristics of the words that imply each type?

Dataset: NCSL dataset (https://www.ncsl.org/research/immigration/immigration-laws-database.aspx#database)

sabinahartnett commented 3 years ago

Question: Are there linguistic similarities within different subcategories of newspaper publications? (E.g., are articles published in 'World News' more similar to each other than to those in, say, 'Politics'?) And do any correlations exist between categories? Can we predict an article's category based on its text?

Dataset: Various news publication sites (split into training and testing sets). The more diverse the publications, the more generalizable the model will likely be. (Data not collected/available.)

xxicheng commented 3 years ago

Question: Is classical music disappearing from news coverage under the impact of popular culture?

Dataset: News on the Web (NOW)

Raychanan commented 3 years ago

Question: How can we predict the increase or decrease of COVID-19 infections using text data related to "going out" that people post on social media?

Data: Twitter

hesongrun commented 3 years ago

Question: How do we predict stock movement given the textual information released about the company? (This is pretty interesting given the recent GameStop event: can we use discussion on public forums to gauge individual investors' interest in certain stocks?)

Data: News, company announcements, and discussion on online stock trading forums.

dtmlinh commented 3 years ago

Question: Can we predict whether a news article (or a piece of text) is supporting or opposing climate change mitigation?

Dataset: News coverage

mkjang17 commented 3 years ago

Question: How do I predict whether a consumer review contains more subjective information or objective information?

Data: Amazon.com consumer reviews corpora, MTurk ratings

egemenpamukcu commented 3 years ago

Question: Can we predict the 'winner' of a debate in the eyes of an audience?

Data: Intelligence Squared and Munk Debates debate transcripts (not available as a corpus yet), and audience votes on the debated statement both before and after the debate (to train the algorithm). Debates that have a declared winner (measured by the difference in audience votes) can be used to measure accuracy. Use of vocabulary, grammar, and positivity/negativity in language can be introduced as predictors.
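A small sketch of how a winner label could be derived from the before/after audience votes. It assumes the winner is the side whose vote share grew the most, which is one reading of "difference in audience votes"; the function name and the example percentages are made up for illustration.

```python
def debate_winner(pre_for, pre_against, post_for, post_against):
    """Label the side whose audience share grew the most (assumed rule)."""
    swing_for = post_for - pre_for
    swing_against = post_against - pre_against
    if swing_for > swing_against:
        return "for"
    if swing_against > swing_for:
        return "against"
    return "tie"


# Hypothetical vote shares in percent, before and after a debate.
label = debate_winner(pre_for=30, pre_against=40, post_for=45, post_against=42)
print(label)  # for
```

Note that under this rule a side can "win" while still holding a minority of the final vote, so the label measures persuasion rather than final support.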

yushiouwillylin commented 3 years ago

Question: Can we find patterns in the ideologies of different social science fields (e.g., Economics, Sociology, and Anthropology)?

Prediction: If we focus on the conservative/liberal division, then I guess Economics papers will lean conservative, and Sociology and Anthropology papers will lean liberal. From time series data, it might even be possible to see how ideology changes in later generations of researchers. For example, maybe "Capital in the Twenty-First Century" changed the ideology pattern of published papers in different fields.

Data: COCA and other academic papers online.

lilygrier commented 3 years ago

Question: Can we predict whether the text of a climate-related executive order is calling for additional regulation or a decrease in regulation?

Data: Federal Register of Executive Orders from Clinton to present, filtered to include only climate-related orders as included in this Climate Regulation Database. NOTE: Accessing the text of the orders requires a bit of web scraping, so it will likely not be suitable for this afternoon's class exercise.

RobertoBarrosoLuque commented 3 years ago

Question: Are there language/text characteristics that make political speeches more persuasive?

Data: Miller Center Presidential Speeches corpus
