Open jamesallenevans opened 3 years ago
Intuitions: (1) + People's financial concerns are relatively stable over time from 2015 to 2019. (2) Top words in different clusters may overlap. (3) * Housing, car, retirement, student loans, and tax are the top five topics.
Data: Posts from the subreddit Personal Finance, as 5 csv files of 1000 posts per year from 2015 to 2019: Download Here
People speaking about Latin American politicians that ran for president (2005-2015):
Corpus del Español: This corpus contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries. It was web-scraped in 2015.
Class dataset: Corpus del Español ("SPAN").
Comparison of Airbnb reviews from different places
Dataset/Download: Inside Airbnb (http://insideairbnb.com/get-the-data.html)
Dataset: Twitter 'likes' of persons with differing self-identified personalities.
Data TAs please note that the data might have to be reduced in size (while keeping the share of 'type' proportional) in order to be workable for quick analysis.
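One way the proportional reduction mentioned above could be done is a stratified downsample: sample the same fraction from each 'type' group so the shares stay roughly the same. This is only a sketch. The row layout, the `text` field, the labels "A" and "B", and the function name `stratified_downsample` are all assumptions for illustration, not the actual format of the Twitter data.

```python
import random
from collections import Counter, defaultdict


def stratified_downsample(rows, key, fraction, seed=0):
    """Keep about `fraction` of the rows from every group defined by `key`.

    Hypothetical helper: each group contributes at least one row, so very
    small groups are slightly over-represented rather than dropped.
    """
    rng = random.Random(seed)  # fixed seed so the reduction is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy data (made up): 60 rows of type "A" and 40 rows of type "B".
rows = [{"type": "A", "text": f"like {i}"} for i in range(60)]
rows += [{"type": "B", "text": f"like {i}"} for i in range(40)]

small = stratified_downsample(rows, key="type", fraction=0.1)
shares = Counter(row["type"] for row in small)
print(len(small), dict(shares))
```

Sampling within each group, rather than from the whole table at once, is what keeps the 60/40 split intact in the reduced set.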
Subject: Topics and rhetoric change in the Marx-Engels Collected Works (MECW), 1835-1895 Intuitions:
Data: Link to a scraped copy of the MECW, Vol 1-49 (386 MB, saved as a .csv). Documents are saved in a single data table, organized by "document" number (i.e. volume number of the MECW) and "subdocument" (i.e. individual texts, letters, chapters, etc. published within each volume). Data needs tokenization, mild cleaning, etc. A key to understanding what's in each volume can be accessed here.
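A rough sketch of what the tokenization and mild cleaning might look like for a table keyed by "document" and "subdocument". The `text` column name, the toy rows, and the helper names are assumptions, since the actual column layout of the scraped .csv is not described here.

```python
import csv
import io
import re
from collections import defaultdict

# Runs of letters, optionally with an internal apostrophe ("don't").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def tokenize(text):
    """Lowercase the text and keep only word-like tokens (drops digits and punctuation)."""
    return TOKEN_RE.findall(text.lower())


def tokens_by_subdocument(csv_file, text_col="text"):
    """Stream rows and collect tokens per (document, subdocument) pair.

    `text_col` is a hypothetical column name; adjust to the real header.
    """
    out = defaultdict(list)
    for row in csv.DictReader(csv_file):
        out[(row["document"], row["subdocument"])].extend(tokenize(row[text_col]))
    return dict(out)


# Toy stand-in for the real file (made-up rows).
sample = io.StringIO(
    "document,subdocument,text\n"
    '1,1,"A letter, dated 1844."\n'
    '1,2,"Chapter One: Wages"\n'
)
result = tokens_by_subdocument(sample)
print(result)
```

With the real file, the `io.StringIO` stand-in would be replaced by `open(path, newline="", encoding="utf-8")`. Very long texts may exceed the `csv` module's default field size limit, which `csv.field_size_limit` can raise.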
Intuitions
Dataset: News on the Web (NOW) Davies corpus. There are at least 70,000 articles from 2010 to 2020 that include "artificial intelligence."
Intuitions: (1) * Film critics speak differently from the general audience. (2) The cluster of speech patterns of film critics in, e.g., 2015 would be more similar to that of the general comments in 2016. (3) + The cluster of speech patterns of the general audience in, e.g., 2015 would be more similar to that of the film critics in 2016.
Data: Rotten Tomatoes critics reviews (https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset) Amazon Movie & TV reviews (need to filter TV) (https://nijianmo.github.io/amazon/index.html)
* The understanding of immigrants' benefit differs between academia and the general public. + We may see a convergence in opinion during the Trump period. The difference in understanding might be quite large at all times.
Data: Maybe not available to construct a comparable opinion set.
Data: Music lyrics dataset. https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres
Intuitions: (1) The lyrics cluster by genre (2) The lyrics cluster by artist (3) The lyrics cluster by album
Intuitions: *1. Feminism has always been the theme of Gilmore Girls.
Dataset: http://www.crazy-internet-people.com/site/gilmoregirls/scripts.html
Intuitions
Intuitions: I expect more romantic topics in fanfiction than in source material.* I expect thematic material will be more consistent over time than the sources the fanfiction is based upon. I expect clusters to be similar across fanfiction stories for different shows.+
Data: Davies TV Corpus and fanfiction scraped using AO3 scraper script (https://github.com/radiolarian/AO3Scraper)
Intuitions:
I didn't collect the data on this because this is unrelated to my project, but it can be scraped from Munk Debates and Intelligence Squared websites.
Intuitions:
Data: Billsum dataset. Use the us_train_data_final_offical json file in the drive. NOTE: This dataset does not include information about whether the bills passed or not, so that would have to be verified with outside data.
First, write down three intuitions you have about broad content patterns you will discover in your data. Place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Second, describe the dataset(s) on which you will build an unsupervised model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, or (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise unsupervised strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).