Open JunsolKim opened 2 years ago
Three intuitions:
The dataset:
The dataset I will be drawing on is a (very large!) CSV file containing all speeches ever given in the German Bundestag. I will only analyze speeches given after German reunification in 1990. The data can be found here. I would also love to discuss this dataset with anyone who's interested in the project!
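A minimal sketch of the date filter described above, assuming the CSV has an ISO-formatted `date` column. The column names (`date`, `speaker`, `text`) and the sample rows are hypothetical stand-ins, not the real file's schema. Reading row by row avoids loading the whole file into memory.

```python
import csv
import io
from datetime import date

# Hypothetical column names and rows; the real Bundestag CSV may differ.
SAMPLE_CSV = """date,speaker,text
1989-11-09,Speaker A,Speech before reunification
1990-10-03,Speaker B,Speech on the day of reunification
1990-10-04,Speaker C,Speech after reunification
2005-06-01,Speaker D,Later speech
"""

# German reunification took effect on 3 October 1990.
REUNIFICATION = date(1990, 10, 3)


def speeches_after(rows, cutoff=REUNIFICATION):
    """Yield rows whose ISO-formatted date is strictly after the cutoff."""
    for row in rows:
        if date.fromisoformat(row["date"]) > cutoff:
            yield row


# For the real file, replace io.StringIO(...) with open(path, newline="").
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
kept = list(speeches_after(reader))
print([row["speaker"] for row in kept])
```

Whether the cutoff should include the day of reunification itself is a judgment call; the sketch excludes it.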
Intuitions:
Different genres of music will manifest themselves in an (LDA) topic model. Time-based trends should also be evident with unsupervised learning.* Gender-based trends with interaction effects with time will be evident in clustering methods.+
One caveat, though, is that my current dataset of 190,000+ popular music entries is not balanced across years and artists' (binary) gender. The data is available by request!
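One possible way to handle the imbalance mentioned above is to downsample every (year, gender) cell to the size of the smallest cell. This is only a sketch of one option, not necessarily what the project will do. The record layout `(year, artist_gender, title)` and the sample values are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical records: (year, artist_gender, title).
songs = [
    (1990, "F", "a"), (1990, "M", "b"), (1990, "M", "c"),
    (2000, "F", "d"), (2000, "F", "e"), (2000, "M", "f"), (2000, "M", "g"),
]


def balance_by_cell(records, seed=0):
    """Downsample each (year, gender) cell to the smallest cell's size."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[0], rec[1])].append(rec)
    smallest = min(len(members) for members in cells.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for key in sorted(cells):
        balanced.extend(rng.sample(cells[key], smallest))
    return balanced


balanced = balance_by_cell(songs)
print(len(balanced))
```

Downsampling discards data, so with very small cells it may be preferable to reweight or to merge years into wider bins instead.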
Intuitions:
1. Topics around physical appearance and caste will emerge in year-wise documents of matrimonial ads, and this has not changed over the years.*
Data can be provided upon request, but I will need to scrape advertisements for more years and compile them into a meaningful corpus for topic modelling/clustering.
The corpus I am interested in is the Coronavirus Corpus from https://www.english-corpora.org/corona/. It has detailed text records from articles and media sources that show how the language around the pandemic has evolved. With so many articles and so much text data, I suspect a topic model could reveal some very interesting insights about how topics have changed over the course of the pandemic.
Three intuitions:
Intuitions:
1) Infoarticles about race in Singapore will privilege concepts of harmony, with little about conflict or tension
2) There will be a noticeable absence of content related to the country's history with immigrants, with the articles instead privileging our colonial history*
3) A time period analysis should show that discourse related to our regional economic and social vulnerability will have been static over the years
Data can be provided upon request, but it has not yet been sorted by year and only has about 2,000 entries.
Intuitions:
Data: A corpus of >100,000 blog posts and articles sourced from Resilience.org, Grist.org, InsideClimateNews.org, and EMagazine.com. Can be made available!
Three intuitions:
The dataset: Perhaps social media text such as tweets, or newspapers serving different readerships (we've read a paper saying that newspapers are biased towards their major consumers).
Availability: I imagine such data is available but noisy. First, the researcher would need to translate different languages into English, and deal with potential frictions during translation. Second, if it is social media data like Twitter, we'll need another algorithm to identify those expressing political views. If it is newspaper text, then it's highly likely to be polluted by censorship.
Intuitions:
Dataset: I'm using the movies corpus available on Canvas.
Intuition:
Dataset: 2001 - 2021 congressional hearings from committees whose names contain 'science' or 'technology'. The data are already partially available on my local machines, accessed through the API at https://www.govinfo.gov/.
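The selection rule above could be sketched as a case-insensitive name match plus a year range. The record fields (`committee`, `year`) and the sample rows are hypothetical; they do not reflect the actual govinfo response format.

```python
import re

# Hypothetical hearing records; field names are illustrative only.
hearings = [
    {"committee": "Committee on Science, Space, and Technology", "year": 2015},
    {"committee": "Committee on Commerce, Science, and Transportation", "year": 2003},
    {"committee": "Committee on Armed Services", "year": 2010},
    {"committee": "Committee on Science", "year": 1999},
]

# Case-insensitive substring match on either keyword.
KEYWORDS = re.compile(r"science|technology", re.IGNORECASE)


def select_hearings(records, first_year=2001, last_year=2021):
    """Keep hearings in the year range whose committee name has a keyword."""
    return [
        rec for rec in records
        if first_year <= rec["year"] <= last_year
        and KEYWORDS.search(rec["committee"])
    ]


selected = select_hearings(hearings)
print([rec["committee"] for rec in selected])
```

A plain substring match would also catch unrelated words that happen to contain a keyword; wrapping the pattern in `\b` word boundaries is one way to tighten it.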
Intuition:
Dataset: Public tweets of political celebrities. Will be available after I get Twitter API access. I welcome anyone who is interested to talk about this project.
Intuitions:
Dataset: the dataset is not available. It can be very large, both cross-sectionally and longitudinally. Chinese social media companies also have policies that might forbid using their data to conduct social science or political science research.
Intuitions:
Dataset: discussions in seven social media communities hosted by seven ethnic minority groups on the most used Chinese communication platform.
Intuitions:
Dataset: not available. Should be all Twitter data with a specific hashtag like #BLM.
Intuition:
Data: web-scraped from Sina Weibo and freeweibo.
Intuition:
Different groups of homogeneous goods on e-commerce platforms can be detected by their names, descriptions, and reviews. *Different groups of homogeneous goods on e-commerce platforms can be detected by their images. +There is hidden dependence behind the reviews (e.g. people are incentivized by the sellers to provide reviews) and there might be ways to detect it.
Dataset: Data can be acquired by scraping. I have not scraped yet, but have code that can be easily modified to complete the task. Happy to discuss and share upon request.
Intuition: Negative financial news tends to attract more investor attention.+ Investors tend to consume more firm-level news than macro news.* Negative financial news tends to travel faster and deeper among social networks.+
Data: Possibly Twitter data through scraping. Happy to discuss and take any suggestions.
Intuition: 1. startups' pitches become more concrete over time because early-stage firms learn to adapt to market and investor expectations; 2. visionary and novel pitches by influential founders are more likely to be accepted by investors and lead to financial returns; 3. startups' pitches focus on specific products and services and sell new concepts over time.
Data: Crunchbase data, companies' self-descriptions. Happy to share data upon request.
Intuition: *Uncertainty is grouped with macroeconomic cyclicality, as they jointly affect management and analyst forecasts. +Non-financial news can be grouped by market sentiment, as there should be implied chains of impact through which different companies connect and influence relevant parties. Financial news can be grouped by its relevance to market expectations, i.e., whether it is relevant to increasing/decreasing expectations of the corresponding firm's performance.
Data: https://www.english-corpora.org/now/; I would love to talk about the pattern exploration with TAs and anyone interested!
Intuition:
Data: Need to find a way to get the Facebook data. Twitter data alone can represent only the part of the content that Taiwanese people want to display to the world.
Intuitions about movie reviews:
During the COVID pandemic, there are more reviews, and reviews become longer, compared to the same time periods in past years.*
During the COVID pandemic, topics of reviews shift toward the self-oriented side compared to the same time periods in past years.
During the COVID pandemic, reviews are more polarized compared to the same time periods in past years.+
Douban movie reviews (currently unavailable due to anti-crawler issues): https://movie.douban.com
Below is a corpus covering more than 745 of 1,001 public domains, with a scope of over 9.4 million articles. I would love to discuss this with a TA if anyone is interested. https://github.com/several27/FakeNewsCorpus
Intuition: 1. Responses to public policies could be inferred from citizens' social media activities. 2. People become more explicit on social media about their consumption behaviors.
I will mainly seek text data from Twitter and am still in the process of obtaining Twitter API access. I will look for tweets containing specific terms about the COVID-19 financial stimulus.
Intuition:
Data: Previously scraped incel forums. The data is still rough and unusable, and I am working on refining it. Available on request.
Data is from the NOW corpus, https://www.english-corpora.org/now/, from 2017 to 2023.
Post your response to our challenge questions.
First, write down three intuitions you have about broad content patterns you will discover in your data. Place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Second, describe the dataset(s) on which you will build an unsupervised model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, or (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise unsupervised strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).