JunsolKim opened 2 years ago
Two intuitions:
The semantic networks in speeches given by East German members of the Bundestag differ significantly from those given by West German members of the Bundestag
*Over time (post-1990), this difference will become less noticeable within parties, but East German and right-wing extremist parties will occupy an increasingly distinguishable semantic space.
Data:
The dataset I will be drawing on is the first corpus containing speeches given in the German Bundestag, spotty before 1990 but comprehensive after 1990. I will only analyze those speeches given after the German reunification in 1990. The data can be found here. I would also love to discuss this dataset with anyone who's interested in the project!
Intuitions:
Computing word embeddings for a corpus of popular music lyrics from the 1970s-2010s will allow me to analyze shifts in the connotations of gender slurs. I expect to see such words trend toward neutrality, or even positivity. However, I expect hegemonic gender associations to persist over time.
Data: Billboard and Spotify data available on request.
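One way the connotation drift above could be operationalized is by projecting a target word's decade-specific embedding onto a positive-negative sentiment axis built from seed words. This is only a minimal sketch: the 3-d vectors, seed lists, and target vectors below are all invented stand-ins for decade-wise word2vec models trained on the real lyric corpus.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def connotation_score(word_vec, pos_vecs, neg_vecs):
    """Mean cosine to positive seed words minus mean cosine to negative seeds."""
    pos = np.mean([cosine(word_vec, p) for p in pos_vecs])
    neg = np.mean([cosine(word_vec, n) for n in neg_vecs])
    return pos - neg

# Toy 3-d "embeddings" standing in for decade-specific models (invented).
pos_seeds = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])]
neg_seeds = [np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.9, 0.0])]

target_1970s = np.array([0.1, 0.9, 0.1])  # near the negative pole
target_2010s = np.array([0.7, 0.4, 0.1])  # drifted toward the positive pole

print(connotation_score(target_1970s, pos_seeds, neg_seeds))  # negative score
print(connotation_score(target_2010s, pos_seeds, neg_seeds))  # positive score
```

With real data, each decade's vectors would come from a separate embedding model (or an aligned joint model), and the seed lists would be validated sentiment lexica rather than two toy words per pole.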
Intuitions:
1. The similarity or relatedness between news reports in two different countries can be used to identify big events. The intuition is that news media in different countries care about different problems; when they start reporting on the same issue, it must be very important and have global impact.
2. The auto-similarity of news media over time can be used to track recurring topics in history.
Data: (d) Dear TA: I'm a little bit attracted by my ideas the moment I wrote them down. Do you think they are interesting and novel enough to work on? If they are, is there any data source you would recommend for me to explore? Thank you!
You might find the GDELT project interesting! https://www.gdeltproject.org/
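The first intuition in that response (cross-country similarity flags big events) could be sketched as a daily cosine similarity between the two countries' news text, with spikes above a threshold marking candidate shared events. A minimal sketch with plain term-frequency vectors; the headline digests and the threshold value are invented for illustration.

```python
from collections import Counter
import math

def tf_cosine(doc_a, doc_b):
    """Cosine similarity between simple term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical daily headline digests from two countries (invented).
days = [
    ("local election results announced", "regional harvest festival opens"),
    ("global pandemic lockdown declared",
     "global pandemic lockdown declared nationwide"),
]

THRESHOLD = 0.5  # assumption: would need tuning on real data
for day_a, day_b in days:
    sim = tf_cosine(day_a, day_b)
    if sim > THRESHOLD:
        print("candidate shared event:", day_a, "| sim =", round(sim, 2))
```

On real GDELT-scale data one would swap term frequencies for TF-IDF or document embeddings and translate or align vocabularies across languages, but the spike-detection logic stays the same.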
Intuition:
Intuition:
Data: web-scraped from Sina Weibo and freeweibo.
Intuitions:
1. Topics around physical appearance and caste will emerge in year-wise documents of matrimonial ads, and this has not changed over the years.*
Data can be provided upon request, but I will need to scrape advertisements for more years and compile them into a meaningful corpus.
Intuitions:
Data:
I'll welcome anyone who's willing to discuss this project with me.
I stick to the same intuitions from last week, but using different datasets.
Intuition: *Uncertainty is grouped with macroeconomic cyclicality, as they jointly affect management and analyst forecasts. +Non-financial news can be grouped by market sentiment, as there should be implied chains of impact through which different companies connect and influence relevant parties. Financial news can be grouped by its relevance to market expectations, i.e., whether it is relevant to increasing or decreasing expectations of the corresponding firm's performance.
Financial news data: Dow Jones Newswires; analyst-level data: I/B/E/S.
I'd love to talk about the model building with TAs and anyone interested!
Intuitions:
Data Source: Country Reports on Human Rights Practices (U.S. Department of State) (https://www.state.gov/reports/2020-country-reports-on-human-rights-practices/); Universal Periodic Review Dataset.
Intuition:
1) Topics about race are hardly mentioned in Singaporean newspapers and, when they are, appear only in a neutral or positive context* 2) Racism will only be addressed distantly
Newspaper data available upon request!
Two intuitions:
Word embeddings trained on a corpus of online articles and blog posts written by a diverse range of people will differ significantly from word embeddings trained on a corpus of peer-reviewed academic journal articles. *
Semantic change will be detected more strongly in a corpus of environmental blogs and articles than in a corpus of environmental science academic literature. +
The environmental dataset is the same one we used last week. I don't have the academic dataset fully constructed yet.
Intuitions:
Intuitions:
Dataset: not yet available. It should be all Twitter data with a specific hashtag like #BLM. I am currently working out how to get historical Twitter data (the official API requires a token that is hard to obtain).
Intuitions:
Data: the Movies corpus available on Canvas.
Intuition: Negative financial news tends to attract more investor attention.+ Investors tend to consume more firm-level news than macro news.* Negative financial news tends to travel faster and deeper through social networks.
Data: possibly Twitter data through scraping. Happy to discuss and take any suggestions.
Intuitions:
Dataset: discussions from seven social media communities hosted by seven ethnic minority groups on the most-used Chinese communication platform. An example looks like this.
Intuitions: *Fake news articles use simpler yet stronger adjectives, gearing toward a broader sentimental audience base with or without backgrounds in higher education. +Fake news articles are about politics and national security rather than soft news and feature stories. Fake news stories craft their own "dumbed-down" words to break down opaque political terms for their audiences, despite being fake in content per se.
Dataset: the fake news corpus I proposed to use last week.
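The "simpler yet stronger" claim above suggests comparing lexical-complexity proxies between fake and mainstream articles. A minimal sketch: the two snippets below are invented examples, and the two proxies (mean word length, type-token ratio) are crude stand-ins for proper readability measures such as Flesch-Kincaid.

```python
def lexical_profile(text):
    """Two crude complexity proxies: mean word length and type-token ratio."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    mean_len = sum(len(w) for w in words) / len(words)
    ttr = len(set(words)) / len(words)  # type-token ratio: vocabulary variety
    return mean_len, ttr

# Hypothetical snippets, invented purely for illustration.
fake = "Shock! Huge bad deal! Huge bad lies! They hid the big bad truth!"
mainstream = ("Legislators negotiated amendments addressing "
              "constitutional jurisdiction concerns.")

fake_len, fake_ttr = lexical_profile(fake)
main_len, main_ttr = lexical_profile(mainstream)
print(round(fake_len, 2), round(main_len, 2))  # fake uses shorter words here
```

The intuition predicts fake articles score lower on mean word length and lean on a smaller, repeated vocabulary; whether that holds in the actual corpus is exactly what the test would show.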
Intuitions about short movie reviews (https://movie.douban.com):
Review upvotes can be predicted by post time and content (sentiment, relatedness to the movie, etc.). *
Reviews become more polarized and self-centered over time. +
Data: Douban movie reviews (I found a scraper script on GitHub: https://github.com/csuldw/AntSpider).
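The upvote-prediction intuition above amounts to a regression of upvotes on post time and content features. A minimal sketch via ordinary least squares; every feature value and upvote count below is invented, and real review content would of course yield many more features than one sentiment score.

```python
import numpy as np

# Toy features per review: [hours since release, sentiment in -1..1] (invented).
X = np.array([
    [1.0,  0.9],
    [2.0,  0.5],
    [30.0, -0.8],
    [48.0, -0.2],
    [5.0,  0.7],
])
y = np.array([120.0, 80.0, 5.0, 10.0, 60.0])  # invented upvote counts

X1 = np.column_stack([np.ones(len(X)), X])    # prepend an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None) # OLS fit

def predict(hours, sentiment):
    return coef[0] + coef[1] * hours + coef[2] * sentiment

early_positive = predict(1.0, 0.9)   # posted early, positive sentiment
late_negative = predict(40.0, -0.5)  # posted late, negative sentiment
print(early_positive > late_negative)
```

On scraped Douban data, upvote counts are heavily skewed, so a count model (e.g. negative binomial) would likely fit better than plain OLS; the sketch only illustrates the feature-to-upvotes setup.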
Intuitions:
Intuition:
Data: NOW corpora. (Still working on getting an optimal subsample).
Intuition: I will employ semantic analysis on scripts of a reality show, comparing one from the United States with one from Japan. This would allow me to reveal cultural differences in how people build relationships with each other (somehow) in reality.
Data: from Netflix's "Terrace House"
Intuition:
Data: Currently still scraping, transcribing and translating speeches of various political figures
Two intuitions:
- *Covid information keywords will have different contextual embeddings later in the pandemic.
- +Vaccines and booster shots have different contextual embeddings associated with them, for better or worse.
The corpora I am interested in is on the Coronavirus from https://www.english-corpora.org/corona/. It has detailed text records from articles and media sources that show how the language around the pandemic has evolved and changed over the course of the pandemic. With so many articles and text data, I suspect a word embedding model could reveal some very interesting insights about how words have changed over the course of the pandemic.
Similar to @Halifaxi, I'm interested in the changing contextual embeddings of covid terms over time. I would focus more on fake news during the pandemic. The intuitions are:
Dataset: a Kaggle dataset that collects covid-related fake news articles and posts. I'd like to discuss with anyone interested!
Words that changed the most from the cleaned telehealth corpus from 2017 to 2023:
Data: purchased NOW data from 2017 to 2023.
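A standard way to rank "words that changed the most" between two yearly embedding spaces is orthogonal Procrustes alignment followed by per-word cosine distance. A minimal sketch: the 2-d matrices below are invented stand-ins for the 2017 and 2023 embedding models, and the word list is hypothetical.

```python
import numpy as np

def procrustes_align(A, B):
    """Rotation R minimizing ||A @ R - B||_F (orthogonal Procrustes, via SVD)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def change_scores(A, B, words):
    """Cosine distance per word after aligning space A onto space B."""
    aligned = A @ procrustes_align(A, B)
    scores = {}
    for i, w in enumerate(words):
        u, v = aligned[i], B[i]
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        scores[w] = 1.0 - cos
    return scores

# Toy spaces: "visit" and "doctor" keep their positions, "telehealth" moves.
words = ["visit", "doctor", "telehealth"]
A = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # stand-in 2017 vectors
B = np.array([[1.0, 0.0], [0.9, 0.1], [0.7, 0.7]])  # stand-in 2023 vectors

scores = change_scores(A, B, words)
most_changed = max(scores, key=scores.get)
print(most_changed)
```

Alignment matters because two independently trained embedding models differ by an arbitrary rotation; without it, raw vector comparisons across years are meaningless.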
Post your response to our challenge questions.
First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction--to condition your surprise.

Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions.

(Then upvote the 5 most interesting, relevant and challenging challenge responses from others).