JunsolKim opened this issue 2 years ago
Intuitions:
1) In my dataset of speeches given in the German parliament, I expect that, over time, members of parliament increasingly give short contributions that could go viral online rather than long, elaborate arguments.
2) I also expect members of the extreme-right AfD to be particularly affected by this trend, as they often share their speeches on their social media accounts in order to mobilize people behind their causes.
Data: For such an analysis, I would use the Open Discourse dataset containing all speeches given in the German parliament. The data can be found here and is accessible to anyone; a quick length-trend check is sketched below.
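A minimal sketch of that check, assuming a local CSV export of the Open Discourse speeches table; the filename and the "date"/"speechContent" columns are assumptions, not something stated in the post:

```python
# Hypothetical check: does mean speech length in the Bundestag fall over time?
import pandas as pd

speeches = pd.read_csv("speeches.csv", parse_dates=["date"])  # assumed export
speeches["n_tokens"] = speeches["speechContent"].str.split().str.len()

# Mean token count per year; a downward slope would support intuition 1.
print(speeches.groupby(speeches["date"].dt.year)["n_tokens"].mean())
```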
Intuitions:
1) Left- and right-leaning US media will use different words, attitudes, styles, and points of focus to describe Russia's invasion of Ukraine. *
2) The differences will shrink as time goes by. +
Data: Tweets or articles from different media outlets. I imagine that techniques similar to the masked language modeling and stereotypes paper we read this week can be employed; a minimal fill-mask sketch follows.
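One hedged sketch of that masked-language-modeling probe, with plain bert-base-uncased standing in for the (hypothetical) outlet-specific fine-tuned checkpoints you would actually compare:

```python
# Sketch: top masked-word predictions for a framing template. Fine-tune one
# checkpoint per outlet group and compare the resulting distributions.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Russia's invasion of Ukraine is a [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```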
Intuitions:
1) Fine-tuning BERT on individual time slices produces better-quality, time-aware dynamic word embeddings. +
2) Using such embeddings helps in analyzing the changing semantics of gendered insults. *
Data: a music-lyrics dataset! A sketch of the per-slice comparison appears below.
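A sketch of how the comparison could work, using plain bert-base-uncased and two contexts of one word; in the time-slice design you would instead hold the sentence fixed and swap in hypothetical per-decade fine-tuned checkpoints (e.g., one tuned on 1990s lyrics, one on 2010s lyrics):

```python
# Sketch: contextual embedding of a target word under a given checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def word_vec(model_name: str, sentence: str, word: str) -> torch.Tensor:
    model = AutoModel.from_pretrained(model_name)
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    # Locate the target word's position (works for single-token words).
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

# Two contexts under one model; for time slices, vary model_name instead.
v_literal = word_vec("bert-base-uncased", "the witch cast a spell in the story", "witch")
v_insult = word_vec("bert-base-uncased", "she called her boss a witch", "witch")
print(torch.cosine_similarity(v_literal, v_insult, dim=0).item())
```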
Intuitions:
Data: Trump tweet archive, stock price data
Intuitions:
Dataset: NOW corpus of Singaporean news (data available); parliament speeches are not available, though.
I plan on using BERT, following the approach from this week's paper on fine-tuning models for multilingual corpora; a minimal domain-tuning sketch follows.
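A sketch of that fine-tuning step with the masked-LM objective; the file now_sg.txt (one article per line) is a hypothetical export of the NOW Singapore slice:

```python
# Sketch: domain-tune multilingual BERT on raw news text.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

data = load_dataset("text", data_files={"train": "now_sg.txt"})
data = data.map(lambda x: tok(x["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=mlm,
    args=TrainingArguments(output_dir="mlm-sg", num_train_epochs=1),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```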
Intuitions:
Dataset: Environmental Magazine Corpus
Intuitions about short movie reviews (https://movie.douban.com):
1) The (perceived) gender of the commenter, judged from the username, influences the number of upvotes their movie comments receive. *
2) Reviews become more sentimentally polarized over time. +
Data: Douban movie reviews; more than 4M comments are available. A polarization-over-time sketch follows.
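One way to operationalize the second intuition: score each comment's sentiment and track within-year dispersion. The CSV layout ("year", "text") and the off-the-shelf multilingual checkpoint are assumptions for illustration:

```python
# Sketch: rising within-year sentiment dispersion as a polarization proxy.
import pandas as pd
from transformers import pipeline

clf = pipeline("sentiment-analysis",
               model="nlptown/bert-base-multilingual-uncased-sentiment")

reviews = pd.read_csv("douban_comments.csv")  # hypothetical export
preds = clf(reviews["text"].tolist(), truncation=True)
reviews["stars"] = [int(p["label"][0]) for p in preds]  # labels like "4 stars"

# Standard deviation of star ratings per year; growth supports intuition 2.
print(reviews.groupby("year")["stars"].std())
```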
Using BERT to analyze the contextual embeddings in COVID news and predicting masked words.
The dataset is the large COVID corpus that we now have; candidate words can be scored under the mask as sketched below.
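A sketch of scoring specific candidates under the mask; plain bert-base-uncased stands in for a checkpoint fine-tuned on the class COVID corpus, and the template and candidate words are illustrative:

```python
# Sketch: probability of chosen candidate words at the [MASK] position.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

enc = tok("The coronavirus is [MASK].", return_tensors="pt")
mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero().item()
with torch.no_grad():
    probs = mlm(**enc).logits[0, mask_pos].softmax(dim=-1)

for word in ["dangerous", "deadly", "spreading"]:
    print(word, round(probs[tok.convert_tokens_to_ids(word)].item(), 4))
```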
Intuitions:
Dataset: Amazon movie review data available here.
Intuitions:
Dataset: https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres
Intuitions:
Data: social media data from China
Intuitions:
Dataset: all Tweets containing #BLM and their comments.
Intuitions:
Dataset: freshly scraped headlines from Reddit's r/conspiracy
Intuitions:
1) Although most ethnic minorities in China are content with the preferential policies, in their daily lives they are more concerned with networking and job-seeking.
2) Compared to groups without religious beliefs, minority groups with religions more frequently talk about topics that are sensitive in China, such as terrorism, democracy, and independence.
Dataset: the discussions in seven social media communities hosted by seven ethnic minority groups on the most-used Chinese communication platform.
Intuitions:
Dataset: NOW corpora.
Intuitions:
Dataset: ConvoKit data and Reddit data scraped on altercations
Intuitions:
1) Uncertainty is grouped with macroeconomic cyclicality, as the two jointly affect management and analyst forecasts. *
2) Overall positive sentiment in value-relevant news would induce sell-side analysts to revise their earnings forecasts upward.
Uncertainty and market sentiment can be estimated with a BERT model trained on news from the concurrent period, with pre-training on all available historical data.
Data: ProQuest news data, plus analyst and firm-level data from I/B/E/S. A sentiment-scoring sketch follows.
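A sketch of the news-sentiment step, using the off-the-shelf ProsusAI/finbert checkpoint as an assumed stand-in for the pre-trained-from-history model the post proposes:

```python
# Sketch: classify headline sentiment before aggregating to firm-period.
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")
headlines = [
    "Company beats quarterly earnings expectations",
    "Regulators open probe into accounting practices",
]
for h, pred in zip(headlines, finbert(headlines)):
    print(f"{pred['label']:>8}  {pred['score']:.2f}  {h}")
```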
Intuitions:
1) The way of building relationships should differ substantially across distinct cultural settings; the topics of conversation in reality shows from different countries might reveal this difference.
2) The sentiment of the cast members should change over time, and it should become more explicit in their conversations.
Data: Netflix subtitles
Intuition:
Data: Incel.is forum posts/comments
Post your response to our challenge questions.
First, write down two intuitions you have about broad content patterns you will discover in your data as encoded within a pre-trained or fine-tuned deep contextual (e.g., BERT) embedding. These can be the same as those from last week, or they can evolve based on last week's explorations and the novel possibilities that emerge from dynamic, contextual embeddings; for example, they could be about text generation from a tuned model. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported).

Second, describe the dataset(s) you would like to fine-tune or embed within a pre-trained contextual embedding model to explore these intuitions. Note that this need not be a large text; you could simply encode a few texts in a pre-trained contextual embedding and explore their position relative to one another and the semantics of the model. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, or (d) an invitation for a TA to contact you about it. Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions.

(Then upvote the 5 most interesting, relevant and challenging challenge responses from others.)