UChicago-Computational-Content-Analysis / Readings-Responses-2023


7. Accounting for Context - challenge #16

JunsolKim opened 2 years ago

JunsolKim commented 2 years ago

Post your response to our challenge questions.

First, write down two intuitions you have about broad content patterns you will discover about your data as encoded within a pre-trained or fine-tuned deep contextual (e.g., BERT) embedding. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from dynamic, contextual embeddings--e.g., they could be about text generation from a tuned model. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported).

Second, describe the dataset(s) you would like to fine-tune or embed within a pre-trained contextual embedding model to explore these intuitions. Note that this need not be large text--you could simply encode a few texts in a pretrained contextual embedding and explore their position relative to one another and the semantics of the model. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, or (d) an invitation for a TA to contact you about it.

Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others.)
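For the "encode a few texts and explore their positions" option, here is a minimal sketch. It assumes the Hugging Face `transformers` and `torch` packages are installed (the model, roughly 400 MB, is downloaded on first use); the model name and the mean-pooling choice are illustrative, not prescribed by the assignment.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_texts(texts, model_name="bert-base-uncased"):
    """Embed each text by mean-pooling BERT's last hidden layer over real tokens."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    with torch.no_grad():
        enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state        # (batch, seq_len, 768)
        mask = enc["attention_mask"].unsqueeze(-1)     # zero out padding positions
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens only
    return pooled.numpy()

# Example (requires the model download):
# v = embed_texts(["Lockdowns saved lives.", "Lockdowns destroyed livelihoods."])
# print(cosine(v[0], v[1]))
```

Comparing pairwise cosine similarities among a handful of encoded texts is already enough to probe where they sit relative to one another in the model's semantic space.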

konratp commented 2 years ago

Intuitions:

1.) In my dataset of speeches given in the German parliament, I expect to see an increase over time in short contributions from members of parliament that could go viral online, rather than long, elaborate arguments.

2.) I also expect members of the extreme-right AfD to be particularly affected by the trend shown above, as they often share their speeches on social media accounts in order to mobilize people behind their causes.

Data: For such an analysis, I would use the Open Discourse dataset containing all speeches given in the German parliament. The data can be found here and is accessible to anyone.

Qiuyu-Li commented 2 years ago

Intuitions:

1*) Left- and right-leaning US media will use different words, attitudes, styles, and focal aspects to describe Russia's invasion of Ukraine.

2+) These differences will shrink as time goes by.

Data: Tweets or articles from different media. I imagine techniques similar to those in the masked language modeling and stereotypes paper we read this week could be employed.

Jasmine97Huang commented 2 years ago

Intuitions:

1) Fine-tuning BERT on individual time slices produces better-quality, time-aware dynamic word embeddings. +

2) Such embeddings help in analyzing the changing semantics of gendered insults. *

Data: music lyric dataset!

Sirius2713 commented 2 years ago

Intuitions:

  1. The sentiment of Trump's name-calling tweets about certain listed companies imposes a negative impact on those companies.
  2. Positive tweets have weaker effects than negative name-calling tweets.

Data: Trump tweet archive, stock price data

ValAlvernUChic commented 2 years ago

Intuitions:

  1. Mentions of FDWs in Singaporean Newspapers will be contextually tied to economic utility *
  2. These mentions would align with contemporaneous speeches from the Singaporean government about immigration and FDWs +

Dataset: NOW corpus of Singaporean news (data available) - Parliament speeches not available though

pranathiiyer commented 2 years ago

I plan on using BERT, following the approach in this week's paper on fine-tuning models for multilingual corpora. Intuitions:

  1. The mention of upper castes in matrimonial ads is related to individuals' physical attributes and has not changed over time.*
  2. This association has changed over time.+

Dataset: ads from the Indian newspaper The Tribune. Data partly scraped.

mikepackard415 commented 2 years ago
  1. BERT may be able to detect that terms like "energy" and "environment" have radically different meanings depending on their immediate context. +
  2. The contextual embeddings will shift over time relating to the news of the day. *

Dataset: Environmental Magazine Corpus
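One way to probe whether a term like "energy" carries radically different meanings in different contexts is to pull out its contextual vector in two sentences and compare them. A minimal sketch, assuming the `transformers` and `torch` packages are available (model downloaded on first use); the sentences and model name are illustrative:

```python
def first_token_index(tokens, word):
    """Index of the first token matching `word` (bert-base-uncased lowercases input)."""
    return tokens.index(word.lower())

def word_vector(sentence, word, model_name="bert-base-uncased"):
    """Contextual embedding of `word`'s first subtoken in `sentence`."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[first_token_index(tokens, word)].numpy()

# Example (requires the model download); if "energy" behaves polysemously, the
# cosine similarity between the two vectors should be noticeably below 1:
# v1 = word_vector("The energy bill raised electricity prices.", "energy")
# v2 = word_vector("She spoke with great energy and passion.", "energy")
```

Averaging such token vectors per year would be one simple way to track the drift of a term's context over time.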

Hongkai040 commented 2 years ago

Intuitions about short movie reviews (https://movie.douban.com):

1) The (perceived) gender of the commenter (based on judgments of the username) influences the number of upvotes they receive for their movie comments. *

2) Reviews become more sentimentally polarized over time.+

Data: Douban movie reviews; more than 4M comments available.

GabeNicholson commented 2 years ago

Using BERT to analyze the contextual embeddings in Covid news and to predict masked words.

  1. "Lockdowns are [MASK]" will have a different meaning in 2020 compared to 2022. *
  2. "Covid is [MASK]" will change depending on the month and year, with fills such as "over", "fake", or "deadly". Also, it will be interesting to see whether the news changes its general sentiment as case levels drop, or continuously focuses on doom and gloom; as the adage goes, "if it bleeds, it leads".

The dataset is the Large Covid Corpora that we now have.
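Templates like these can be queried directly through the fill-mask interface. A minimal sketch, assuming the Hugging Face `transformers` package is installed (the pipeline downloads a model on first use); the model name is an assumption:

```python
def has_mask(template, mask_token="[MASK]"):
    """Sanity-check that a template actually contains the mask slot."""
    return mask_token in template

def mask_predictions(template, model_name="bert-base-uncased", top_k=5):
    """Return BERT's top-k (token, score) fills for the [MASK] slot in `template`."""
    from transformers import pipeline

    assert has_mask(template), "template needs a [MASK] token"
    fill = pipeline("fill-mask", model=model_name)
    return [(c["token_str"], round(c["score"], 3)) for c in fill(template, top_k=top_k)]

# To compare periods, fine-tune (or continue pretraining) one copy of the model
# on each period's news, then run the same template through every copy:
# mask_predictions("Lockdowns are [MASK].")
```

The year-over-year comparison then reduces to comparing the ranked fill lists (or their scores) produced by the differently tuned model copies.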

Jiayu-Kang commented 2 years ago

Intuitions:

  1. The polarity of the review text affects the audience-rated helpfulness score.
  2. Reviews on movies from certain genres are more likely to use subjective adjectives.

Dataset: Amazon movie review data available here.

hshi420 commented 2 years ago

Intuitions:

  1. Rock song lyrics and Hip Hop song lyrics can be classified by BERT.*
  2. There is no difference between Rock song lyrics and Hip Hop song lyrics from a computational language model's perspective.+

Dataset: https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres

sizhenf commented 2 years ago

Intuitions:

  1. Censored and uncensored posts on the same topic may differ in content.
  2. Some topics are more sensitive than others.

data: social media data from China

LuZhang0128 commented 2 years ago

Intuitions:

  1. Smaller corporations post more extreme tweets containing #BLM to get more public attention.
  2. The difference between small and big corporations becomes smaller over time.

Dataset: all Tweets containing #BLM and their comments.

kelseywu99 commented 2 years ago

Intuitions:

  1. Conspiracy posts in r/conspiracy use negative nouns/adjectives to get a higher click-through rate.
  2. Longer text-only posts receive more comments than image-only posts in r/conspiracy.

dataset: freshly scraped headlines from reddit's r/conspiracy

NaiyuJ commented 2 years ago

Whereas most ethnic minorities in China are content with the preferential policies, in their daily lives they are more concerned about networking and job-seeking. Compared to groups without religious beliefs, minority groups with religions more frequently talk about sensitive topics in China, such as terrorism, democracy, and independence.

Dataset: the discussions in seven social media communities hosted by seven ethnic minority groups on the most-used Chinese communication platform.

facundosuenzo commented 2 years ago

Intuitions:

  1. The relationship between technology and the future will change over time, given the particular context in which these articles appeared.
  2. The Cambridge Analytica scandal affected the word embeddings with which different technologies, like social media platforms, are associated (this could be an instance for a natural experiment and a diff-in-diff).

Dataset: NOW corpora.

sudhamshow commented 2 years ago

Intuitions:

  1. Quote-response1-response2 predictions for masked words are going to give drastically different results based on the activity/encouragement (conditioned on likes and comments) on response1 and response2. *
  2. A response (response2) of aggression/sarcasm usually ensues if there is higher activity on the original quote.

Dataset: ConvoKit data and Reddit data scraped for altercations

isaduan commented 2 years ago

Intuitions:

  1. When projected into larger language models, the conception of democracy in authoritarian regimes is narrower than that in democratic regimes.
  2. Changes in the conception of democracy correlate with changes in the conceptions of elites and economic inequality.

Dataset: Google Books n-gram data between 1990 and 2010

chentian418 commented 2 years ago

Intuitions:

  1. Uncertainty is grouped with macroeconomic cyclicality, as they jointly affect management and analyst forecasts.*

  2. Overall positive sentiment of value-relevant news would induce sell-side analysts to revise the earnings forecast upward.

Uncertainty and market sentiment can be estimated with a BERT model on the concurrent period of news, pre-trained on all available historical data.

Data: Proquest news data and analyst and firm-level data from I/B/E/S

Emily-fyeh commented 2 years ago

Intuitions:

  1. Taiwanese identity rises when citizens sense threats to their state or have a recognizable enemy to blame. For example, in the Ukrainian crisis, Taiwanese identity concepts became more consolidated amid discussion of China's possible subsequent moves.
  2. Such changes in identity can be captured by training BERT on the latest Twitter data mentioning Ukraine and/or Taiwan.

Data: Twitter data

ttsujikawa commented 2 years ago

Intuitions:

  1. The way of building relationships should be largely different across distinct cultural settings. Topics of conversation in reality shows from different countries might reveal this difference.
  2. The sentiment of the cast in a show should change over time, and it should become more explicit in their conversations.

Data: Netflix subtitles

ZacharyHinds commented 2 years ago

Intuitions:

  1. Incel forum users who engage with the incel identity the most in their posts (such as through using slang) will receive the most engagement *
  2. A specific subset of Incel forum users strongly influences the sentiment and incel identity of the collective forum +

Data: Incel.is forum posts/comments