Open JunsolKim opened 2 years ago
In parliamentary debates, the interactions between speakers and the members of parliament in the audience who interject are interesting conversations to analyze. Interjections (Zwischenrufe) in the German Bundestag are often used by members of parliament to express approval or disapproval of what the speaker is saying. These interjections are captured by scribes and are part of the official record.
Two hunches:
Dataset: The dataset I'm using comes from the Open Discourse project and can be found here. It includes all speeches given in the German Bundestag since 1990.
Since my main corpus is the COVID dataset, it would be interesting to pair it with YouTube comments on COVID-related news videos.
Two hunches:
The YouTube comment sections would need to be scraped systematically, following the same YouTube channels and videos over the three periods (2020, 2021, 2022).
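One way to keep the three periods comparable is to filter the scraped comments down to a fixed set of channels and bucket them by year. This is a minimal sketch, assuming the comments have already been scraped into dictionaries; the field names (`channel`, `published_at`, `text`), the channel names, and the helper `group_comments_by_period` are all hypothetical and not part of any YouTube API.

```python
from collections import defaultdict
from datetime import datetime

# The three periods named above.
PERIODS = (2020, 2021, 2022)


def group_comments_by_period(comments, channels):
    """Bucket comment texts by (channel, year), keeping only tracked channels.

    `comments` is an iterable of dicts with the hypothetical keys
    'channel', 'published_at' (ISO 8601 string) and 'text'.
    """
    grouped = defaultdict(list)
    for comment in comments:
        year = datetime.fromisoformat(comment["published_at"]).year
        if comment["channel"] in channels and year in PERIODS:
            grouped[(comment["channel"], year)].append(comment["text"])
    return dict(grouped)


# Made-up records, for illustration only.
sample_comments = [
    {"channel": "NewsA", "published_at": "2020-03-15T12:00:00", "text": "first"},
    {"channel": "NewsA", "published_at": "2021-07-01T08:30:00", "text": "second"},
    {"channel": "NewsB", "published_at": "2020-05-20T19:45:00", "text": "third"},
    {"channel": "NewsA", "published_at": "2019-12-31T23:59:00", "text": "too early"},
]
grouped_comments = group_comments_by_period(sample_comments, channels={"NewsA"})
```

Fixing the channel set up front means that differences between years are less likely to come from a changing mix of sources.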
I'm thinking of using data from the r/selfimprovement subreddit, which provides question-and-answer style comment data that one can use to analyze advice-giving. The conversation is typically between the asker and the answerer, but can also include other Reddit users joining to debate the usefulness of the advice given.
My hunches:
I plan to use COVID data for the final project. It'll be interesting to pair it with financial statements from major listed companies.
Two hunches:
The financial statement data can be found on the investor relations sites of listed companies or on the SEC website.
My final corpus is music lyrics. The conversations that would be interesting are the lyrics by group artists as opposed to solo artists. My hunches are:
Data available by request.
My project is on Amazon movie reviews, but I'm also curious about what changes or new patterns emerge when users can comment on and react to what others have said. Such conversations can be found in the r/Movies or r/TrueFilm subreddits, where reviews and opinions on movies are presented as question-and-answer threads or discussions. Hunches:
On the Deception in Diplomacy dataset: https://convokit.cornell.edu/documentation/diplomacy.html
In the hate speech dataset: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech
The data I use here are social media posts that I scraped from seven different Chinese online communities, where ethnic minorities can post about anything that has happened in their lives or that interests them and can start discussions. The dataset contains about 70 thousand posts in total, and I will use it for my final assignment. It contains many conversations among ethnic minority groups.
I would like to perform an exploratory analysis of last year's GameStop short squeeze using YouTube and Reddit data.
I would like to study a contagion model based on Twitter data containing the hashtag #BLM. There are two types of conversation: 1) retweets and 2) comments. My two hunches are:
I will regard each user as a participant in the conversation, and retweets/comments as the conversation itself.
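The construction above can be sketched as a weighted, directed interaction graph. This is only an illustration under assumptions: each record is taken to be a `(source_user, target_user, kind)` tuple, and both the record layout and the helper `build_interaction_graph` are hypothetical rather than anything Twitter provides.

```python
from collections import defaultdict

# Only these two interaction types count as conversation.
CONVERSATION_KINDS = {"retweet", "comment"}


def build_interaction_graph(interactions):
    """Count directed interactions between users.

    Returns {source_user: {(target_user, kind): count}} for records whose
    kind is a retweet or a comment; other record types are ignored.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for source, target, kind in interactions:
        if kind not in CONVERSATION_KINDS:
            continue
        graph[source][(target, kind)] += 1
    return {user: dict(edges) for user, edges in graph.items()}


# Made-up records, for illustration only.
sample_interactions = [
    ("alice", "bob", "retweet"),
    ("alice", "bob", "retweet"),
    ("carol", "bob", "comment"),
    ("dave", "alice", "like"),
]
interaction_graph = build_interaction_graph(sample_interactions)
```

Keeping the interaction type in the edge key lets retweets and comments be analyzed separately or merged later.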
Data: Ads from the Indian newspaper The Tribune
In the environmental discourse there is a big debate about economic growth. Are the ideals of sustainability and economic growth fundamentally incompatible?
Two hunches:
Data: Environmental magazines corpus
I would like to explore the conversations in the Machine Learning subreddit.
For Douban's long movie comments (https://movie.douban.com), there are many replies under popular comments. Hunches:
1) The (perceived) gender of the commenter (judged from the username) influences the number of upvotes/comments their movie comments receive, controlling for other factors. *
2) Comments under movies starring pop idols are more likely to have impolite replies. +
Dataset: currently unavailable. A script could be used to scrape it after some modification.
1. posts that are censored vs uncensored may be different content-wise, on the same topic
2. some topics are more sensitive than others
Data: social media data from China
Underlying conversation: Debates between social media users who favor and who oppose the Chinese Communist Party on a particular issue (public policies, elections, foreign affairs, social news, etc.)
Hunches:
Data: Social media data such as Sina Weibo posts.
I am interested in exploring the underlying conversations in the Q&A section of earnings conference calls, in which an analyst asks a question and a corporate manager answers the question.
Hunches:
Data: Earnings conference call transcripts from FactSet.com.
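The analyst-question and manager-answer structure described above could be extracted along these lines. This is a minimal sketch, assuming the transcript has already been parsed into `(role, speaker, text)` turns; that layout, the role labels, and the helper `pair_questions_with_answers` are hypothetical and do not reflect FactSet's actual format.

```python
def pair_questions_with_answers(turns):
    """Group each analyst question with the manager answers that follow it.

    `turns` is an ordered list of (role, speaker, text) tuples, where role
    is 'analyst', 'manager', or anything else (e.g. 'operator', ignored).
    """
    exchanges = []
    current = None
    for role, speaker, text in turns:
        if role == "analyst":
            # A new question opens a new exchange.
            current = {"analyst": speaker, "question": text, "answers": []}
            exchanges.append(current)
        elif role == "manager" and current is not None:
            # Manager turns attach to the most recent question.
            current["answers"].append((speaker, text))
    return exchanges


# Made-up turns, for illustration only.
sample_turns = [
    ("operator", "Operator", "Our first question comes from Analyst A."),
    ("analyst", "Analyst A", "How do you see margins next quarter?"),
    ("manager", "CFO", "We expect them to be stable."),
    ("manager", "CEO", "I would add that costs are under control."),
    ("analyst", "Analyst B", "Any update on the new product?"),
    ("manager", "CEO", "It launches in the spring."),
]
exchanges = pair_questions_with_answers(sample_turns)
```

Manager remarks made before the first question (for example, prepared statements) are skipped, so only the Q&A section ends up in the exchanges.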
The conversations under tweets by Chinese spokespersons are always fierce battlefields between Chinese patriots and anti-PRC users. Hunches: *1. The dialogues do not encourage mutual understanding between the two sides; the commenters' language only becomes more radical and sharp, and they find a sense of belonging inside echo chambers. +2. The two sides eventually develop a synchronized language pattern that is exclusive to participants in these dialogues, and similar wording or language features are not found in other conversation arenas.
Data: Tweet data can be fetched through the twint project.
As my corpus is the Incel.is forum, it is almost entirely made up of posts and their replies.
Hunches:
Dataset: Incel.is forum, available upon request.
Getting conversation: Corpus is already made up of these conversations
I would like to study how content moderation changes the language used in Reddit communities.
Data: Conversations from politically aligned subreddits
My corpus is subtitles from the reality show Terrace House. Most of the corpus consists of everyday conversation.
Post your response to our challenge questions.
First, describe a conversation explicit within, implicit from or underlying your data. This could be the interaction between posters on a social media platform, or comments and reactions on a discussion site, or back-and-forth in a parliamentary debate, or shared stance on an issue (e.g., a stock price, political perspective), or a shared style of speech or focus, or characters within a fanfiction universe, or concepts within a discourse, or constitutions sharing ideas and phrases. Second, state two hunches you have about patterns in this conversation, with an asterisk (*) after the one about which you are most certain, and a plus (+) after the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Third, describe the dataset from which you will construct or extract this conversation for exploration and analysis and note whether this data could be made available to class this week for evaluation (not required...but if you offer it, you might get some free work done!) If available, place (a) a link, (b) a script (to download and/or clean), (c) a reference to a class dataset, (d) or an invitation for a TA to contact you to get it. Fourth, list in numbered steps what you would do to construct/extract the conversation from this data. Please do NOT spend time/space explaining the analytical strategy through which you would explore your conversation and consider your hunches (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).