UChicago-Computational-Content-Analysis / Readings-Responses-2023


8. Conversation and Text Generation - challenge #10

Open JunsolKim opened 2 years ago

JunsolKim commented 2 years ago

Post your response to our challenge questions.

First, describe a conversation explicit within, implicit from, or underlying your data. This could be the interaction between posters on a social media platform, comments and reactions on a discussion site, back-and-forth in a parliamentary debate, a shared stance on an issue (e.g., a stock price, a political perspective), a shared style of speech or focus, characters within a fanfiction universe, concepts within a discourse, or constitutions sharing ideas and phrases. Second, state two hunches you have about patterns in this conversation, with an asterisk (*) after the one about which you are most certain, and a plus (+) after the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Third, describe the dataset from which you will construct or extract this conversation for exploration and analysis, and note whether this data could be made available to the class this week for evaluation (not required, but if you offer it, you might get some free work done!). If available, place (a) a link, (b) a script (to download and/or clean), (c) a reference to a class dataset, or (d) an invitation for a TA to contact you to get it. Fourth, list in numbered steps what you would do to construct/extract the conversation from this data. Please do NOT spend time/space explaining the analytical strategy through which you would explore your conversation and consider your hunches. (Then upvote the 5 most interesting, relevant, and challenging challenge responses from others.)

konratp commented 2 years ago

In parliamentary debates, the interactions between speakers and members of parliament in the audience who interject are interesting conversations to analyze. Interjections (Zwischenrufe) in the German Bundestag are often used by members of parliament to express approval or disapproval of what the speaker is saying. These interjections are captured by scribes and are part of the official record.

Two hunches:

Dataset: The dataset I'm using comes from the open discourse project and can be found here. It includes all speeches given in the German Bundestag since 1990.

GabeNicholson commented 2 years ago

Since my main corpus is the COVID dataset, it would be interesting to pair it with YouTube comments on COVID-related news videos.

Two hunches:

The YouTube comment sections would need to be scraped systematically, following the same YouTube channel and its videos over the three periods (2020, 2021, 2022).

  1. Find a YouTube channel such as CNN to follow over the period
  2. Use the YouTube API to scrape the comments of all videos containing words such as "Covid", "Corona", "Virus", etc.
  3. Profit
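
Steps 1 and 2 could be sketched roughly as follows. This is a minimal sketch, not a tested scraper: `fetch_page` is a hypothetical callable standing in for one request to the YouTube Data API v3 `commentThreads.list` endpoint, and the response shape (`items`, `nextPageToken`, `topLevelComment`) is assumed to follow that endpoint's documented format. Matching keywords against video titles is also an assumption, since the steps do not say where the keywords must appear.

```python
import re
from typing import Callable, Dict, Iterator, Optional

# Keywords from step 2; the prefix match also catches "Coronavirus" and "COVID-19".
COVID_PATTERN = re.compile(r"\b(covid|corona|virus)", re.IGNORECASE)


def is_covid_video(title: str) -> bool:
    """Return True when a video title mentions one of the keywords."""
    return COVID_PATTERN.search(title) is not None


def iter_top_level_comments(
    video_id: str,
    fetch_page: Callable[[str, Optional[str]], Dict],
) -> Iterator[str]:
    """Yield the text of every top-level comment on a video.

    `fetch_page(video_id, page_token)` is a hypothetical wrapper around one
    API request; pages are followed until no `nextPageToken` is returned.
    """
    token: Optional[str] = None
    while True:
        page = fetch_page(video_id, token)
        for item in page.get("items", []):
            yield item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        token = page.get("nextPageToken")
        if not token:
            break
```

Passing the request function in as an argument keeps the pagination logic separate from credentials and quota handling, so it can be checked against canned responses before any real API call is made.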

hsinkengling commented 2 years ago

I'm thinking of using data from the r/selfimprovement subreddit, which provides question-and-answer style comment data that one can use to analyze advice-giving. The conversation is typically between the asker and the answerer, but can also include other Reddit users who join to debate the usefulness of the advice given.

My hunches:

Sirius2713 commented 2 years ago

I plan to use COVID data for the final project. It'll be interesting to pair it with financial statements from major listed companies.

Two hunches:

  1. The sentiment of financial statements from the early stage of the pandemic is much more negative than that of recent statements, because the supply chain was disrupted early on and has been gradually restored since.
  2. Companies are now more prepared, so their financial statements mention more COVID-related policies.

The financial statement data can be found on the investor relations sites of listed companies or on the SEC website.

Jasmine97Huang commented 2 years ago

My final corpus is music lyrics. The interesting conversations would be the lyrics by group artists as opposed to solo artists. My hunches are:

  1. Group artists' lyrics/conversations are more extreme (echo chamber effect). *+
  2. Group artists' lyrics are thematically more diverse. *

Data available by request.

Jiayu-Kang commented 2 years ago

My project is on Amazon movie reviews, but I'm also curious about what changes or new patterns can be discovered when users are allowed to comment on and react to what others said. Such conversations can be found in the r/Movies or r/TrueFilm subreddits, where reviews and opinions on movies are presented in the form of question-and-answer or discussion. Hunches:

  1. The heterogeneity among users in the same conversation (i.e., word use, language patterns, or user characteristics) increases over time. +
  2. The number and length of comments are associated with the mentions of names of specific actors/directors/characters in the post. *

isaduan commented 2 years ago

On the Deception in Diplomacy dataset: https://convokit.cornell.edu/documentation/diplomacy.html

  1. speaker_intention of lie vs. truth is most closely related to receiver_perception of lie vs. truth. *
  2. successful lying does not predict higher scores +
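
The first hunch could be checked with a simple cross-tabulation of the two labels. The sketch below assumes each message is available as a plain dict carrying the `speaker_intention` and `receiver_perception` fields named above; the label values `"Truth"` and `"Lie"` and the possibility of a missing perception label are assumptions about the data, and the toy records are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

Message = Dict[str, Optional[str]]


def crosstab(messages: Iterable[Message]) -> Counter:
    """Count (speaker_intention, receiver_perception) pairs.

    Messages missing either label are skipped.
    """
    counts: Counter = Counter()
    for message in messages:
        intent = message.get("speaker_intention")
        perceived = message.get("receiver_perception")
        if intent is None or perceived is None:
            continue
        counts[(intent, perceived)] += 1
    return counts


def agreement_rate(counts: Dict[Tuple[str, str], int]) -> float:
    """Share of labeled messages where perception matches intention."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    agree = sum(n for (intent, perceived), n in counts.items() if intent == perceived)
    return agree / total


# Toy records, not real corpus data.
toy = [
    {"speaker_intention": "Truth", "receiver_perception": "Truth"},
    {"speaker_intention": "Lie", "receiver_perception": "Truth"},
    {"speaker_intention": "Lie", "receiver_perception": "Lie"},
    {"speaker_intention": "Truth", "receiver_perception": None},
]
toy_counts = crosstab(toy)
toy_rate = agreement_rate(toy_counts)
```

A raw agreement rate can be inflated when one label dominates, so the full table in `toy_counts` is worth inspecting alongside the single number.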

ValAlvernUChic commented 2 years ago

In the hate speech dataset: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech

  1. Different types of hate speech (racism/xenophobia/misogyny) will share linguistic characteristics *
  2. These linguistic characteristics will be adopted from the group whose hate is most contemporaneous with the social environment +

NaiyuJ commented 2 years ago

The data I use here are social media posts that I scraped from seven different Chinese online communities where ethnic minorities can post about their lives and interests and initiate discussions. The dataset contains about 70,000 posts, and I will use it for my final assignment. It contains many conversations among ethnic minority groups.

  1. When the conversations are targeted at national policies, the sentiment scores are mostly positive; when the conversations are targeted at local administration, the sentiment scores are mostly negative. (+)
  2. When there are multiple different ethnic minority groups involved in one conversation, the sentiment score would be mostly negative. (*)

YileC928 commented 2 years ago

I would like to perform an exploratory analysis on the GameStop short squeeze last year with YouTube and Reddit data.

  1. Speakers tend to post more extreme comments around the short squeeze.
  2. When there is more disagreement (e.g., extremism in speech), the stock price tends to be more volatile and trading volume tends to increase.

LuZhang0128 commented 2 years ago

I would like to study a contagion model based on Twitter data containing the hashtag #BLM. There are two types of conversation: 1) retweets and 2) comments. My two hunches are:

  1. The usage of extreme words, offensive words, and extreme emotions will bring more new audiences into the conversation. That is, people who had no prior interactions with each other now form a tie. (*)
  2. Non-elite people can become the center of the discursive field through the usage of extreme words, offensive words, and extreme emotions.

I will regard each user as a participant in the conversation, and each retweet or comment as a conversational exchange.
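
One way to represent this is a directed, weighted graph with users as nodes and retweets or comments as edges. The sketch below is only an illustration under assumed inputs: the `(source_user, target_user, kind)` tuple format and both function names are hypothetical, not part of any Twitter tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Interaction = Tuple[str, str, str]  # (source_user, target_user, "retweet" | "comment")


def build_interaction_graph(
    interactions: Iterable[Interaction],
) -> Dict[str, Dict[str, int]]:
    """Map each source user to the users they retweeted or commented on,
    with the number of interactions as the edge weight."""
    graph: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for source, target, _kind in interactions:
        if source == target:
            continue  # ignore self-replies
        graph[source][target] += 1
    return {source: dict(targets) for source, targets in graph.items()}


def distinct_in_degree(graph: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Count how many distinct users interacted with each user."""
    counts: Dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)
```

The number of distinct users reaching a given user is one crude proxy for being at the "center" of the discussion; a pair of users with no edge in an earlier time window and an edge in a later one would count as a newly formed tie.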

pranathiiyer commented 2 years ago

  1. There is consistency between bride-seeking ads and groom-seeking ads in Indian newspapers (w.r.t. caste, physical appearance). These are conversational in the sense that the requirements in one category are almost the descriptions in the other. *
  2. There is a great inconsistency between these two categories of ads. +

Data: Ads from the Indian newspaper The Tribune

mikepackard415 commented 2 years ago

In the environmental discourse there is a big debate about economic growth. Are the ideals of sustainability and economic growth fundamentally incompatible?

Two hunches:

Data: Environmental magazines corpus

hshi420 commented 2 years ago

I would like to explore the conversations in the Machine Learning subreddit.

  1. posts about text-related techniques would get fewer replies
  2. speakers who post a question are more likely to originate from other subreddits

Hongkai040 commented 2 years ago

For long movie comments on Douban (https://movie.douban.com), there are many replies under popular comments. Hunches:

1) The (perceived) gender of the commenter (judged from the username) influences the number of upvotes/comments they receive for their movie comments, controlling for other factors. *

2) Comments under movies starring pop idols are more likely to have impolite replies. +

Dataset: currently unavailable. With some modification, the following script could be used to scrape it:

https://github.com/csuldw/AntSpider

sizhenf commented 2 years ago

1. Posts that are censored vs. uncensored may differ content-wise, even on the same topic
2. Some topics are more sensitive than others

Data: social media data from China

Qiuyu-Li commented 2 years ago

Underlying conversation: Debates between social media users who favor and those who oppose the Chinese Communist Party on a particular issue (public policies, elections, foreign affairs, social news, etc.)

Hunches:

  1. We can construct a linguistic model for "the government's language" using government-controlled media such as People's Daily.
  2. This model can be applied to detecting the Internet Water Army. *+
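
As a rough illustration of what such a linguistic model might look like, the sketch below fits an add-one-smoothed unigram model on tokenized reference text and scores new posts by their average log-probability. The unigram model is a stand-in chosen for brevity, not a method stated above; the class and method names are hypothetical, and the input is assumed to be already tokenized (Chinese text would first need a word segmenter).

```python
import math
from collections import Counter
from typing import Iterable, List


class UnigramModel:
    """Add-one-smoothed unigram language model over pre-tokenized documents."""

    def __init__(self, documents: Iterable[List[str]]) -> None:
        self.counts = Counter(token for doc in documents for token in doc)
        self.total = sum(self.counts.values())
        # One extra vocabulary slot stands in for all unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def log_prob(self, token: str) -> float:
        """Smoothed log-probability of a single token."""
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def avg_log_prob(self, tokens: List[str]) -> float:
        """Average per-token log-probability; higher means closer to the
        reference text the model was fitted on."""
        if not tokens:
            raise ValueError("cannot score an empty post")
        return sum(self.log_prob(token) for token in tokens) / len(tokens)
```

A post scoring high under a model fitted on government-controlled media, relative to a model fitted on ordinary user posts, would be a candidate for closer inspection; the score alone would not establish who wrote it.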

Data: Social media data such as Sina Weibo posts.

chentian418 commented 2 years ago

I am interested in exploring the underlying conversations in the Q&A section of earnings conference calls, in which an analyst asks a question and a corporate manager answers the question.

Hunches:

  1. More aggressive questions from analysts correspond to more positive and affirmative answers from managers. +
  2. Surprising language in managers' answers can be explained by linguistic features of the analysts' language. *

Data: Earnings conference call transcripts from FactSet.com.

Emily-fyeh commented 2 years ago

The conversations under tweets of Chinese spokespersons are always fierce battlefields for Chinese patriots and anti-PRC users. Hunches:

  1. The dialogues do not encourage mutual understanding between the two sides; the commenters' language only becomes more radical and sharp, and they find a sense of belonging inside echo chambers. *
  2. The two sides eventually develop synchronized language patterns exclusive to the participants of these dialogues; similar wordings or language features would not be found in other conversation arenas. +

Data: Tweet data can be fetched through the twint project.

ZacharyHinds commented 2 years ago

As my corpus is the Incel.is forum, it is almost entirely made up of posts and their replies.

Hunches:

    • Replies will "match" the intensity of emotion, slang, etc. of the post they are responding to
    • Posts which are unrelated to the Incel identity (use less slang or narrative phrases) will generate the most disagreement among the replies

Dataset: Incel.is forum, available upon request.

Getting conversation: Corpus is already made up of these conversations

sudhamshow commented 2 years ago

I would like to study how content moderation changes the language used in Reddit communities:

  1. I expect to see that conversations that follow a moderation event are mellow and non-controversial *
  2. It would be surprising to observe that people who are relatively quiet on some subreddits are more provocative on others, and might draw support from like-minded people in the former when censored in the latter +

Data: Conversations from politically aligned subreddits

ttsujikawa commented 2 years ago

My corpus is subtitles from the reality show Terrace House. Most of the corpus consists of everyday conversation.

  1. People in a culturally diverse community become more explicit as they get used to the community, so I expect the dynamics in semantics to show an increasing trend.
  2. People behave the same way in both diverse and homogeneous communities.

Data: Netflix