3. Discovering Higher-Level Patterns - challenge

JunsolKim commented 2 years ago

Post your response to our challenge questions.

First, write down three intuitions you have about broad content patterns you will discover in your data. Place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Second, describe the dataset(s) on which you will build an unsupervised model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, or (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise unsupervised strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

konratp commented 2 years ago

Three intuitions:

East German members of the Bundestag are more prone to discussing the German Democratic Republic (GDR) in positive terms.
Parties of the so-called democratic center (i.e. excluding the far-right and the far-left) employ a much more critical language of the GDR than those at the political extremes.
*The more time passes since German reunification, the greater the divergence in sentiments towards the GDR between parties of the democratic center and those at the political extremes becomes.

The dataset:

The dataset I will be drawing on is a (very large!) csv file containing all speeches ever given in the German Bundestag. I will only analyze those speeches given after the German reunification in 1990. The data can be found here. I would also love to discuss this dataset with anyone who's interested in the project!

Jasmine97Huang commented 2 years ago

Intuitions:

Different genres of music will manifest themselves in an (LDA) topic model.
Time-based trends should also be evident with unsupervised learning. *
Gender-based trends, with interaction effects with time, will be evident in clustering methods. +

One caveat, though, is that my current popular music dataset (190,000+ entries) is not balanced across years and artists' (binary) gender. The data is available by request!

pranathiiyer commented 2 years ago

Intuitions:

Topics around physical appearance and caste will emerge in year-wise documents of matrimonial ads, and this has not changed over the years. *
Topics around caste and appearance fail to emerge in advertisements of more recent years. +
Topics centered around women are different from topics centered around men.

Data can be provided upon request, but I will need to scrape advertisements for more years and compile them into a meaningful corpus for topic modelling/clustering.

GabeNicholson commented 2 years ago

The corpus I am interested in is on the coronavirus, from https://www.english-corpora.org/corona/. It has detailed text records from articles and media sources that show how the language around the pandemic has evolved over its course. With so many articles and so much text data, I suspect a topic model could reveal some very interesting insights about how topics have changed over the course of the pandemic.

Three intuitions:

* More hostile language is associated with lockdowns in 2021 compared to 2020.
The way people speak about masks and other pandemic-related items will have changed, and a topic model will be able to summarize the largest changes by evaluating topics over time.
++ Vaccines are spoken about more now compared to when they were released.

ValAlvernUChic commented 2 years ago

Intuitions:

1) Infoarticles about race in Singapore will privilege concepts of harmony, with little about conflict or tension.
2) There will be a noticeable absence of content related to the country's history with immigrants, with the articles instead privileging our colonial history. *
3) A time period analysis should show that discourse related to our regional economic and social vulnerability will have been static over the years.

Data can be provided upon request, but it has not yet been sorted by year and only has about 2,000 entries.

mikepackard415 commented 2 years ago

Intuitions:

References to scientific literature in environmental discourse will increasingly include the social sciences. +
The discourse will diversify in terms of the number of distinct topics over time. *
The amount and intensity of urgent calls to action will increase over time.

Data: A corpus of >100,000 blog posts and articles sourced from Resilience.org, Grist.org, InsideClimateNews.org, and EMagazine.com. Can be made available!

Qiuyu-Li commented 2 years ago

Three intuitions:

People with the lowest income level hold different political views, dependent on the political/economic characteristics of the country. But the within-country similarity is high.
As income (or perhaps education level) increases, the cross-country similarity will increase, while within-country similarity decreases.
*+ A country with a low level of within-country similarity in political views is more democratic.

The dataset: Perhaps social media text like tweets, or newspapers serving different readerships (we've read a paper saying that newspapers are biased towards their major consumers).

Availability: I imagine such data is available but noisy. First, the researcher would need to translate different languages into English, and deal with potential frictions during translation. Second, if it is social media data like Twitter, we'll need another algorithm to identify those expressing political views. If it is newspaper text, then it's highly likely to be polluted by censorship.

Jiayu-Kang commented 2 years ago

Intuitions:

Topics in movie scripts change over time.*
Movie genres are evident in a topic model.
There is more word use related to explicit emotional expressions in recent movies.+

Dataset: I'm using the movies corpus available on Canvas.

isaduan commented 2 years ago

Intuition:

Policymaking relies more on imagery and normative frames than facts and evidence, even in science & technology policy. +
Policymakers' reception of expert testimony (i.e. the extent to which they update their beliefs or incorporate new ideas from experts) is politicized.
Policymakers have been adopting a more nationalistic frame of science. *

Dataset: Congressional hearings from 2001 to 2021 held by committees whose names contain 'science' or 'technology'. The data are already partially available on my local machine, accessed through the API at https://www.govinfo.gov/.

Sirius2713 commented 2 years ago

Intuition:

People with the same political preference will be impacted by the tweets of the political celebrities in that party. *
People's reactions will impact the stock market.
Political celebrities can move stock markets just by tweeting.

Dataset: Public tweets of political celebrities. It will be available once I get Twitter API access. And I welcome anyone who is interested to talk about this project.

hshi420 commented 2 years ago

Intuitions:

Topics in Chinese social media and topics in US social media will differ.
Posts on Chinese social media and US social media may show different sentiment towards the same event. *
Posts on Chinese social media and US social media show different sentiment towards their government.

Dataset: the dataset is not available. It can be super large, both cross-sectionally and longitudinally. Chinese social media companies also have policies that might forbid using their data to conduct social science or political science research.

NaiyuJ commented 2 years ago

Intuitions:

Whereas most ethnic minorities in China are content with the preferential policies, they're concerned more about networking and job-seeking in their daily life. *
Compared to groups without religious beliefs, minority groups with religions more frequently talk about sensitive topics in China, like terrorism, democracy, and independence.
Ethnic minority groups that have their own languages are more content with the state and policies in contrast to groups without unique languages.

Dataset: discussions in seven social media communities hosted by seven ethnic minority groups on the most-used Chinese communication platform.

LuZhang0128 commented 2 years ago

Intuitions:

Elite actors get involved in online social movements at a later stage, and the increasing elite participation indicates the start of the bureaucratization stage.
There exist distinguishable sub-groups focusing on different sub-topics in an online social movement's network.
Non-elites can also be core actors in online social movements, whose tweets can cause a shift in the existing cultural pattern.

Dataset: not available. Should be all Twitter data with a specific hashtag like #BLM.

sizhenf commented 2 years ago

Intuition:

The Chinese government censors critiques of its leader's personalization behaviors very harshly, but tolerates critiques of its public goods provision. +
On topics that it censors more (personalization), there is more government propaganda, and vice versa. *
We expect to see a volume burst when the government releases news on new policies.

Data: web-scraped from Sina Weibo and freeweibo.

zixu12 commented 2 years ago

Intuition:

Different groups of homogeneous goods on e-commerce can be detected by their names, descriptions, reviews. *Different groups of homogeneous goods on e-commerce can be detected by their images. +There is hidden dependence behind the reviews (e.g. people are incentivized by the sellers to provide reviews) and there might be ways to detect it.

Dataset: Data can be acquired by scraping. I have not scraped yet, but I have code that can be easily modified to complete the task. Happy to discuss and share upon request.

YileC928 commented 2 years ago

Intuition: Negative financial news tends to attract more investor attention.+ Investors tend to consume more firm-level news than macro news.* Negative financial news tends to travel faster and deeper among social networks.+

Data: Possibly Twitter data through scraping. Happy to discuss and take any suggestions.

chuqingzhao commented 2 years ago

Intuition: 1. Startups' pitches become more concrete over time because early-stage firms learn to adapt to market and investor expectations; 2. visionary and novel pitches by influential founders are more likely to be accepted by investors and lead to financial returns; 3. startups' pitches focus on specific products and services and sell new concepts over time.

Data: Crunchbase data, companies' self-descriptions. Happy to share data upon request.

chentian418 commented 2 years ago

Intuition: *Uncertainty is grouped with macroeconomic cyclicality, as they jointly affect management and analyst forecasts. +Non-financial news can be grouped by market sentiment, as there should be implied chains of impact through which different companies connect and have influence on relevant parties. Financial news can be grouped by its relevance to market expectations, i.e., whether it is relevant to increasing/decreasing expectations of the corresponding firm's performance.

Data: https://www.english-corpora.org/now/; I would love to talk about the pattern exploration with TAs and anyone interested!

Emily-fyeh commented 2 years ago

Intuition:

The global perspectives of Taiwan citizens fluctuate along with the cross-strait relationship.
Netizens resonate more with the close-to-life and meme-like materials separating the idea of China and Taiwan.
Taiwanese show more sense of identity (online) when being suppressed internationally.

Data: Need to find a way to get the Facebook data; otherwise, Twitter data can only represent the part of the content that Taiwanese want to display to the world.

Hongkai040 commented 2 years ago

Intuitions about movie reviews:

During the covid pandemic, there are more reviews, and reviews become longer, compared to the same time periods in past years. *

During the covid pandemic, topics of reviews shift towards the self-oriented side compared to the same time periods in past years.

During the covid pandemic, reviews are more polarized compared to the same time periods in past years. +

Data: Douban movie reviews (currently unavailable due to anti-crawler issues): https://movie.douban.com

kelseywu99 commented 2 years ago

*Fake news uses more keywords aimed at SEO and click-through rates.
Fake news headlines contain more exclamation marks than credible news headlines.
+Fake news mentions the Democratic Party and its members more often.

Below is a corpus drawn from 745 of 1,001 public domains, comprising over 9.4 million articles. I would love to discuss this with a TA if anyone is interested. https://github.com/several27/FakeNewsCorpus

ttsujikawa commented 2 years ago

Intuition: 1. Responses to public policies can be inferred from citizens' social media activity. 2. People become more explicit on social media about their consumption behaviors.

I will mainly seek text data from Twitter and am still in the process of getting Twitter API access. I will look at tweets containing specific terms about COVID-19 financial stimulus.

ZacharyHinds commented 2 years ago

Intuition:

Incel forum posts with more responses use more Incel-specific slang/terms *
Longer incel posts will trend towards more intense or violent topics
Specific users' posts on the forum will trend towards intensity over time +

Data: Previously scraped Incel forums, although the data is still rough and unusable, and I am working on refining it. Available on request.

floriatea commented 4 months ago

Some countries like Kenya, Malaysia, and Zambia have more documents related to Technology, while Tanzania and India have more on Abortion, State Law, and Women's Health. This could reflect different healthcare priorities or regulatory environments. *
The US and Canada have topics emphasizing "dollar," "company," "stock," "quarter," and "revenue," which implies a focus on the financial performance and investment aspects of companies involved in telehealth.
Some topics appear consistently across all countries, such as generic topics with terms like "health," "people," and "say," suggesting that certain aspects of telehealth are universally discussed or have global relevance.

Data is from NOW corpus https://www.english-corpora.org/now/ from 2017 to 2023.

UChicago-Computational-Content-Analysis / Readings-Responses-2023

3. Discovering Higher-Level Patterns - challenge #40