JunsolKim opened 2 years ago
Two intuitions:
The semantic networks in speeches given by East German members of the Bundestag differ significantly from those given by West German members of the Bundestag
*Over time (post-1990), this difference will become less noticeable within parties, but East German and right-wing extremist parties will occupy an increasingly distinguishable semantic space.
Data:
The dataset I will be drawing on is the first corpus containing speeches given in the German Bundestag, spotty before 1990 but comprehensive after 1990. I will only analyze those speeches given after the German reunification in 1990. The data can be found here. I would also love to discuss this dataset with anyone who's interested in the project!
Intuitions:
Computing word embeddings for a corpus of popular music lyrics from the 1970s-2010s will allow me to analyze shifts in the connotations of gender slurs. I expect to see such words trend toward neutrality, or even positivity. However, I expect hegemonic gender associations to persist over time.
Data: Billboard and Spotify data available on request.
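One way the connotation drift above could be operationalized is by projecting a target word's decade-specific embedding onto a positive-negative sentiment axis built from seed words. This is only a minimal sketch: the 3-d vectors, seed lists, and target vectors below are all invented stand-ins for decade-wise word2vec models trained on the real lyric corpus.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def connotation_score(word_vec, pos_vecs, neg_vecs):
    """Mean cosine to positive seed words minus mean cosine to negative seeds."""
    pos = np.mean([cosine(word_vec, p) for p in pos_vecs])
    neg = np.mean([cosine(word_vec, n) for n in neg_vecs])
    return pos - neg

# Toy 3-d "embeddings" standing in for decade-specific models (invented).
pos_seeds = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])]
neg_seeds = [np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.9, 0.0])]

target_1970s = np.array([0.1, 0.9, 0.1])  # near the negative pole
target_2010s = np.array([0.7, 0.4, 0.1])  # drifted toward the positive pole

print(connotation_score(target_1970s, pos_seeds, neg_seeds))  # negative score
print(connotation_score(target_2010s, pos_seeds, neg_seeds))  # positive score
```

With real data, each decade's vectors would come from a separate embedding model (or an aligned joint model), and the seed lists would be validated sentiment lexica rather than two toy words per pole.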
Intuitions:
1. The similarity or relatedness between news reports in two different countries can be used to identify big events. The intuition is that news media in different countries care about different problems; when they start reporting on the same issue, it must be very important and have global impact.
2. The auto-similarity of news media over time can be used to track recurring topics in history.
Data: (d) Dear TA: I'm a little bit attracted by my ideas the moment I wrote them down. Do you think they are interesting and novel enough to work on? If they are, is there any data source you would recommend for me to explore? Thank you!
You might find the GDELT project interesting! https://www.gdeltproject.org/
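The first intuition in that response (cross-country similarity flags big events) could be sketched as a daily cosine similarity between the two countries' news text, with spikes above a threshold marking candidate shared events. A minimal sketch with plain term-frequency vectors; the headline digests and the threshold value are invented for illustration.

```python
from collections import Counter
import math

def tf_cosine(doc_a, doc_b):
    """Cosine similarity between simple term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical daily headline digests from two countries (invented).
days = [
    ("local election results announced", "regional harvest festival opens"),
    ("global pandemic lockdown declared",
     "global pandemic lockdown declared nationwide"),
]

THRESHOLD = 0.5  # assumption: would need tuning on real data
for day_a, day_b in days:
    sim = tf_cosine(day_a, day_b)
    if sim > THRESHOLD:
        print("candidate shared event:", day_a, "| sim =", round(sim, 2))
```

On real GDELT-scale data one would swap term frequencies for TF-IDF or document embeddings and translate or align vocabularies across languages, but the spike-detection logic stays the same.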
Intuition:
Intuition:
Data: web-scraped from Sina Weibo and freeweibo.
Intuitions:
1. Topics around physical appearance and caste will emerge in year-wise documents of matrimonial ads, and this has not changed over the years.*
Data can be provided upon request, but I will need to scrape advertisements for more years and compile them into a meaningful corpus.
Intuitions:
Data:
I'll welcome anyone who's willing to discuss this project with me.
I stick to the same intuitions from last week, but using different datasets.
Intuition: *Uncertainty is grouped with macroeconomic cyclicality, as they jointly affect management and analyst forecasts. +Non-financial news can be grouped by market sentiment, as there should be implied chains of impact through which different companies connect and influence relevant parties. Financial news can be grouped by its relevance to market expectations, i.e., whether it is relevant to increasing or decreasing expectations of the corresponding firm's performance.
Financial news data: Dow Jones Newswires; analyst-level data: I/B/E/S.
I'd love to talk about the model building with TAs and anyone interested!
Intuitions:
Data Source: Country Reports on Human Rights Practices (U.S. Department of State) (https://www.state.gov/reports/2020-country-reports-on-human-rights-practices/); Universal Periodic Review Dataset.
Intuition:
1) Topics about race are hardly mentioned in Singaporean newspapers and, when they are, appear only in a neutral or positive context* 2) Racism will only be addressed distantly
Newspaper data available upon request!
Two intuitions:
Word embeddings trained on a corpus of online articles and blog posts written by a diverse range of people will differ significantly from word embeddings trained on a corpus of peer-reviewed academic journal articles. *
Semantic change will be detected more strongly in a corpus of environmental blogs and articles than in a corpus of environmental science academic literature. +
The environmental dataset is the same one we used last week. I don't have the academic dataset fully constructed yet.
Intuitions:
Intuitions:
Dataset: not yet available. It should be all Twitter data with a specific hashtag like #BLM. I am currently working out how to get historical Twitter data (the official API requires a token that is hard to obtain).
Intuitions:
Data: the Movies corpus available on Canvas.
Intuition: Negative financial news tends to attract more investor attention.+ Investors tend to consume more firm-level news than macro news.* Negative financial news tends to travel faster and deeper through social networks.
Data: possibly Twitter data through scraping. Happy to discuss and take any suggestions.
Intuitions:
Dataset: discussions from seven social media communities hosted by seven ethnic minority groups on the most-used Chinese communication platform. An example looks like this.
Intuitions: *Fake news articles use simpler yet stronger adjectives, gearing toward a broader sentimental audience base with or without backgrounds in higher education. +Fake news articles are about politics and national security rather than soft news and feature stories. Fake news stories craft their own "dumbed-down" words to break down opaque political terms for their audiences, despite being fake in content per se.
Dataset: the fake news corpus I proposed to use last week.
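The "simpler yet stronger" claim above suggests comparing lexical-complexity proxies between fake and mainstream articles. A minimal sketch: the two snippets below are invented examples, and the two proxies (mean word length, type-token ratio) are crude stand-ins for proper readability measures such as Flesch-Kincaid.

```python
def lexical_profile(text):
    """Two crude complexity proxies: mean word length and type-token ratio."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    mean_len = sum(len(w) for w in words) / len(words)
    ttr = len(set(words)) / len(words)  # type-token ratio: vocabulary variety
    return mean_len, ttr

# Hypothetical snippets, invented purely for illustration.
fake = "Shock! Huge bad deal! Huge bad lies! They hid the big bad truth!"
mainstream = ("Legislators negotiated amendments addressing "
              "constitutional jurisdiction concerns.")

fake_len, fake_ttr = lexical_profile(fake)
main_len, main_ttr = lexical_profile(mainstream)
print(round(fake_len, 2), round(main_len, 2))  # fake uses shorter words here
```

The intuition predicts fake articles score lower on mean word length and lean on a smaller, repeated vocabulary; whether that holds in the actual corpus is exactly what the test would show.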
Intuitions about short movie reviews (https://movie.douban.com):
Review upvotes can be predicted by post time and content (sentiment, relatedness to the movie, etc.). *
Reviews become more polarized and self-centered over time. +
Data: Douban movie reviews (I found a scraper script on GitHub: https://github.com/csuldw/AntSpider).
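The upvote-prediction intuition above amounts to a regression of upvotes on post time and content features. A minimal sketch via ordinary least squares; every feature value and upvote count below is invented, and real review content would of course yield many more features than one sentiment score.

```python
import numpy as np

# Toy features per review: [hours since release, sentiment in -1..1] (invented).
X = np.array([
    [1.0,  0.9],
    [2.0,  0.5],
    [30.0, -0.8],
    [48.0, -0.2],
    [5.0,  0.7],
])
y = np.array([120.0, 80.0, 5.0, 10.0, 60.0])  # invented upvote counts

X1 = np.column_stack([np.ones(len(X)), X])    # prepend an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None) # OLS fit

def predict(hours, sentiment):
    return coef[0] + coef[1] * hours + coef[2] * sentiment

early_positive = predict(1.0, 0.9)   # posted early, positive sentiment
late_negative = predict(40.0, -0.5)  # posted late, negative sentiment
print(early_positive > late_negative)
```

On scraped Douban data, upvote counts are heavily skewed, so a count model (e.g. negative binomial) would likely fit better than plain OLS; the sketch only illustrates the feature-to-upvotes setup.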
Intuitions:
Intuition:
Data: NOW corpora. (Still working on getting an optimal subsample).
Intuition: I will employ semantic analysis on scripts of a reality show, comparing one from the United States with one from Japan. This would allow me to reveal cultural differences in how people build relationships with each other (somehow) in reality.
Data: from Netflix's "Terrace House"
Intuition:
Data: Currently still scraping, transcribing and translating speeches of various political figures
Two intuitions:
- *Covid information keywords will have different contextual embeddings later in the pandemic.
- +Vaccines and booster shots have different contextual embeddings associated with them, for better or worse.
The corpora I am interested in is on the Coronavirus from https://www.english-corpora.org/corona/. It has detailed text records from articles and media sources that show how the language around the pandemic has evolved and changed over the course of the pandemic. With so many articles and text data, I suspect a word embedding model could reveal some very interesting insights about how words have changed over the course of the pandemic.
Similar to @Halifaxi, I'm interested in the changing contextual embeddings of covid terms over time. I would focus more on fake news during the pandemic. The intuitions are:
Dataset: a Kaggle dataset that collects covid-related fake news articles and posts. I'd like to discuss with anyone interested!
Words that changed the most from the cleaned telehealth corpus from 2017 to 2023:
Data: purchased NOW data from 2017 to 2023.
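A standard way to rank "words that changed the most" between two yearly embedding spaces is orthogonal Procrustes alignment followed by per-word cosine distance. A minimal sketch: the 2-d matrices below are invented stand-ins for the 2017 and 2023 embedding models, and the word list is hypothetical.

```python
import numpy as np

def procrustes_align(A, B):
    """Rotation R minimizing ||A @ R - B||_F (orthogonal Procrustes, via SVD)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def change_scores(A, B, words):
    """Cosine distance per word after aligning space A onto space B."""
    aligned = A @ procrustes_align(A, B)
    scores = {}
    for i, w in enumerate(words):
        u, v = aligned[i], B[i]
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        scores[w] = 1.0 - cos
    return scores

# Toy spaces: "visit" and "doctor" keep their positions, "telehealth" moves.
words = ["visit", "doctor", "telehealth"]
A = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # stand-in 2017 vectors
B = np.array([[1.0, 0.0], [0.9, 0.1], [0.7, 0.7]])  # stand-in 2023 vectors

scores = change_scores(A, B, words)
most_changed = max(scores, key=scores.get)
print(most_changed)
```

Alignment matters because two independently trained embedding models differ by an arbitrary rotation; without it, raw vector comparisons across years are meaningless.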
Post your response to our challenge questions.
First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction--to condition your surprise.

Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions.

(Then upvote the 5 most interesting, relevant and challenging challenge responses from others).