Open JunsolKim opened 2 years ago
How does the partisan composition of the US Supreme Court affect its ability to generate and absorb new ideas and principles?
I plan to sample from US Supreme Court Argument Transcripts (https://www.supremecourt.gov/oral_arguments/argument_transcript/2021) and download the PDF documents. Depending on my approach, I might also assemble the corresponding US Supreme Court Opinions (https://www.supremecourt.gov/opinions/slipopinion/21).
How have the topics and sentiments discussed by the activists and thought leaders of the environmental movement evolved in the last 15 years?
There are several online environmental magazines, with thousands of articles spanning this time period. These include Grist, Resilience.org, InsideClimateNews, and EMagazine. Of course, each of these sources has a slightly different focus and tone, but taken together I think they make a useful sample of environmental discourse.
How do East and West German members of the Bundestag talk about the German Democratic Republic, and are East Germans who hold positive views of the communist past proportionally represented in this discourse?
I plan to utilize this dataset, which contains all speeches given in the German Bundestag since reunification in 1990. There might be other datasets I could use to provide supplemental data, such as government speeches or data from Landtage (state parliaments).
How does the composition of fact-checks affect their reception on Twitter? Which topics are more salient? What is the effect of including numerical vs. non-numerical information?
I don't yet have a dataset. I'm currently working on the reception of fact-checking in Argentina, drawing on survey and in-depth interview data. Since the main fact-checker in the country is mainly known for its presence on Twitter, I was considering collecting data from there (but I will need to address the bias that this mechanism may imply).
Can an algorithm classify obviously misleading tweets?
Misinformation is everywhere on Twitter. We have all seen it, and for the most part it is almost too obvious that it comes from either a bot or somebody who has watched too much Alex Jones. What I propose is for a few people to scrape 100 tweets on controversial posts (political op-ed topics, vaccine information posts from the WHO or CDC, etc.), take 50 tweets of obvious misinformation and 50 cogently written ones, and see if a simple model can tell which is which.
How did sentiment toward Foreign Domestic Workers and Migrant Workers change after COVID-19 in Singapore?
I plan on scraping and spidering the local forum HardwareZone, The Straits Times, Reddit, and Mothership. The variety of sources is an attempt to account for different demographics among the Singaporean public.
How can the tweets of political celebrities impact company performance?
Many political celebrities mention (denounce or praise) companies in their tweets to support their own arguments, especially former president Donald Trump. I plan on scraping tweets of political celebrities such as the current POTUS, party leaders, and former president Trump (he was a typical case before his account was blocked). I also plan to combine the text data with the stock prices of the mentioned companies to investigate how those tweets impact company performance in capital markets.
How have the allegations of "sedition" changed over the years in India and the US, and what topics emerge from these? How might this be related to the political regime?
I plan on using cases filed under the Sedition Act and the Indian Penal Code in the US and Indian supreme courts respectively. If feasible, this could be extended to federal courts too. Twitter data (at the risk of selection bias) can be used to capture more informal accusations and discourse around sedition online.
What types of part-time jobs were advertised to University of Chicago students in the 20th century?
The data would come from the University archives of The Daily Maroon (available from the 1900s to the 1980s): https://campub.lib.uchicago.edu/search/?f1-title=Daily%20Maroon
At the end of each issue, there is a section of "People Wanted" ads. These ads could be collected to form a corpus of text.
How do the content and sentiment related to a specific hashtag change over time? What does the pattern tell us about the spread of information?
I will use Twitter data by first identifying one specific hashtag (say #BLM). All tweets with the identified hashtag can be scraped using tools such as TweetMoaSharp. Another approach to getting my data is through Crimson Hexagon, a third-party tool. It is a commercial social media analytics service certified by Twitter, and it provides full access to both current and historical Twitter data streams. If Twitter is not scrapable, I will use Reddit instead.
Do the linguistic styles of the essay-writing portion of high school and college entrance exams affect students' style of political reasoning later on?
I will use two data sources. The first is an archive of exemplary essays selected by entrance-exam test makers in Taiwan. The second is a web scrape of the political "subreddit" of Dcard, a semi-anonymous discussion forum that is popular among college students in Taiwan.
How do mergers and acquisitions affect a company's culture? How do employees from merged companies adapt to new organizational cultures? And what are the consequences for the company's innovation and future performance?
To answer the questions above, I will collect employee reviews from Glassdoor. I will trace the increase or decrease in the company ratings assigned by employees, and then conceptualize the company's culture shift by detecting linguistic changes before and after M&As (potential NLP techniques: sentiment analysis, cosine similarity).
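As a rough illustration of the cosine-similarity idea mentioned above, the sketch below compares bag-of-words vectors built from reviews written before and after a merger. The review texts, the function names, and the simple word-count representation are all assumptions made for illustration; nothing here comes from Glassdoor, and a real analysis would likely use TF-IDF or embeddings instead.

```python
import math
import re
from collections import Counter


def bag_of_words(texts):
    """Count lowercase word tokens across a list of review texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reviews, invented for illustration only.
before = ["Great team and flexible hours", "Flexible culture, great people"]
after = ["Rigid processes and long hours", "Management changed, culture feels rigid"]

similarity = cosine_similarity(bag_of_words(before), bag_of_words(after))
print(f"before/after cosine similarity: {similarity:.3f}")
```

A lower score between the two periods would suggest a larger shift in vocabulary, though it cannot by itself say whether the shift reflects a change in culture.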
Are ethnic minorities in China content with the state policies and political institutions (like government), especially compared to Han Chinese? What are the differences among different ethnic groups? What do they care about in their daily life?
I will scrape data from the corresponding forum for each minority group, use word2vec to classify posts into my pre-defined categories, and do regressions on their political attitudes.
What is the trend in sentiment toward working from home since COVID-19? Does this sentiment affect the individual's overall emotional state?
I will use tweets with related hashtags (e.g., #WFH) and examine the trend. I will also check whether users who express a specific attitude toward WFH while working from home show a change in emotion in their other posts.
How do company responses to customer reviews affect reviews (e.g., quantity, sentiment, length of each review) afterward? I will use multiple data sources such as Yelp, Tripadvisor, and Expedia.
Why are some media influencers on TikTok more popular than others? Are there any common features among them or their creative content?
I will try to use the TikTok API to retrieve demographic information on popular influencers, their number of followers, the click rate of their videos, and other indicators of popularity. I intend to do image or video analysis of their creative work to identify common features. By employing ML methods, I intend to see whether a predictive model exists and works.
How do foreign media refer to Taiwan in their news coverage during the pandemic: as a country, an entity, or a government?
I would like to scrape articles about Taiwan from major international media in English. I intend to find out how they refer to Taiwan across different time periods and topics during the COVID pandemic. Then, I can analyze whether the terms used for Taiwan depend on the pandemic phase and on the relationship between Taiwan and other countries. A further step would be finding out what kinds of adjectives were used to describe Taiwan, which requires the introduction of linguistic rules.
My research question: what tendencies in adjudication can be seen across different arbiters' or judges' reviews?
This can be divided into two lines of research. First, it is possible to spot a judge's tendencies among the cases they have handled: what are their opinions and tendencies on different types of cases? Also, are some of them stricter than their peers, or, conversely, are they inclined to be more lenient? In the judicial system, law firms still rely on traditional ways to analyze and win a case, and I wonder what machines can do to change the environment in this field. The second line could be tracing the transition of adjudication with topic modeling: how do judgments and people's interpretations, feelings, and understandings of the law change over time? This question can be useful for the reform or transformation of the judicial system.
The textual data I will use are adjudications/judge reviews from federal courts and some state courts.
My hypothesis: authoritarian regimes selectively censor criticism. In particular, they tolerate critiques of the government's provision of public goods but censor critiques of the leader's backsliding/deinstitutionalizing behaviors. My empirical focus is China, and the text data I'll use is social media posts from Sina Weibo (the Chinese Twitter).
How does sentiment vary when reporting domestic and foreign news on social media?
I will use the Twitter API to get news tweets from news media accounts like CNN, ABC, and NBC. I will use hashtags and probably simple word detection to identify which country the news is related to.
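The "simple word detection" step could look something like the sketch below, which tags a tweet with every country whose keywords or hashtags appear in it. The keyword lists, the `COUNTRY_KEYWORDS` name, and the `detect_countries` function are hypothetical and only illustrate the idea; a real study would need a much fuller gazetteer and would have to handle ambiguous place names.

```python
import re

# Hypothetical keyword lists, invented for illustration only.
COUNTRY_KEYWORDS = {
    "United States": ["united states", "u.s.", "washington", "#usa"],
    "China": ["china", "beijing", "#china"],
    "Ukraine": ["ukraine", "kyiv", "#ukraine"],
}


def detect_countries(tweet):
    """Return the countries whose keywords occur as whole terms in the tweet."""
    text = tweet.lower()
    found = []
    for country, keywords in COUNTRY_KEYWORDS.items():
        for keyword in keywords:
            # The lookarounds stop "china" from matching inside "chinatown".
            pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
            if re.search(pattern, text):
                found.append(country)
                break  # one matching keyword per country is enough
    return found


print(detect_countries("Talks in Beijing as Kyiv watches"))
```

Tweets that match no list could be set aside for manual review rather than assumed to be domestic news.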
Who are the target audiences of traditional wife influencers, and what is the typical user profile of their subscribers?
It is speculated that those who watch tradwife influencers are, surprisingly, more often men who subscribe to alt-right ideology than women who want to be homemakers. To probe the question, I think it would be helpful to sample users with public profiles in the comment sections of tradwife influencers' Twitter and TikTok accounts (as profiles on these two platforms are usually public), and then run an analysis of whom they interact with most.
What linguistic/semantic features of some fake news articles help them go viral on social media?
I plan to get a list of fake news websites, scrape articles (possibly on certain topics), and compare these articles to those from mainstream news websites. There are websites that help count the number of times the URLs of news articles are shared.
How do investors update their sentiments and beliefs under the influence of their social network? I plan to scrape data from social media platforms such as StockTwits and Seeking Alpha. I hope to gather two types of information - user posts and their follower networks.
Can the titles of online products identify whether the products are homogeneous?
I have collected a set of data on online products in homogeneous goods categories, identified by manually checking their pictures. I am wondering whether the titles of products can make good distinctions as well. If not, can content analysis of titles and pictures together produce good classifications?
How does financial news affect sell-side analysts' decisions about revising their forecasts of firms' earnings?
I plan to use text data from the Wall Street Journal on ProQuest to develop relevant factors that have an impact on analyst forecast revisions, by implementing word embedding models and looking into the financial literature.
How do universities promote themselves through news reports?
Nowadays, all universities have news departments for self-promotion, and they enable people to view their historical reports. I plan to scrape all the historical articles from the news sections of several universities to find the general patterns and their unique strategies. UChicago: https://news.uchicago.edu/lateststories MIT: https://news.mit.edu Harvard: https://news.harvard.edu/gazette/ Stanford: https://www.stanford.edu/news/
Can we predict the factuality of reporting and bias of an online news source purely by the inherent characteristics of its website?
An online presence is a treasure trove for measuring the opinion and bias of a particular entity. I am interested in finding out whether I could make a reasonable prediction of the trustworthiness of a website using inherent characteristics such as articles from the target news website, other online presence like a Wikipedia page or a Twitter account, the structure of its URL, information about the web traffic it attracts, etc. I would like to use fact-checking APIs and other text mining tools to achieve this.
Could we predict the impacts of the financial stimulus for COVID-19 on consumer behavior based on consumers' tweets and macroeconomic variables?
We hypothesize that the causal connection between spending behaviors and sentiments expressed on social media becomes clearer in the case of the cash handout in Japan. This is because this financial support policy was extensively discussed among citizens on various platforms, and Twitter was no exception. Consumers became relatively more expressive not only about their sentiment toward the policy but also about how they planned to use the money. Also, since users on Twitter are distributed broadly among all generations up to their 50s, we could expect variation in the response across generations and types of consumption.
What is the extent to which Singaporean politicians and state-aligning political discourse cite contemporary American right-wing speech practices?
I plan to spider official state newspaper sources, press releases, published parliamentary speeches and debates, and public social media posts from Singapore politicians for shibboleths of the contemporary right-wing lexicon, and then compare them with word trends from American right-wing sources and websites.
How did the meanings of gendered slurs change in popular music lyrics? With the rise of the feminist movement globally, there is a trend of artists reclaiming gendered insults and using them in a positive light. Yet many scholars have pointed out the harm of normalized misogyny and the popularized sexualization of women. I plan to collect and analyze popular lyrics over the last five decades to understand whether there are shifts in the semantics and sentiments associated with these terms.
How has the integration of telemedicine/telehealth influenced the diagnosis and treatment outcomes for patients with different diseases in the United States, particularly in remote and underserved areas?
I will be using the NOW corpus (News on the Web) that contains 18.5 billion words of data from web-based newspapers and magazines from 2017 to 2023.
Post your response to our challenge question.
Pose a research question you would like to answer (in one, artfully worded sentence...ending with a question mark). This need not be the basis of your final project...but it could lead there. Then describe a collection of sources in a short (2-5 sentence) paragraph you would like to assemble, scrape, generate or spider (see this week’s code for examples) into a textual corpus that you believe will help you answer your stated question. Please do NOT spend time/space explaining how you will answer the question with the assembled corpus. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).