1. Sampling to Measure Meaning - Challenge

lkcao commented 10 months ago

Pose a research question you would like to answer (in one, artfully worded sentence...ending with a question mark). This need not be the basis of your final project...but it could lead there. Then describe a collection of sources in a short (2-5 sentence) paragraph you would like to assemble, scrape, generate or spider (see this week’s code for examples) into a textual corpus that you believe will help you answer your stated question. Please do NOT spend time/space explaining how you will answer the question with the assembled corpus. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

sborislo commented 10 months ago

How do free promotions affect customer reviews and the frequency of purchases in the online video game market after the promotion ends?

I would randomly select games from Steam (an online videogame service) with (i) a minimum of 100 concurrent players in the 24-hour peak and which had (ii) a free promotion at some point in time. The customer reviews would be scraped from the pages of each game on the Steam website. The concurrent player counts would be scraped from steamcharts.com. The two resulting text datasets would be matched and combined based on the associated time data for each page. This combined dataset would be the one used for the project.
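One way the time-based matching step could look is sketched below. It assumes both scrapes yield timestamped records and that each review should be paired with the most recent player count at or before it. The record fields (`time`, `text`, `players`) and the sample values are hypothetical stand-ins, not data from Steam or steamcharts.com.

```python
from bisect import bisect_right
from datetime import datetime


def match_reviews_to_counts(reviews, counts):
    """Attach to each review the latest player count at or before its timestamp."""
    counts = sorted(counts, key=lambda c: c["time"])
    times = [c["time"] for c in counts]
    matched = []
    for review in reviews:
        # Index of the last count whose timestamp is <= the review's timestamp.
        idx = bisect_right(times, review["time"]) - 1
        players = counts[idx]["players"] if idx >= 0 else None
        matched.append({**review, "players": players})
    return matched


# Hypothetical records standing in for the two scraped datasets.
reviews = [
    {"time": datetime(2023, 5, 2), "text": "Got it free, stayed for the co-op."},
    {"time": datetime(2023, 7, 15), "text": "Fun, but the lobbies are emptier now."},
    {"time": datetime(2023, 4, 1), "text": "Reviewed before any count was recorded."},
]
counts = [
    {"time": datetime(2023, 5, 1), "players": 5400},
    {"time": datetime(2023, 6, 1), "players": 3100},
    {"time": datetime(2023, 7, 1), "players": 1200},
]
combined = match_reviews_to_counts(reviews, counts)
```

Reviews that predate every recorded count get `None` rather than a guessed value, so they can be dropped or handled explicitly later.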

chanteriam commented 10 months ago

Research Question: How has access to reproductive healthcare, particularly abortion services, in news media and policy changed before and after the Dobbs v. Jackson decision nationally, in states with the strictest abortion bans, and in states with the most lenient abortion stances?

Potential Data Sources: First, I would gather information on states with the strictest and most lenient stances on abortion from sources such as Wikipedia, the Guttmacher Institute, and the Center for Reproductive Rights. Pulling samples from this list of states, I would then examine local and state government policies issued before and after the _Dobbs_ decision in those select states, as well as local news coverage of these policies and decisions. To assess national changes, I would consider Supreme Court cases regarding reproductive healthcare and abortion services leading up to the Dobbs decision, executive orders released in response to the Dobbs decision, and national news coverage from organizations across the political spectrum.

Audacity88 commented 10 months ago

Do humans feel less of a sense of purpose than they used to?

Of the many critiques of modernity, one of the most trenchant, but also the most difficult to quantify, is that modern humans lack a sense of purpose, which in the past would have been provided by religion, community or family role. I propose to examine this through an analysis of the Google Books corpus, which allows a historical analysis going back around 200 years. Moreover, it is available in nine different languages, allowing for a comparison between languages to mitigate confounding factors from studying American English alone. Accessing the corpora is not difficult, since they are available through Google; the challenge will be comparing trends across different languages.

donatellafelice commented 10 months ago

How do the current advertisements and conversations surrounding post-exposure prophylaxis for HIV compare to those of traditional HIV treatment drugs, and do they differ from how other post-exposure medications (for gonorrhea or other STIs) are discussed?

I could gather historical advertisements and articles about antiretrovirals, new and old pharmaceutical white papers/ads, white papers from NGOs, mentions of these drugs in public hearings, CDC/FDA releases, and popular sources like transcripts of videos found through YouTube's API with drug placements or transcripts of old speeches and artistic works.

anzhichen1999 commented 10 months ago

How do online customer reviews influence the sales and reputation of small businesses in the e-commerce sector?

Customer reviews can be scraped from Amazon and eBay. We can also monitor social media platforms for mentions and discussions about specific products or services; analyzing trends can help, too. The sales data for a certain product can be found on Amazon.

bucketteOfIvy commented 10 months ago

RQ: How do discussions of trans people vary between the three microblogging sites of Twitter, Truth Social, and Bluesky?

Prior to data collection, a seed list of anticipated trans-related keywords can be constructed using the author's knowledge and experience on the platforms. These keywords can be used in conjunction with three APIs -- Truthbrush, the AT Protocol, and the Twitter API -- to pull posts containing trans-related keywords. A random sample of pulled posts is then closely read to find additional trans-related keywords, followed by an additional pull of posts based on those keywords, with this process repeating until no new keywords are found.
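The pull, sample, read, repeat loop described here could be sketched roughly as follows. This is only an illustration under stated assumptions: `pull_posts` stands in for the real API queries with a simple substring search over an in-memory list, and `annotate` stands in for the manual close reading. Both functions and the toy corpus are hypothetical.

```python
import random


def pull_posts(corpus, keywords):
    """Stand-in for an API query: return posts containing any keyword."""
    return [p for p in corpus if any(k in p.lower() for k in keywords)]


def expand_keywords(corpus, seed, annotate, sample_size=50, rng=None):
    """Repeat pull -> sample -> read until a round yields no new keywords."""
    rng = rng or random.Random(0)
    keywords = set(seed)
    while True:
        posts = pull_posts(corpus, keywords)
        sample = rng.sample(posts, min(sample_size, len(posts)))
        new = set(annotate(sample)) - keywords
        if not new:
            return keywords, posts
        keywords |= new


# Toy corpus and a toy "close reading" step (hypothetical, for illustration only).
CORPUS = [
    "New transgender healthcare bill debated",
    "Starting HRT next week, transgender and proud",
    "HRT access is getting harder; nonbinary folks too",
    "Nonbinary representation in games",
    "Weather is nice today",
]
CANDIDATE_TERMS = {"transgender", "hrt", "nonbinary"}


def toy_annotate(sample):
    """Pretend reader: report any candidate term that appears in the sample."""
    return {t for t in CANDIDATE_TERMS for p in sample if t in p.lower()}


keywords, posts = expand_keywords(CORPUS, {"transgender"}, toy_annotate)
```

Plain substring matching is used only to keep the sketch short; short keywords such as "hrt" would need word-boundary matching on real data.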

HamsterradYC commented 10 months ago

Research Question: Are people more likely to express sentiments of fatigue/burnout on social media following the conclusion of mass incidents?

Based on the psychometric scale and related emotional expressions, we construct a lexicon of relevant words. We then randomly collect posts from different online social media platforms like Twitter, Facebook, Weibo, and Xiaohongshu, using specific hashtags or keywords associated with the mass incident, focusing on expressions of fatigue and other emotional states both during and after the event. Additionally, we scrape discussions from platforms like Reddit and specialized forums, where users discuss the event and share their emotional responses.

Marugannwg commented 10 months ago

RQ: What patterns of communication indicate successful posts/discussions in online learning communities?

To explore, we can scrape text from online discussion threads and posts from platforms like Slack, Discord, GitHub, or other websites and learning subtopics (like the subreddits on education). There could be various features that contribute to an upvoted post or a thread that generates more vibrant discussions.

chenyt16 commented 10 months ago

RQ: Do different social platforms share different patterns of hate speech (e.g., Twitter, Facebook, TikTok)?

Because the user profiles vary on each social media platform, we want to verify whether hate speech exhibits different patterns on different social media platforms. For example, hate speech targeting a particular object or group (e.g., immigrants) may be significantly more pronounced on one social platform compared to others. Text comments can be scraped from the mentioned social platforms.

WengShihong commented 10 months ago

RQ: What is the difference (or similarity) between Chinese state media inside and outside the Firewall in terms of linguistic style? Since many Chinese state media outlets have accounts both inside the Firewall (on Weibo) and outside the Firewall (on Twitter), I can collect media content from these accounts and compare their linguistic styles.

zhian21 commented 10 months ago

Research question: Do parents care too much (i.e., understanding extended parental care and the evolution of overparenting in contemporary human societies)?

In today’s world, children are often seen as precious and fragile beings that require consistent protection, nourishment, and unconditional support, even after reaching adulthood. Yet, this view, or this extreme level of parental care, has rarely been studied and verified.

Data on parental beliefs or care can be collected through social media posts of parents on different platforms. Then, a threshold of overparenting will need to be defined before data analysis (e.g., how many of the total posts are relevant to children). Meanwhile, an in-depth analysis could be done by scraping discussions from forums related to parenting, which offer a more complete view of modern parental beliefs. After extracting the keywords or labels, we can analyze the shift of parental beliefs within recent human history.

yuzhouw313 commented 10 months ago

Research Question: To what extent does the interplay between online and offline expressions of Sinophobic sentiment reveal nuanced patterns, dynamics, and intersections?

Potential Data Source: (1) For online text data, the research could leverage YouTube comments as an under-explored yet potent source for sentiment analysis. The primary focus will be on three prominent news channels on YouTube, selected based on their popularity and political stance (conservative, liberal, and neutral). To gather this data, the YouTube API will be utilized to scrape video links within the chosen channels with some Sinophobic keywords. A random selection of videos from these links will be made, and their comments, including replies, will be extracted and subjected to sentiment analysis.

(2) For offline text data, the Bureau of Justice Statistics' (BJS) National Crime Victimization Survey (NCVS) will serve as a comprehensive source of textual data. The survey captures victims' experiences with violence, with a dedicated section specifically addressing race-induced hate crimes. This section's content will be extracted for analysis, complementing the insights gained from the online text data. By combining both online and offline sources, this research aims to provide a holistic understanding of the relationship between Sinophobic sentiment in different contexts.

ethanjkoz commented 10 months ago

Research question: What patterns emerge in online discourse surrounding transracial adoption in adoptee-dominated spaces?

I will use data from 2-3 social media sites (possibly Reddit, Facebook, and Discord). I could potentially use Reddit's API to scrape the required textual data from the site at a small cost. I am unsure about Facebook and Discord, but I believe I could also learn the necessary web scraping techniques. I will use keywords related to identity and transracial adoption, informed by previous literature on these subjects.

runlinw0525 commented 10 months ago

Research Question: How are U.S. public universities adapting their educational policies, particularly within course syllabi, to address “AI” and its associated regulations in an ethical manner, especially in guiding instructors and students?

Corpus Building: First, I will make sure that I have access to the syllabi archive of a certain U.S. public university. Then, I will draw syllabi (in PDF format) from that university and turn them into a ready-to-use corpus, using text cleaning to prepare for potential textual analysis such as sentiment analysis. This requires scraping interactive websites like the syllabi archive mentioned above.

ddlxdd commented 10 months ago

Research Question: Has there been an increase in negative sentiment expressed on social media during and after the pandemic?

I think I will focus on Reddit. I plan to randomly select 10 subreddits as my data source. Utilizing the Reddit API, I will extract posts from these subreddits corresponding to three distinct timeframes: pre-pandemic, during the pandemic, and post-pandemic. I will then collect those posts, transform them into a uniform format, and remove posts that are not in English. Finally, I will run sentiment analysis on the posts. It is also possible to implement topic modeling to analyze the potential topic shift caused by the pandemic and its aftermath.
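A minimal sketch of splitting posts into the three timeframes, assuming each post carries a creation date. The cut-off dates are one possible choice (the WHO pandemic declaration and the WHO announcement ending the global health emergency), not something fixed by the proposal, and the `created` field and sample posts are hypothetical.

```python
from datetime import date

# Assumed boundaries; other cut-offs are equally defensible.
PANDEMIC_START = date(2020, 3, 11)  # WHO characterizes COVID-19 as a pandemic
PANDEMIC_END = date(2023, 5, 5)     # WHO ends the global health emergency


def period_of(day):
    """Label a date as 'pre', 'during', or 'post' relative to the boundaries."""
    if day < PANDEMIC_START:
        return "pre"
    if day <= PANDEMIC_END:
        return "during"
    return "post"


def bucket_posts(posts):
    """Group posts by period using their 'created' date."""
    buckets = {"pre": [], "during": [], "post": []}
    for post in posts:
        buckets[period_of(post["created"])].append(post)
    return buckets


# Hypothetical posts standing in for Reddit API results.
posts = [
    {"created": date(2019, 6, 1), "text": "Planning a summer trip."},
    {"created": date(2020, 4, 20), "text": "Week five of working from home."},
    {"created": date(2023, 9, 3), "text": "Back in the office full time."},
]
buckets = bucket_posts(posts)
```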

joylin0209 commented 10 months ago

Research Question: Does misogyny manifest itself in different patterns and terms on different social media platforms?

First, I will choose three social media platforms, Twitter, Instagram and Facebook, and utilize their APIs to compile posts containing misogyny. The data collection will involve specific misogynistic keywords, emojis, and commonly used hashtags. Subsequently, the textual data will be transformed into an analyzable corpus, preserving the contextual nuances of emojis and special terms. The primary emphasis will be on scrutinizing distinctions in the usage of misogynistic language across different platforms, particularly examining variations in context and attitude.

YichenDai commented 10 months ago

Can we trace the diffusion of specific health myths or misinformation across different social media forums, and how do these narratives develop?

To explore this question, I would gather data from a range of social media platforms known for health-related discussions, such as Twitter, Facebook, Reddit, and other health-focused forums. This corpus would include posts, comments, and threads specifically related to health topics, particularly those that have historically been associated with misinformation. By analyzing the language, sharing patterns, and user interactions, this dataset could reveal how misinformation spreads and evolves across different online communities.

Carolineyx commented 10 months ago

Research Question (RQ): Do people share similar or divergent concepts of what constitutes a good life across different geographical regions? Text data related to the notion of 'what is a good life' will be collected by accessing and analyzing texts in media, books, songs, and movies and other types of cultural products, under the categories of 'lifestyle' and 'good life'.

muhua-h commented 10 months ago

Research Question: How much are LLMs' ideologies influenced by the geopolitical location of where they are being trained?

Data: Implicitly elicit various LLMs' understanding and views of abstract concepts (e.g., religion and political orientations) using various prompts and repeated trials. I will sample LLMs primarily trained in different languages and, for those sharing a primary language (e.g., English), LLMs trained in different regions (e.g., North America vs. Europe).

YucanLei commented 10 months ago

Research question: Does the application of AI help teachers in their careers?

This question comes from a survey I read earlier, which suggested that teachers believe the application of AI allows them to do their jobs more easily. I would like to put this idea to the test and see if it is true. I think we can research this by surveying other teachers or by collecting posts from their forums, subreddits, etc.

volt-1 commented 10 months ago

RQ: How have the emotions expressed in hit songs changed over the last ten years, and are today's young people more drawn to sad music?

To explore this question, I propose scraping a comprehensive textual corpus consisting of the lyrics from Billboard's Top 100 singles over the past ten years. This dataset would capture the evolving trends in popular music, offering a window into the emotional and thematic shifts in the songs that resonate most with listeners, particularly the youth. Additionally, integrating metadata such as the song's release date, genre, and artist background could provide valuable context for analysis. Advanced NLP tools like BERT would enable analysis of the sentimental and emotional changes in popular music, potentially revealing insights into the shifting psychological landscapes of young listeners.

LyuZejian commented 10 months ago

Question: How can we represent the knowledge landscape in an efficient and comprehensible way, and trace changes in that landscape with automated methods?

Data: Academic graph datasets might work, like MAG, OpenAlex, WoS, etc. However, the main problem lies in how to construct a convincing representation of the knowledge landscape (like a heatmap, network, word cloud, etc.), how to attach meaningful information to it (like research activities, funding input, or, more conceptually, the relations between concepts), and how to develop a pragmatic method to extract it from data.

Caojie2001 commented 10 months ago

Research question: How do the employment-related factors influence the nature of inequality in the employment process?

The basic path of data processing can be roughly summarized as follows. Firstly, the employment information can be obtained from online job-seeking communities. Secondly, after irrelevant information is deleted and the texts are converted to lists of words, an SVM can be used to exclude texts that are not job postings. Finally, major variables embedded in the job posting texts, such as position type, name of the employer and nature of inequality, can be assigned to texts by SVM models. After the dataset is constructed, further analysis would be less complicated.

Dededon commented 10 months ago

Research Question: How do judges reach consensus in judicial decision-making? Drawing on their experience in legal reasoning and practice, law researchers like to refer to a small-N world of "very pivotal cases" (from the Supreme Court) that have the capacity to change the legal landscape once and for all. But a jurisprudential change is not equivalent to a behavioral change. The adaptation to new jurisprudence, in both lawyering and judicial decision-making, is a process that unfolds over time and is worth more investigation. What other correlating factors besides the "jurisprudential importance" of cases, for instance reputational factors or political inclinations, could shape the adaptation process? Here, citation behavior towards other cases could be a possible perspective for operationalizing the judicial decision-making process.

Data Source: The citation information and the raw text of the judges' opinions are available on the website CourtListener. I have finished the data collection through a 3-level snowball sampling, with approximately 37K cases in total. I'm focusing on the substantive topic of administrative litigation against police departments, particularly the police misconduct cases initiated by the Supreme Court case of Monroe v. Pape.
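For readers unfamiliar with the technique, a depth-limited snowball sample over a citation network can be sketched as a breadth-first traversal. This is a generic illustration, not the collection code actually used: the `CITED_BY` graph and the "Case A" to "Case F" names are hypothetical, and a real run would replace the lookup with queries for the cases citing each opinion.

```python
from collections import deque


def snowball_sample(seeds, neighbors, max_level=3):
    """Breadth-first snowball: map each reached case to the level it was found at."""
    level_of = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        case = queue.popleft()
        if level_of[case] == max_level:
            continue  # Do not expand cases already at the last level.
        for nxt in neighbors(case):
            if nxt not in level_of:
                level_of[nxt] = level_of[case] + 1
                queue.append(nxt)
    return level_of


# Hypothetical toy graph: each key is cited by the cases in its list.
CITED_BY = {
    "Monroe v. Pape": ["Case A", "Case B"],
    "Case A": ["Case C"],
    "Case B": ["Case C", "Case D"],
    "Case C": ["Case E"],
    "Case E": ["Case F"],
}
sample = snowball_sample(["Monroe v. Pape"], lambda c: CITED_BY.get(c, []))
```

In this toy graph, "Case F" sits four steps from the seed and is therefore left out of a 3-level sample.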

XiaotongCui commented 10 months ago

Research Question: Fixing the location (county) and time (weeks from the vaccine’s introduction), are individuals connecting with a friend population with a higher vaccination rate less hesitant to get vaccinated?

Data:

1. Twitter Users
• Sample: All the Twitter users in the US that self-reported COVID vaccination
• Sample size: 123,999
• Time span: 33 weeks (2020/11/17 - 2021/7/2)
• Location: 50 states + DC
• Predicted Characteristics: gender, races, ages, (income, political partisanship)

2. Twitter Connection figures
• Definition of friends: the accounts a user is following
• Exclude bots; identify news outlet, big names by political tendency
• Identify friends’ county. For those who only report states ⇒ assign it to the most populated county in the state

3. US County Pandemic Data
• State vaccination supply per capita, population, weekly infection cases, voting partisanship
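The fallback rule for friends who report only a state could look something like the sketch below. The profile fields (`county`, `state`) are hypothetical, and the population figures are illustrative round numbers rather than census counts.

```python
def assign_county(friend, county_populations):
    """Return the friend's county, falling back to the state's most populous county."""
    if friend.get("county"):
        return friend["county"]
    counties = county_populations.get(friend.get("state"))
    if not counties:
        return None  # No usable location information.
    return max(counties, key=counties.get)


# Illustrative populations only (not census figures).
COUNTY_POPULATIONS = {
    "IL": {"Cook": 5_000_000, "DuPage": 900_000},
    "WA": {"King": 2_000_000, "Pierce": 900_000},
}

friends = [
    {"user": "a", "state": "IL", "county": "DuPage"},
    {"user": "b", "state": "IL"},
    {"user": "c", "state": "WA"},
    {"user": "d"},
]
assigned = [assign_county(f, COUNTY_POPULATIONS) for f in friends]
```

Keeping a flag for imputed counties would make it possible to check later whether results change when those friends are excluded.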

cty20010831 commented 10 months ago

My research question is related to meta-science. Specifically: how does the number of open-access (specifically code- and data-sharing) papers in a scientific research field contribute to the development/health of that field (which could be operationalized as aggregate publication count, citation metrics, or journal impact factor) over time?

The data I intend to use to answer my research question would be scraped from scientific databases, including PubMed, arXiv, and Web of Science. I intend to focus on several dominant areas of natural science (e.g., chemistry, biology), engineering science (e.g., material science, computer engineering), and social science (e.g., psychology, political science). I will collect titles, abstracts, full texts, authors, publication dates, citation counts, and any other relevant metadata for data analysis.

Twilight233333 commented 10 months ago

I want to study the characteristics of different news websites in terms of movies.

I may use APIs to collect coverage of the same films from common mainstream websites, such as the New York Times, Washington Post, BBC, etc., and randomly sample 100 films from the last 10 years that have been covered by all three websites. Models such as sentiment analysis would then be used to abstract the specific tendencies of each major site.
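The sampling step (films covered by every outlet, then a random draw) could be sketched as below, assuming the coverage has already been collected into one list of film titles per outlet. The `coverage` dictionary and the "Film A" to "Film E" titles are hypothetical placeholders.

```python
import random


def sample_common_films(coverage, k=100, seed=0):
    """Randomly sample up to k films that appear in every outlet's coverage."""
    common = set.intersection(*(set(films) for films in coverage.values()))
    # Sort first so the draw is reproducible for a given seed.
    return random.Random(seed).sample(sorted(common), min(k, len(common)))


# Hypothetical coverage lists, one per outlet.
coverage = {
    "New York Times": ["Film A", "Film B", "Film C"],
    "Washington Post": ["Film B", "Film C", "Film D"],
    "BBC": ["Film C", "Film B", "Film E"],
}
picked = sample_common_films(coverage, k=100)
```

Titles would need to be normalized (or matched on an ID such as release year plus title) before intersecting, since outlets rarely spell film names identically.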

erikaz1 commented 10 months ago

To what extent do different types (needs more specificity) of published academic literature make their way to the public? (Inspiration: Thomas Sowell, in Intellectuals and Society, the chapter “Filtering Reality”, claimed that what we would essentially call ‘politically correct’ findings never make it past the “intelligentsia”.) Are there patterns to the content that is being filtered? Can we figure out why or how this filtering occurs?

Data sources: Compare (size/number? of, some specific) journal databases/content of journal articles to content (reporting, discussions) in online news articles, social media forums (more communal/accessible, less-academic/professional environments).

hongste7 commented 10 months ago

What kind of language impacts sharing and purchasing behavior on social media?

I would combine commercial and regular API data from TikTok and analyze sales in relation to the comments left under respective videos. I would be interested in seeing whether certain language or phrases predict higher sales or sharing behavior for products being sold.

QIXIN-LIN commented 10 months ago

Research Question: How have sentiment and themes in fan-fiction changed over recent years, reflecting shifts in media culture?

I intend to assemble a textual corpus from Archive of Our Own (AO3) for an extensive collection of fan-fiction works. This will be complemented by information from Wikipedia and IMDb to understand the original works that inspired these fan-fictions. This combination of sources will provide a comprehensive view of the evolving trends in fan-fiction, including shifts in sentiment and themes, set against the backdrop of mainstream media narratives.

Brian-W00 commented 10 months ago

Research Question: Can public attitudes from social media reflect economic situation?

Using Twitter API to get tweets in different periods
Using official economics data from FRED

The tweets represent the public attitudes and then try to see if there is a relation between these two

floriatea commented 10 months ago

How has the integration of telemedicine/telehealth influenced the diagnosis and treatment outcomes for patients with COVID and other diseases globally, particularly in remote and underserved areas?

I would use data from the NOW Corpus (News on the Web), with dates ranging from around 2017 to 2022, selecting articles that mention any of the key terms above.

yunfeiavawang commented 10 months ago

How do communities in the same niche influence each other in sentiments and topics?

I would use a dataset retrieved from Reddit. The subreddits concentrating on US politics cover the same subject but differ in various ways. I would like to use topic modeling and sentiment analysis to discern the differences between these communities.

JessicaCaishanghai commented 8 months ago

Research Question: How can we study online communities and integrate semantic meaning in our analysis?

ana-yurt commented 8 months ago

Sources I will scrape:
Zhihu: Scrape text from 10 topic tags based on their occurrence rates with the Uyghur and Hui topic tags.
Weibo: Scrape discussion related to the news coverage on Xinjiang.
Formal publications: Chinese-language historiography published by semi-governmental organs.

Research Questions: How does Chinese-language discourse talk about the ethnoculturally distinct Muslim populations and frontier regions in northwestern China? What can it reveal about the incomplete transition from empire to nation-state?

UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter
