2. Counting Words & Phrases to Trace the Distribution of Meaning -Challenge

lkcao commented 10 months ago

Post your response to our challenge questions.

Articulate a one-sentence computational linguistics hunch or hypothesis regarding the distribution of words, phrases or parsed claims within your corpus relative to some variable (e.g., time, city size, number of likes), between your corpora, or between your corpus and some linguistic baseline (e.g., all current Wikipedia articles; a sample of 2020 news articles; French tweets from 2016 Paris). This need not be critical to your final project...but it could lead there. Next, in a short (2-5 sentence) paragraph, describe why you reason this hunch or hypothesis might be correct. Finally, list the corpus or corpora on which you will test it, and mention whether it could be made available to class this week for evaluation (not required...but if you offer it, you might get some free work done!) Please do NOT spend time/space explaining how you will explore your hunch or validate your hypothesis with the mentioned corpus. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

Twilight233333 commented 10 months ago

Hypothesis: A textual analysis of the social media of OPEC staff can predict OPEC's actions on oil production

Text features appearing in news headlines based on different topics (related to the environment or related to politics) are related to oil prices, so it is possible to predict oil price changes through thematic modeling. For example, before an oil production adjustment, the social media posts of OPEC officials may be more politically related.

If it were to be done, it would likely involve extracting the text of tweets individually based on a list of officials.

Dededon commented 10 months ago

Hypothesis: Can we find changes in linguistic patterns in judicial language regarding police misconduct?

Reasoning: Like social language, judicial language is of course subject to change, along with the political environment and the development of jurisprudence. Even in police-citizen conflicts with similar social contexts, judicial language may vary in its use of descriptive terms and actions. From the civil rights era until now, how have judicial actors perceived the relationship between police, police departments, and citizens in conflict differently? Here, Stuhler (2022)'s method might be the most relevant one for analyzing such a research question.

Corpus: I use the CourtListener API to collect around 22K administrative lawsuits against police departments.

anzhichen1999 commented 10 months ago

Hypothesis: The tone and frequency of positive foreign events coverage in Chinese party newspapers are predictive indicators of the Chinese government's commitment to maintaining and enhancing the Open-Up Policy.

Corpus: People's Daily Newspaper

Reasoning: This hypothesis is predicated on the idea that state-controlled media in China, particularly party newspapers, reflect and possibly signal government policy intentions. If these newspapers increasingly report on positive foreign events, especially those aligning with China's Open Up Policy principles, it could imply a governmental inclination towards further international engagement and economic liberalization.

sborislo commented 10 months ago

Hypothesis: Words associated with violence (e.g., "war") would be more likely to be found in game titles with larger player bases.

Rationale: A key factor affecting people's motivations to play video games is engagement. Real life can often be boring or depressing, so video games are frequently used to turn one's focus away from real-life happenings. Violent media content has been shown to, on average, be more engaging than non-violent media content, so violent associations likely signal more engaging games. Additionally, violent environments have been shown to help fulfill certain needs, like autonomy (feeling in control of one's life) and comradery (since violent games often involve needing help and/or helping others). This likely contributes to the appeal of violent connotations as well.

Data: The steamcharts.com webpage, which tracks player counts and game titles.

XiaotongCui commented 10 months ago

Hunch/Hypothesis: I propose that the sentiment and tone of news articles across different media outlets exhibit discernible patterns based on the political orientation of the outlet, with a hypothesis that conservative-leaning sources might employ more assertive and critical language, while liberal-leaning sources may adopt more nuanced and empathetic expressions.

Reasoning: This hunch is grounded in the idea that media outlets with distinct political orientations tend to frame news events through different lenses, emphasizing particular aspects of a story and employing language that aligns with their editorial stance. Studies have indicated that media bias can influence the framing of news, and this hypothesis extends that idea to linguistic expression, assuming that political leanings may shape the emotional and rhetorical elements in news coverage.

Corpus for Testing: NOW corpus. A diverse collection of news articles from major media outlets with varying political affiliations, including conservative sources like Fox News, liberal sources like MSNBC, and centrist sources like CNN. This corpus would cover a range of topics and time periods.

chanteriam commented 10 months ago

Hypothesis: Between 1973 and 2022, U.S. court case rulings and policy regarding abortion access have begun to include more arguments regarding the morality of the practice, reflecting arguments made in popular news media and widening polarization in the United States.

Reasoning: The 1973 Roe v. Wade SCOTUS decision based its argument on the right to personal privacy, particularly between an individual and their medical practitioner. In contrast, the 2022 Dobbs v. Jackson decision overturned this ruling, returning the question of abortion rights to the individual states and including arguments around the viability of a fetus and 'right to life' protections. These changes could reflect, at the least, linguistic influences from news media sources on the policy realm.

Corpus: SCOTUS rulings on abortion; LexisNexis for news media sources.

yuzhouw313 commented 10 months ago

Hunch: Given similar popularity and content volume during the outbreak of COVID-19, videos from conservative news channels (e.g. Fox News) will exhibit a higher propensity for the presence of Sinophobic terms in the comment section. In contrast, videos from liberal news channels (e.g. MSNBC) are expected to demonstrate a more diverse spectrum of attitudes toward China and Chinese.

Reasoning: This linguistic hunch is based on the well-established result that the political orientation of news channels can shape the tone and content of user-generated comments. Conservative news channels may foster an environment where Sinophobic sentiments find resonance due to their editorial stance or audience composition. In fact, after manually browsing through Fox News videos about COVID-19, I noticed that there were many news clips depicting the pandemic as a bioweapon conspiracy or explicitly condemning China with inflammatory terms, which might catalyze a Sinophobic dynamic within their comment sections. In contrast, liberal news channels, known for their emphasis on diversity and inclusion, are expected to host a broader range of attitudes in their comment sections.

Corpus: A collection of comments scraped from YouTube news channels (both conservative and liberal) using its official API.

joylin0209 commented 10 months ago

Reasoning: In the past, Taiwan's political environment was dominated by two major parties: the Democratic Progressive Party and the Kuomintang. Most of the former group members have promoted social movements and emphasized democratic values in the past, with particular emphasis on "resistance to China" and "Taiwan sovereignty." Therefore, it is speculated that "Taiwan sovereignty," as the party's core value, should be particularly emphasized before the election. Since the Democratic Progressive Party was re-elected as president four years ago, the Kuomintang, one of the two major parties, should emphasize "party rotation." In recent years, more and more young people have become disillusioned with the two major parties and have turned to smaller parties, one of which is the People's Party. Therefore, it is expected that the People's Party will stand in the original position of the middle voters and criticize the blue-green camp for being just as bad. (Blue = Kuomintang, Green = Democratic Progressive Party)

Data: The data sources are public posts and external statements of the three major parties in the past six months.

naivetoad commented 10 months ago

Hypothesis: In LinkedIn job postings, there is a higher frequency of industry-specific jargon and technical terms for specialized fields, like IT and engineering, compared to more generalized fields, like customer service and marketing.

Specialized fields often require specific technical skills and knowledge, which are usually described using field-specific terminology. Employers in specialized fields also often seek candidates with very specific qualifications, which they might describe using technical language. In contrast, more generalized fields might use broader language that is understandable to a wider audience.

I would use a corpus of LinkedIn job listings, categorizing them into specialized and generalized fields based on the job title and industry.

ethanjkoz commented 10 months ago

Hypothesis: There is a significant difference in sentiment towards adoption between adoptees and other persons affected by adoption in online spaces. In particular, online spaces oriented towards "adoptees only" may serve as an echo chamber for anti-adoption sentiment, whereas a more general space, such as r/Adoption, may have much less negative sentiment. Furthermore, those within a community they deem supportive might be more likely to share their feelings in spaces such as these, though the anonymity provided by the internet may already encourage that as well. There might be some challenges regarding the amount of data collected, since one sub is much larger than the other.

Corpus: Scraped from subreddits dedicated to discussing adoption (r/Adoption vs r/Adopted), categorizing sentiment towards adoption.

Audacity88 commented 10 months ago

Hypothesis: The amount of self-centered expression in books has increased over time.

Reasoning: Since the rise of liberalism, with its commitments to individual rights and autonomy, in the 17th century, humans have enjoyed greater freedom to pursue their own ends than in the past. However, one potential downside of this freedom is a decrease in the importance of family, community, and state ties. Due to this, I hypothesize that modern humans consider themselves more, using more phrases such as "my house", and fewer ones like "our town", than humans in the past.

Corpus: Google Ngrams.

ana-yurt commented 10 months ago

Hypothesis: The number of phrases related to time perception has expanded in the past two centuries.

Since the late eighteenth century, the spread of clocks, wage-labor, and the working day has restructured our time-sense. With radio and the internet comes a simultaneous stretch of time experience across the globe. It is possible that these changes resulted in a fundamental shift in our attention space with regard to our expressions of time.

ddlxdd commented 10 months ago

Hypothesis: I think that in the "How are you today?" thread of the bipolar forum, words related to extreme emotions will appear more frequently, and in more significant contexts, than in other forums, reflecting the characteristic mood swings associated with bipolar disorder.

Reasoning: This might reveal patterns that align with the known symptoms of bipolar disorder, such as episodes of mania and depression.

Corpus: Scraped from a psych forum, posts under the thread "How are you today?".

donatellafelice commented 10 months ago

Hypothesis: I think that the use of the words women/female/those who partake in vaginally receptive sex/etc. will have increased only very slightly in advertisements for PrEP released in the US over the last 5 years, even though multiple groups have stated that their goal is to increase awareness and uptake among women. However, I believe international advertisements released in places that have seen an increase in uptake will show a proportionately higher increase.

Why: For the last 5 years, uptake of PrEP among women has been increasing internationally but not in the US, UK, or EU. Many explanations have been put forth (insurance, lack of trust in the medical system, etc.), but I wonder if the continued lag in the US is due to a lack of awareness, and even stigma, created by the general lack of ubiquity of the messaging. It is well known that multiple touchpoints and open and honest conversations by peers or persons of influence within a social group can decrease stigma and increase awareness.

Corpus: White papers and releases about PrEP, and transcripts of ads and announcements by pharmaceutical companies. Not available yet.

Marugannwg commented 10 months ago

Hypothesis/Hunch: The adoption and discussion of scientific coaching/pedagogy methods in music education is less rigorous and technology-driven compared to that in the sports/general education domain.

Reasoning:

Although there is a broader educational trend towards incorporating scientific methods and data-driven approaches in various fields, aspects like performance analytics and cognitive psychology in music instrumental playing seem less discussed than those around sports training.
(I have never heard of a prominent music performer who is famous for adopting modern technology, but many sports coaches do so rigorously and share their schemes and even data-driven approaches.)
Music learning by nature takes place in an apprenticeship setting and is based heavily on the social networks of both the mentor and the student, which might have different implications for adopting avant-garde techniques compared to sports education. The two fields can have different narratives about pedagogy.

Corpus:

Academic papers and/or non-fiction books about music and sports education, particularly focusing on adopting new pedagogy or the integration of scientific methods. Look for terminology related to teaching methods, references to scientific principles or techniques.
Regarding availability, it's more sensible to focus on open-access literature (e.g., on Google Scholar) that is narrowed to particular instruments and sports (e.g., violin education vs. basketball coaching).

Caojie2001 commented 10 months ago

Hypothesis: Information released by the Chinese government (central or local) is highly related to specific public incidents, such as the implementation of policies relevant to certain social issues or public safety incidents.

As online communities become increasingly important, the establishment of a 'good opinion base' has been a major task for governments at all levels in China. Therefore, information released online by governments becomes an ideal indicator of governmental behavior, as it is a by-product of daily governmental operation. Information released by specific local governments that relates to other regions may also provide evidence for theories about inter-regional governmental relationships.

Corpus: National and local newspapers, and blogs or articles released by government accounts on various online social platforms.

runlinw0525 commented 10 months ago

Hypothesis: If the school has a university-wide guideline that tends to support generative AI tools, then I expect to see a similar supportive trend among certain sections within a collection of course syllabi labeled with department, term, and course name.

Why: I believe that if the university has an optimistic view towards powerful generative AI tools, then it should advise instructors to explicitly state the class policy on AI when designing their course syllabi, and I assume most course syllabi will align with the university's view, although I do realize that some instructors have completely banned the usage of AI tools in their classes.

Corpus: All course syllabi published in 2023 or after. They are scraped from the university's interactive syllabi archive, and there are over 2000 course syllabi that fit. I am still working on data cleaning and integration for those syllabi, so they won't be available for this week's class.

beilrz commented 10 months ago

Hypothesis: Censorship in a region or on a platform can be detected by considering regional anomalies in n-grams and the probability of words following a specific phrase.

Reasoning: Mostly personal experience and common sense. For example, when you use a search engine in China, the prediction/auto-fill of search terms, given your existing input, is very different from the predictions Google offers. This is especially true for politically related terms.

Corpus: Assuming we want to detect censorship in China, we could use text scraped from Chinese media. We could then use a corpus of US media or Wikipedia as a baseline.
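
As an illustration only, the "probability of words following a specific phrase" idea could be sketched as below. This is a minimal sketch under stated assumptions: both corpora are already tokenized into lists of words, the function names `next_word_probs` and `probability_gaps` are hypothetical, and the toy sentences are invented for the example rather than drawn from any real corpus.

```python
from collections import Counter


def next_word_probs(sentences, phrase):
    """Estimate P(next word | phrase) from tokenized sentences.

    `phrase` is a tuple of tokens; counts are raw maximum-likelihood estimates.
    """
    n = len(phrase)
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n):
            if tuple(tokens[i:i + n]) == phrase:
                counts[tokens[i + n]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def probability_gaps(target, baseline, phrase):
    """Rank continuations by how much less likely they are in `target`
    than in `baseline` (largest positive gap first)."""
    p_target = next_word_probs(target, phrase)
    p_base = next_word_probs(baseline, phrase)
    words = set(p_target) | set(p_base)
    gaps = {w: p_base.get(w, 0.0) - p_target.get(w, 0.0) for w in words}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)


# Toy, invented data (assumption): not real media text.
baseline = [["the", "protest", "grew"], ["the", "protest", "ended"],
            ["the", "weather", "improved"]]
target = [["the", "weather", "improved"], ["the", "weather", "changed"],
          ["the", "protest", "ended"]]

gaps = probability_gaps(target, baseline, ("the",))
```

Continuations with a large positive gap are candidates for closer reading, not evidence of censorship on their own, since topical and stylistic differences between corpora produce gaps too.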

YucanLei commented 10 months ago

Hunch: A sharp drop in the review frequency of baby-related products, particularly ones that need to be bought brand new, such as diapers and baby food, can predict a drop in the birth rate.

This one is really intuitive. If you are having babies, you will need to buy baby products, or you will be raising your child as if in medieval times. Purchasing an item means you have a chance of commenting on the item. Thus more comments mean more purchases, and more purchases mean more babies, or vice versa.

cty20010831 commented 10 months ago

I am interested in whether there is an increase in mentions of words related to quantitative or computational methods in the field of psychology.

I think this could be the case given the growing collaboration of psychology with computer science, data science, statistics, and neuroscience. In addition, with the development of big data, machine learning, and AI, there has been an increasing number of quantitative or computational methods.

I intend to scrape papers using the Semantic Scholar API (which I have tried for this week's homework).

LyuZejian commented 10 months ago

Question: Measure language circulation among communities.

Are there language, vocabulary, or other patterns that are transmitted from one community (e.g., a subreddit) to another, and can we trace their flow? Similar to the power-structure paper, I am also interested in the factors that orient, promote, or prevent this kind of transmission.

erikaz1 commented 10 months ago

Hypothesis 1: Academic discourse may introduce and popularize specific terms, which may then be copied or adapted in mainstream media and social media platforms.

Hypothesis 2: I hypothesize that there will be a temporal evolution in the lexicon and sentiment surrounding Critical Race Theory (CRT) (for instance) across different data sources, with distinct patterns/shifts in lexicon marked by significant political, economic, or international events (outside the field of origin).

Academic concepts sometimes precede and shape public discourse, but it is intuitively hard to believe that once-obscure academic terminology can easily make its way into everyday discourse without a catalyst. CRT came into the public eye in summer 2020, following the murder of George Floyd and a new conversation regarding the ongoing pervasiveness of racism, after which right-wing politicians began to use the phrase for the opposite purpose, to discourage those very discussions. The nature of online discussions may have led to further rapid shifts in sentiment and language. By analyzing these linguistic dynamics across sources and over time, I aim to uncover patterns that reveal the flow of ideas from academic circles to broader public discussions, and vice versa.

The corpora for testing will include academic journal articles, a selection of news journals by slant (NYTimes, Fox, MSNBC...), and posts on Reddit discussing CRT. These three corpora represent different levels of discourse – scholarly, mainstream news, and social media. The data could be made available for evaluation, contingent on time and writing code for webscraping.

chenyt16 commented 10 months ago

Hypothesis: Media with different political leanings will exhibit different attitudes when reporting on abortion-related news.

Reasoning: Media organizations may align with specific ideologies or cater to particular demographic groups, influencing how they frame and present information on sensitive topics like abortion. For example, this Fox News piece expresses a strong negative tone towards abortion, which is rare to see in left-leaning media (https://www.foxnews.com/opinion/4-abortions-abortion-pill-words-women-getting-full-story). In addition, legislation regarding abortion varies across different states in the United States, which can make the reporting of local media more region-specific.

Corpus: news articles regarding abortion from different media in the past three years. Media includes New York Times, BBC, Washington Post, Bloomberg, Wall Street Journal, FOX News, The Blaze. (I haven't done the data collection yet.)

HamsterradYC commented 10 months ago

Hypothesis: On social media platforms, the expression of burnout will increase over time, particularly after major global or local events (e.g., pandemics and economic crises).

Reasoning: Social media platforms often reflect collective experiences and reactions to societal events. Given the global challenges in recent years, such as the COVID-19 pandemic and various economic downturns, it is plausible that individuals increasingly turn to these platforms to express their feelings of burnout. The anonymity and reach of social media also make it a likely outlet for people to share their struggles with burnout, which might not be as openly discussed in offline settings.

Corpus: This hypothesis will be tested using a corpus of Reddit and Weibo posts spanning the last five years. A specific focus will be on posts following major global events such as the onset of the COVID-19 pandemic.

XiaodiYang2001 commented 10 months ago

Hypothesis: Protesters' views will become stronger as a result of discussion on social media.

The interactive features of social media will help protesters develop a sense of social validation and identity, thereby reinforcing their views. Taking the Sri Lankan protests as an example, by analyzing protesters' tweets and extracting the words they use to express their emotions, we might find that their wording becomes increasingly strong.

Corpus: Text data of tweets obtained from Twitter through the official API.

michplunkett commented 10 months ago

Hypothesis: To what degree have white supremacist narratives crossed into our passed legislation at the national level?

Why: Kathleen Belew in her book Bring the War Home as well as the FBI in their 2005 report, have noted the infiltration of the United States military and law enforcement agencies by far-right and white supremacist organizations. It feels foolish to presume that they have simply stopped at those levels and have not, at the very least, influenced the writing of legislation at some level.

Corpus: The text of legislation that has passed the Senate, as well as the text corpus used in this analysis.

Carolineyx commented 10 months ago

Hypothesis: How people define what is "a good life" across time and geographical regions.

Corpus: Titles and content of published books, titles and lyrics of songs, titles and lines of movies.

Reason: People often write, sing, and create stories about the ideal "good life" they want to live. However, with different societal contexts, cultures, and values, people may hold different views of what is a "good life." Therefore, we might detect how they are similar and different.

yueqil2 commented 10 months ago

Hypothesis: Americans' belief in the China bioweapon conspiracy theory is related to the negative portrayal of China by the US media during the COVID-19 pandemic.

Reasoning: Negative media coverage will increase public anxiety and foster a hostile atmosphere, thus encouraging mass conspiratorial thinking and influencing the cognitive judgment of individuals.

Corpus: China-related coverage from the three most influential US newspapers from early 2020 to late 2022.

QIXIN-LIN commented 10 months ago

Hypothesis: In recent years, fan-fiction narratives have increasingly incorporated more negative plot elements and language, reflecting a shift in media culture towards darker themes.

This hypothesis is based on the trend in mainstream media, where there is a noticeable shift towards darker, more complex themes. It's plausible that fan-fiction, often a reflection of its source material and the prevailing cultural trends, would exhibit similar shifts. To test this, I'll analyze fan-fiction from Archive of Our Own (AO3) using techniques like sentiment analysis and topic modeling. These methods will help identify changes in the emotional tone and thematic content over time. The corpus, enriched with context from Wikipedia and IMDb, will provide a comprehensive view of how fan-fiction has evolved, potentially mirroring trends in mainstream narratives.

yunfeiavawang commented 10 months ago

Hypothesis: Words used in fanfiction posted on China-based forums would be less erotic compared with those used on international forums.

Corpus: Mandarin fanfiction posted on Lofter (China-based forum) and AO3 (international platform)

Reasoning: Under the authoritarian regime, subcultures like fanfiction, which primarily depicts relationships different from traditional heterosexual settings, would be censored. Because of the combination of official censorship and self-censorship based on anticipated censorship, the sensitive content on Chinese forums could be deleted or modified. Therefore, I would use record linkage to identify pairs of the same articles posted on different platforms and use word counting techniques to identify which words have disappeared in fanfiction on China-based platforms.
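
The word-counting step on linked pairs might look roughly like the sketch below. It assumes the record linkage has already produced pairs of (international version, China-based version) for the same story; the helper names `tokenize` and `dropped_words` are hypothetical, and the toy English sentences are invented. Real Mandarin text would need a proper word segmenter in place of the simple regex used here.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (assumption): lowercase runs of word characters.
    Mandarin text would need a dedicated word segmenter instead."""
    return re.findall(r"\w+", text.lower())


def dropped_words(pairs):
    """Count words present in the first text of each pair but missing
    (or less frequent) in the second, summed over all pairs."""
    dropped = Counter()
    for original, modified in pairs:
        before = Counter(tokenize(original))
        after = Counter(tokenize(modified))
        # Counter subtraction keeps only positive differences.
        dropped.update(before - after)
    return dropped


# Toy, invented pairs (assumption): (international version, China-based version).
pairs = [
    ("a tender kiss at dawn", "a tender moment at dawn"),
    ("the kiss lingered", "the scene lingered"),
]

dropped = dropped_words(pairs)
top_dropped = dropped.most_common(1)
```

Swapping the order of each pair would instead surface the replacement words that were added on the China-based platform.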

floriatea commented 9 months ago

Hypothesis: The linguistic representation of "telehealth" in online discourse varies significantly between countries with established digital infrastructure and those still developing it, particularly in terms of optimism and focus on future possibilities.

Reasoning Behind the Hypothesis: This hypothesis arises from the understanding that the digital infrastructure of a country plays a crucial role in the adoption and perception of telehealth services. In countries with advanced digital infrastructure, discussions around telehealth are likely to focus on optimizing and expanding services, reflecting a forward-looking optimism. Conversely, in countries where digital infrastructure is still under development, the discourse might center on overcoming current limitations and the potential for future growth. The variation in discussions can be linguistically traced through the prevalence of certain keywords and phrases, such as "expanding access," "innovation," "technical difficulties," and "future of healthcare." Given the corpus includes country and text fields, it's poised for an analysis that contrasts these perspectives across different national contexts.

Corpus: NOW dataset from 2017 to 2023

muhua-h commented 8 months ago

On dating app profiles, religious individuals tend to have more conventional content. Reasoning: Religious people might have more traditional views on dating and relationships, particularly with regard to their own role in a relationship as well as their expectations for their partners. Corpus: OkCupid dataset on Kaggle

Brian-W00 commented 8 months ago

Hypothesis: The sentiment expressed in tweets about the economy positively correlates with the GDP growth rate of a country within the same timeframe, reflecting public perception's impact or prediction on economic performance.

I hypothesize that positive words in tweets can indicate whether a place's economic performance (GDP growth rate) is good or bad, because people express themselves more positively on Twitter when the local economy is doing well. When there is more money, people usually feel better, and this shows in how they talk online. To see if this is true, I will look at the sentiment of tweets from different places and times and compare it with how much the corresponding economy grew or shrank. For the study, I need to collect tweets that include a location and a timestamp. This data may be shared in class for help, but I need to make sure that doing so is safe for privacy.
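
A rough sketch of that comparison, under loud assumptions: the tiny word lists, the region labels, the example tweets, and the GDP growth figures below are all invented for illustration, and a word-list score is only one of several possible sentiment measures.

```python
import math

# Hypothetical mini-lexicons (assumption): a real study would use a validated lexicon.
POSITIVE = {"good", "great", "happy", "growth"}
NEGATIVE = {"bad", "worried", "layoffs", "recession"}


def sentiment(text):
    """Score = (positive tokens - negative tokens) / total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented data (assumption): not real tweets or real GDP figures.
tweets_by_region = {
    "A": ["great growth this year", "happy with the economy"],
    "B": ["worried about layoffs", "bad recession news"],
    "C": ["the economy is good", "prices are bad"],
}
gdp_growth = {"A": 3.0, "B": -1.0, "C": 1.0}

regions = sorted(tweets_by_region)
mean_sentiment = [
    sum(sentiment(t) for t in tweets_by_region[r]) / len(tweets_by_region[r])
    for r in regions
]
r = pearson(mean_sentiment, [gdp_growth[reg] for reg in regions])
```

With only a handful of regions the coefficient is not meaningful on its own; the sketch only shows the shape of the computation.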

UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter

2. Counting Words & Phrases to Trace the Distribution of Meaning -Challenge #49