UChicago-CCA-2021 / Readings-Responses


Sampling, Crowd-Sourcing & Reliability - Challenge #49

Open jamesallenevans opened 3 years ago

jamesallenevans commented 3 years ago

First, pose a research question you would like to answer (in one, artfully worded sentence...ending with a question mark). This could be the same question you posed for the first week's assignment, or a new one that captures where your project is moving (hopefully toward your final project). Second, in a single-sentence list, describe all of the datasets and selections (e.g., REDDIT comments and responses from r/DonaldTrump through Jan. 8, when it was banned). This could also be the same as articulated in the first or second week; parenthetically note whether it could be made available to the class this week for evaluation (not required...but if you offer it, you might get some free work done!). Third, in a single sentence, describe one measurement you will use to assess your question with your dataset/corpora. Fourth, describe one or more biases resulting from your sample or your measurement that you would like your analysis to overcome. Please do NOT spend time/space explaining how you will de-bias or counter-bias your sample or your measure. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others.)

jinfei1125 commented 3 years ago

Research Question: What's the time trend of people's anxiety toward personal finance and disposable income?

Dataset: (could be made available to class this week for evaluation) 'Hot' Articles in the Personal Finance subreddit: Data

Note: the sample size is only 926 and the time period is from 2021-01-03 to 2021-01-28. I tried to expand the time period using the PRAW package but haven't figured out how; because there are only 926 hot or new articles, I guess the number of articles under each category in a subreddit (hot, new, top, etc.) is capped at 1,000. I would appreciate any advice on scraping the whole subreddit!

Measurement: The distribution of words; the frequency of words such as: housing, investment, debt, budgeting, tax, and so on

Bias: Generalization bias: most Reddit users are young people
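The note above asks how to get past PRAW's ~1,000-item listing cap. A common workaround is to query the Pushshift archive in time slices instead of paging Reddit's listings. A minimal sketch, assuming the Pushshift submission-search endpoint is available (its uptime and per-request size limits have varied over time), with the subreddit name taken from the comment above:

```python
import json
import urllib.request
from datetime import datetime, timezone

PUSHSHIFT = "https://api.pushshift.io/reddit/search/submission"

def time_windows(start, end, step_days=7):
    """Split [start, end) into (after, before) UTC epoch-second windows."""
    lo = int(start.replace(tzinfo=timezone.utc).timestamp())
    hi = int(end.replace(tzinfo=timezone.utc).timestamp())
    step = step_days * 86400
    return [(t, min(t + step, hi)) for t in range(lo, hi, step)]

def fetch_window(subreddit, after, before, size=100):
    """Fetch up to `size` submissions posted in [after, before)."""
    url = (f"{PUSHSHIFT}?subreddit={subreddit}"
           f"&after={after}&before={before}&size={size}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]

# Usage (network call left commented out so the sketch stays self-contained):
# posts = []
# for after, before in time_windows(datetime(2020, 1, 1), datetime(2021, 2, 1)):
#     posts.extend(fetch_window("personalfinance", after, before))
```

If a window returns the maximum number of results, it should be split further (or paged by the last timestamp seen), since the cap applies per request, not per subreddit.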

jacyanthis commented 3 years ago
  1. Question: How do the goals of artificial intelligence (accuracy, social benefit, fairness, interpretability, etc.) relate to each other and evolve over time?
  2. Corpuses: Newspaper articles about AI (e.g. NOW, ProQuest), scholarly papers on AI in computer science, social science, and ethics (e.g. Scopus), analyst reports (e.g. Thomson One Investext), press releases (e.g. LexisNexis), mission statements (e.g. company websites), or earnings conference calls (e.g. Refinitiv). Approximately 1990 to 2020, focused on 2013 to 2020.
  3. Measure: Degree of ethics framing: How oriented is it towards issues of fairness, bias, etc. relative to other aspects of AI discourse, such as profit and innovation?
  4. Bias: This measure may be biased downward if certain aspects of ethical discourse are not measured. For example, if using keywords such as "bias" and "fairness," the sentence, "How do we ensure AI does not unduly harm certain people?" may not be captured in the metric.

joshuabsilver commented 3 years ago

Question: How do historical events as narrated in history textbooks and popular synthetic histories relate to specialist historical research conducted by academic historians?

Corpora: Google N-gram corpus; books from commercial, educational publishers vs. books and articles (JSTOR) from academic and university press publishers.

Measurement: Distribution and frequency of words showing unitary actors and abstract processes (such as states, leaders, events) vs. a wider range of institutions, processes, events, and contingency; evaluative terms associated with these people, institutions, and events.

Bias: The corpora and measurement would not capture the circulation numbers of texts, and it is unclear how to test for political/ideological biases.

egemenpamukcu commented 3 years ago

Question: Can we predict the 'winner' of a debate in the eyes of an audience?

Corpus: Intelligence Squared and Munk Debates transcripts (not yet available as a corpus), and audience votes on the debated statement both before and after the debate (to train the algorithm).

Measurement: Debates that have a declared winner (measured by the difference in audience votes) can be used to measure accuracy. Use of vocabulary, grammar, and positivity/negativity in language can be introduced as predictors.

Bias: It would be tough to obtain a large corpus of debate transcripts, so there is a risk of overfitting the training data. The sample would represent only debate platforms that share transcripts. Moreover, results from academic debates may not generalize to debates on political and social issues, where partisanship can be a factor.

xxicheng commented 3 years ago

Research question: Over the past century, has it become easier for kids from middle-class families to engage in high culture?

Corpus: Biographies of major American orchestra musicians on the Stokowski website, and musicians’ Wikipedia pages (if available).

Measurement: The distribution of parental occupation.

Bias: A biography is more likely to mention the parents' occupations if the parents are also musicians.
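One way to operationalize the parental-occupation measurement above is a pattern-based extraction pass over the biography text. A minimal sketch; the pattern below is illustrative and would miss many real phrasings (which is itself a source of the kind of measurement bias noted):

```python
import re

# Illustrative pattern: "his father was a carpenter", "her mother was an engineer".
# Real biographies would need a much richer set of patterns or an NER pipeline.
PARENT_OCC = re.compile(
    r"\b(?:his|her)\s+(father|mother)\s+was\s+(?:an?\s+)?(\w+)",
    re.IGNORECASE,
)

def parent_occupations(text):
    """Return (parent, occupation) pairs matched in a biography snippet."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in PARENT_OCC.finditer(text)]
```

Tallying the extracted occupations across all biographies would give the desired distribution of parental occupation.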

hesongrun commented 3 years ago

Question: How much can machines learn to predict stock returns using Chinese textual data?

Corpus: Company announcements from Juchao.com, which is the official information disclosure platform designated by the China Securities Regulatory Commission (not available at present; I am still cleaning the data).

Measurement: We are going to rely on the bag-of-words representation. There are three steps in constructing document sentiment: (1) screen for sentiment-charged words; (2) get term weights via supervised topic modeling; (3) aggregate word-level sentiment into article-level sentiment. Finally, we are going to build a trading strategy of stock portfolios by linking article sentiment to stock returns.
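The aggregation step (3) above can be sketched as follows; the sentiment dictionary here is a hand-set toy stand-in for the supervised term weights that step (2) would actually produce, and the tokenizer is a placeholder (Chinese text would need a real segmenter such as jieba):

```python
from collections import Counter

# Toy term weights; in the actual pipeline these come from supervised
# topic modeling (step 2), not hand-labeling.
TERM_WEIGHTS = {"growth": 1.0, "profit": 0.8, "loss": -0.9, "lawsuit": -0.7}

def tokenize(text):
    # Placeholder: whitespace split. Chinese announcements need word segmentation.
    return text.lower().split()

def article_sentiment(text):
    """Average word-level sentiment over the sentiment-charged words only."""
    counts = Counter(tokenize(text))
    charged = {w: n for w, n in counts.items() if w in TERM_WEIGHTS}
    total = sum(charged.values())
    if total == 0:
        return 0.0  # no charged words: neutral by convention
    return sum(TERM_WEIGHTS[w] * n for w, n in charged.items()) / total
```

Normalizing by the count of charged words keeps long announcements from dominating purely by length, which also partially addresses the imbalance bias noted below.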

Bias (possible): The unbalanced nature of stock announcements. Some stocks may make more announcements than others, and the resulting strategy would be overly dominated by those stocks. Announcements may also be distributed unevenly across time.

william-wei-zhu commented 3 years ago

Question: Are new CEOs appointed from outside more likely to change company culture than new CEOs promoted from within?

Data: Glassdoor company review data

Bias: Glassdoor only contains enough reviews for large companies; it is difficult to collect sufficient reviews for smaller companies.

theoevans1 commented 3 years ago

Question: How do fanfiction texts differ from their source material with regard to diversity and inclusion?

Dataset: Davies TV and Movie Corpora compared to stories from https://www.fanfiction.net/ and http://archiveofourown.org/

Measurement: Use of inclusive language (terminology related to gender/LGBT identities, race, disability, etc.), and words used to describe characters belonging to different groups

Bias: The classification of words considered inclusive, and the classification of words used to describe characters as positive or negative

mingtao-gao commented 3 years ago

Research question: How does brand-related user-generated content (UGC) differ across theoretically categorized social media platforms? In other words, do conceptual categorizations of social media platforms in fact influence brand-related UGC and consumer engagement?

Source: Scraping social media content using public APIs from differently categorized platforms, including Facebook (relationship media), Twitter (self-media), Instagram (creative outlets), and Reddit (collaboration platforms).

Measurement: Use sentiment analysis and topic modeling to compare differences in UGC across channels.

Bias: Sampling bias, because data collected from different platforms cannot be fully inclusive, and the results may also depend on the time of data collection.

romanticmonkey commented 3 years ago

Research Question: Have movie and TV reviewers (non-professional) changed their focus of discussion over time?

Source: Amazon Movies and TV reviews (1996-2018) (https://nijianmo.github.io/amazon/index.html)

Measurement: N-gram features signifying the focus of content (e.g., synopsis, quality, plot); by year: top-frequency words, tf-idf, and topic modeling.
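The tf-idf scoring mentioned above can be sketched with the standard library alone; this uses the smoothed idf common in library implementations (e.g., scikit-learn's default), and the documents here are toy stand-ins for real reviews:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf score} dict per document.

    tf  = term count / document length
    idf = log((1 + N) / (1 + df)) + 1   (smoothed, never zero)
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return scores
```

Ranking each year's reviews by these scores surfaces the year's distinctive vocabulary, which is what tracking a shift in reviewers' focus requires.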

Bias: (1) the selection of films: the collection might not cover all genres, and each year might have different proportions of genres; (2) biased user population: this dataset only covers Amazon movie reviewers, not those active on IMDb, Rotten Tomatoes, etc.

chuqingzhao commented 3 years ago

Research Question: How does science respond to the COVID-19 pandemic? In particular, how do the diffusion dynamics of COVID-19-related knowledge change over time, and what is the relationship between publications?

Source: CORD-19 dataset

Measurement: topic modelling, cluster analysis, word2vec, network analysis (maybe a random diffusion model like IRN...)

Bias: (1) The CORD-19 dataset does not cover all publications; collecting papers from Semantic Scholar does not represent the population of publications. (2) Data on some marginal topics might be very sparse.

k-partha commented 3 years ago

Research Question: How have social and economic themes associated with the discourse on cryptocurrencies and decentralised economies evolved over the past year?

Source: Reddit forums: https://www.reddit.com/r/CryptoCurrency/, https://www.reddit.com/r/Bitcoin/

Measurement: Topic modelling, N-gram frequencies

Bias: Discourse on Reddit disproportionately represents the views of younger white males. Other social media could have discourse markers reflecting more inclusive or different demographics.

jcvotava commented 3 years ago

Research Question: What transformations have taken place in left-wing ideology and discourse as a function of time, in particular of the First International vs. Second International vs. modern inheritors (ex. Frankfurt School)?

Source: Marxists.org archives - https://www.marxists.org/archive/index.htm

Measurement: Topic modeling, n-gram frequency, changes in the usage of words (part-of-speech changes and/or changes in co-occurring words)
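The co-occurrence measure above can be sketched as a symmetric window count; comparing a target word's neighbor distribution between, say, First International and Frankfurt School texts would then show usage change (the window size of 2 is an arbitrary choice):

```python
from collections import Counter

def cooccurrences(tokens, target, window=2):
    """Count words appearing within `window` positions of each
    occurrence of `target` in a token sequence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        counts.update(t for t in tokens[lo:hi] if t != target)
    return counts
```

Running this per period and normalizing the counts into distributions gives directly comparable profiles of how a word's company changes across the corpus.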

Bias: The corpus is a "convenience" sample in that it represents texts which have already been digitized and uploaded to one particular archive. Thus there may be latent bias inherent to the act of curation itself. (One mitigating factor is that these texts probably tend to be the most popular/impactful texts of the discourse, an assertion which could probably be supported empirically.)

MOTOKU666 commented 3 years ago

RQ: How is the Latino paradox, especially for immigrants (Latino/Hispanic Americans are generally healthier than others even though they have relatively poor education and SES), holding up these days? Would we see a difference in their diet, habits, and behaviors?

Source: Twitter, Ins public data; Reddit Latino/Hispanic subs

Measurement: Topic modeling, n-gram frequencies

Bias: Those who like to post are likely the younger generation, and it is hard to tell whether a person is an immigrant or an American-born Latino, which creates problems for causal inference.

Raychanan commented 3 years ago

Research Question: How have trends in American concerns about the COVID-19 pandemic changed?

Datasets: Twitter API, Reddit, Facebook

Measurement: The frequency of words.

Bias: Twitter users in the United States represent only a fraction of the total U.S. population, and the characteristics of Twitter users do not necessarily reflect the characteristics of all Americans.

dtmlinh commented 3 years ago

Research Question: What's the relationship between news media coverage and presidential speeches on the topic of climate change? Does this relationship vary across news sources and time?

Dataset: The NOW Corpus (a sample of it), US presidential speeches corpus (https://millercenter.org/the-presidency/presidential-speeches)

Measurement: frequency of coverage that is pro- vs. anti-climate-change-mitigation, positive vs. negative coverage, clustering of coverage sentiment by news source

Bias: News articles in the NOW Corpus are heavily skewed toward pro-climate-mitigation coverage; hence, drawing a representative sample from this corpus is tricky.

RobertoBarrosoLuque commented 3 years ago

Research question: What is the relationship between news coverage, presidential rhetoric and approval rating?

Dataset: NOW Corpus, US presidential speeches corpus.

Measurement: Sentiment depicted by news coverage towards presidents.

Bias: Sentiment models are still works in progress, and designing an objective sentiment-toward-named-entity algorithm might result in biased sentiment scores.

lilygrier commented 3 years ago

Research question: What is the relationship between presidential speeches related to climate change/energy policy and climate legislation introduced during various presidencies? Working with @RobertoBarrosoLuque and @dtmlinh, hence the similarities in topics (but different angles).

Dataset: I downloaded this corpus of presidential speeches. It includes one folder for each president and a .txt file for each speech, so requires some wrangling to get it in a single file and is probably a little clunky for the purpose of today's exercises. For legislation, I've used this corpus of congressional bills. Easy to download and could be used in class.

Measurement: Topic modeling, human perceptions of the extent to which rhetoric prioritizes fighting climate change, ngram frequencies and collocations (i.e., energy security/independence vs. renewable energy)

Bias: I'm especially concerned about sampling bias. If I were to attempt to choose representative samples of a president's climate change rhetoric by hand, it would be impossible for me not to choose texts that confirm my beliefs about a president's policies (especially for presidents I perceive as anti-climate). I also think the details here might be more subtle, as no president is going to say "let's destroy the earth." It might come down to emphasizing jobs in the coal industry and the US energy economy more than emphasizing harms from fossil fuels. Sampling only explicitly climate-focused speeches may not pick up on these things, and I'm not sure how to account for this.

toecn commented 3 years ago

RQ: 1) Has political speech by politicians on Twitter evolved into more successful forms? 2a) In what ways are successful speeches similar and different? 2b) In what ways unsuccessful ones?

Dataset: Twitter speech by presidential candidates in Colombia (2010-2018).

Measurement: IV: word2vec representations. DV: likes, retweets, electoral outcomes. For measures of similarity: Kullback-Leibler (KL) divergence, χ² divergence, Kolmogorov-Smirnov (KS) distance, Wasserstein distance.
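Of the similarity measures listed, KL divergence between two candidates' unigram distributions can be sketched as follows; additive smoothing over the union vocabulary (an arbitrary choice of constant) keeps zero counts from making the divergence infinite:

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p_tokens, q_tokens):
    """D_KL(P || Q) in nats, computed over the union vocabulary."""
    vocab = set(p_tokens) | set(q_tokens)
    p = unigram_dist(p_tokens, vocab)
    q = unigram_dist(q_tokens, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```

Note that KL divergence is asymmetric (D_KL(P||Q) ≠ D_KL(Q||P)), so the direction of comparison between successful and unsuccessful speech should be fixed in advance or a symmetric variant used.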

Biases: The measures of engagement might mean different things in different contexts, and electoral outcomes can be influenced by many other variables.

sabinahartnett commented 3 years ago

Question: What language/themes are propagated by extremist-affiliated blogs, social media accounts and other news outlets? And how might these trends seep into and be reflected in online civic discourse?

Dataset: Social media platforms via public APIs (incl. Twitter, YouTube, FB); instant-messaging platforms (incl. WhatsApp, Slack); Parler, Gab, and other host sites for more extreme content

Measurement: topic models, word embeddings, sentiment & unique word frequencies

Biases: Confounding factors, multiple word definitions, user populations

yushiouwillylin commented 3 years ago

Question: Do different social science fields exhibit different political ideologies, such as right or left, liberal or conservative? Or are there time trends that seem to occur across different social science fields?

Dataset: Academic papers in COCA, or other corpora that can be found online.

Measurement: If we can identify sets of words associated with different political ideologies, word frequency might be a good measure.

Biases: If we want to identify any causal relation, we might need to take a broader view of, say, changes in news or other media as well.