dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Pre-release testing guide #156

Closed MBJean closed 3 years ago

MBJean commented 3 years ago

Overview

This issue outlines our proposed testing plan for the Gender Analysis Toolkit (GATK). In preparation for release to PyPI, the working group intends to pressure-test the GATK using real sample datasets and imagined user personas. The goal is for each member of the working group to collect notes in this issue that we can use to construct our final set of issues prior to release.

Requirements for getting started

  1. Merge open PRs:
    • [x] #151
    • [x] #152
    • [x] #153
    • [x] #154
  2. The new Quickstart Guide (#154) is updated with gender_tokens.py.
  3. The Practice datasets below are updated to include relevant events.
  4. Prompts 6-7 of the Practice research task are filled out to include additional analyses from gender_analysis.analysis.

Step-by-step guide

  1. Choose a persona and a practice dataset (both below). Read through the Quickstart Guide with both the persona and dataset in mind. Take notes as you attempt to follow the guide. Given your chosen persona, what might not be clear or obvious? What might be particularly effective? If you were to begin following the guide using your chosen dataset, what challenges (if any) might your persona experience?
  2. Choose one of the datasets and follow the steps in the Practice research task below. Collect notes about your experience. What was easily done using the GATK? What wasn't as easily done using the GATK? How did you leverage the GATK to complete the prompts? How did you discover where to go in the GATK to complete each prompt?

User personas

Practice datasets

  1. Reddit corpus: a sampling of top posts and comments from the r/starwars subreddit. The event of relevance for the Practice research task below is TBD.
  2. Core Drama 1660: a corpus of professional and other plays (cut-off date 1660) intended for performance, as well as translations and closet drama. One event of relevance for the Practice research task below is the accession of Elizabeth I to the throne (17 November 1558).
  3. Fanfic and Canon, 19th Century: a set of corpora with fanfiction from AO3 about canonical 19th century novels, along with the original novels.

Practice research task

For your chosen dataset:

  1. What's the frequency of binary gendered pronouns?
  2. Introduce a non-binary gendered pronoun set.
  3. What's the frequency of the above non-binary gendered pronoun set?
  4. What're the most common nouns associated with each of the above genders (non-binary and binary set)?
  5. What're the most common adjectives associated with each of the above genders?
  6. For the analysis task of your choice, what differences do you see before and after a specific, date-based event relevant for your corpus?
  7. TBD
  8. What're the most common adjectives associated with each of the above genders before and after the event described in your dataset's entry under Practice datasets above?
  9. Produce a data visualization for steps 3-8 above.
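
Prompts 1-3 can be roughed out without the toolkit as a sanity check on what it reports. The sketch below is only an illustration in plain Python: the pronoun groupings, the `PRONOUN_SETS` name, and the `pronoun_frequencies` helper are assumptions made here, not part of the gender_analysis API.

```python
import re
from collections import Counter

# Illustrative pronoun sets. The labels and groupings are assumptions for
# this sketch, not the toolkit's built-in definitions.
PRONOUN_SETS = {
    "feminine": {"she", "her", "hers", "herself"},
    "masculine": {"he", "him", "his", "himself"},
    "nonbinary": {"they", "them", "their", "theirs", "themself"},
}


def pronoun_frequencies(text, pronoun_sets=PRONOUN_SETS):
    """Return each pronoun set's share of all word tokens in `text`."""
    # Lowercase and keep alphabetic tokens (apostrophes included).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        label: (sum(counts[p] for p in pronouns) / total if total else 0.0)
        for label, pronouns in pronoun_sets.items()
    }


sample = "She said they would meet him after their rehearsal."
freqs = pronoun_frequencies(sample)
for label, share in freqs.items():
    print(f"{label}: {share:.3f}")
```

A plain token count like this cannot tell singular "they" from plural "they", so the non-binary figure is an upper bound rather than a true frequency.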

fyang3 commented 3 years ago

Missing metadata and corpus creation guide: https://docs.google.com/document/d/130pWbn734Bx2ZS314BQoBVm8n4M915wbr-k62YX9D2k/edit?usp=sharing

kenalba commented 3 years ago

Another set of test corpora, this one (largely) fanfictional, "Fanfic and Canon, 19th Century."

https://drive.google.com/drive/folders/1fgCSnWZRmRgMpYZdJUxmtawpH2QDWCQG?usp=sharing

There are currently 7 corpora in this folder:

  • frankenstein_fanfic.zip, 463 fanfics from An Archive Of Our Own tagged as "Frankenstein - Mary Shelley," downloaded Jan 2021.
  • emma_fanfic, 214 fanfics from AO3 tagged as "Emma - Jane Austen", downloaded Jan 2021.
  • emma_fanfic, 1614 fanfics from AO3 tagged as "Pride and Prejudice - Jane Austen", downloaded Jan 2021.
  • littlewomen_fanfic, 445 fanfics from AO3 tagged as "Little Women - Louisa May Alcott", downloaded April 2021.
  • dracula_fanfic, 719 fanfics from AO3 tagged as "Dracula - Bram Stoker", downloaded April 2021.
  • canon, which includes Emma, Pride and Prejudice, Frankenstein, Dracula, and Little Women downloaded and sanitized from Gutenberg.
  • juno_fanfic, 1863 fanfics from AO3 tagged as "The Penumbra Podcast", a narrative podcast with many nonbinary characters.

All of them have appropriately formed metadata for the Gender Analysis Toolkit.

I've edited and added this link above.

erica02139 commented 3 years ago

Under "2. Practice Datasets," the event of relevance is the accession of Elizabeth I (17 Nov 1558, so likely before/after 1559). Before and after the accession of Mary I (July 1553) would also be relevant.
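
For the before/after comparison, one possible way to partition a corpus around the event year is sketched below. The record layout (`title` and `year` keys), the `split_by_event` helper, and the sample entries are all hypothetical placeholders, not the toolkit's metadata schema or real corpus data.

```python
# Hypothetical metadata records. The field names and the entries are
# made up for illustration and do not come from the Core Drama corpus.
docs = [
    {"title": "Play A", "year": 1540},
    {"title": "Play B", "year": 1559},
    {"title": "Play C", "year": 1601},
]


def split_by_event(documents, event_year):
    """Split documents into those dated before `event_year` and the rest."""
    before = [d for d in documents if d["year"] < event_year]
    after = [d for d in documents if d["year"] >= event_year]
    return before, after


# 1559 is the first full year after the accession of Elizabeth I.
before, after = split_by_event(docs, 1559)
print(len(before), "before;", len(after), "from 1559 onward")
```

Each half can then be analyzed separately and the results compared, assuming every document carries a usable year.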

Also, we've changed the early modern dataset: it's now "Core Drama 1660," which is linked here. I've updated the "Overview" above.

That's the one with which @fyang3 has been working, and for which she's updated metadata.

fyang3 commented 3 years ago

Testing Journal with observations and bugs: https://docs.google.com/document/d/1-Uz__nxeRn2OHjPd8Q6Gz3Z4cFOa6hA8wabczF0YKQ4/edit?usp=sharing

MBJean commented 3 years ago

First round of my notes:

Ideas

Completed

MBJean commented 3 years ago

Closing in favor of #170