MBJean commented 3 years ago

Overview

This issue outlines our proposed testing plan for the GATK. In preparation for release to PyPI, the working group intends to pressure test the GATK using real, sample datasets and imagined user personas. The goal will be for each member of the working group to collect a set of notes in this issue that we can use to construct our final set of issues prior to release.

Requirements for getting started

Merge open PRs:
- [x] #151
- [x] #152
- [x] #153
- [x] #154
The new Quickstart Guide (#154) is updated with gender_tokens.py.
The Practice datasets below are updated to include relevant events.
Prompts 6-7 of the Practice research task are filled out to include additional analyses from gender_analysis.analysis.

Step-by-step guide

Choose a persona (below) and choose a practice dataset (below). Read through the Quickstart Guide with both the persona and dataset in mind. Take notes as you attempt to follow the guide. Given your chosen persona, what might not be clear or obvious? What might be particularly effective? If you were to begin following the guide using your chosen dataset, what challenges (if any) might your persona experience?
Choose one of the datasets and follow the steps in the Practice research task below. Collect notes about your experience. What was easily done using the GATK? What wasn't as easily done using the GATK? How did you leverage the GATK to complete the prompts? How did you discover where to go in the GATK to complete each prompt?

User personas

Beginner: A user for whom working with the GATK is their first experience using computational techniques on a humanities dataset.
Programming Humanist: A user who has used Python before and has engaged with humanist datasets, but would not self-identify as a developer. Example: a research librarian with access to the Core Drama 1660 dataset described in Practice datasets below who has used Python's NLTK package in the past to perform basic tokenization and word frequency counting.
Humanist Programmer: A user who would self-identify as a developer, is knowledgeable of Python and its ecosystem, and is interested in applying their experience with these tools to a humanist dataset.

Practice datasets

Reddit corpus: a sampling of top posts and comments from the r/starwars subreddit. The event of relevance for the Practice research task below is TBD.
Core Drama 1660: a corpus of professional and other plays (cut-off date 1660) intended for performance, as well as translations and closet drama. One event of relevance for the Practice research task below is the accession of Elizabeth I to the throne (17 November 1558).
Fanfic and Canon, 19th Century: a set of corpora with fanfiction from AO3 about canonical 19th century novels, along with the original novels.

Practice research task

For your chosen dataset:

What's the frequency of binary gendered pronouns?
Introduce a non-binary gendered pronoun set.
What's the frequency of the above non-binary gendered pronoun set?
What're the most common nouns associated with each of the above genders (non-binary and binary set)?
What're the most common adjectives associated with each of the above genders?
For the analysis task of your choice, what differences do you see before and after a specific, date-based event relevant for your corpus?
TBD
What're the most common adjectives associated with each of the above genders before and after the event described in the corpus's entry in Practice datasets above?
Produce a data visualization for steps 3-8 above.
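
As a rough point of comparison for the first three prompts and the before/after prompt, here is a minimal standard-library sketch that does not use the GATK at all. The pronoun sets, the function names, and the (year, text) document shape are all hypothetical and chosen only for illustration; note also that counting "they" this way cannot distinguish singular from plural use.

```python
import re
from collections import Counter

# Hypothetical pronoun sets for illustration only; these are not the
# toolkit's own definitions.
PRONOUN_SETS = {
    "feminine": {"she", "her", "hers", "herself"},
    "masculine": {"he", "him", "his", "himself"},
    "nonbinary": {"they", "them", "their", "theirs", "themself"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_frequencies(text):
    """Return each pronoun set's share of all tokens in the text."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        # Avoid dividing by zero on an empty document.
        return {name: 0.0 for name in PRONOUN_SETS}
    counts = Counter(tokens)
    return {
        name: sum(counts[p] for p in pronouns) / total
        for name, pronouns in PRONOUN_SETS.items()
    }


def split_by_event(documents, event_year):
    """Partition (year, text) pairs into before and on-or-after the event year."""
    before = [text for year, text in documents if year < event_year]
    after = [text for year, text in documents if year >= event_year]
    return before, after
```

Comparing notes from a hand-rolled count like this against the toolkit's output may help surface where the GATK's tokenization or pronoun definitions differ from a tester's expectations.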

fyang3 commented 3 years ago

Missing metadata and corpus creation guide: https://docs.google.com/document/d/130pWbn734Bx2ZS314BQoBVm8n4M915wbr-k62YX9D2k/edit?usp=sharing

kenalba commented 3 years ago

Another set of test corpora, this one (largely) fanfictional, "Fanfic and Canon, 19th Century."

https://drive.google.com/drive/folders/1fgCSnWZRmRgMpYZdJUxmtawpH2QDWCQG?usp=sharing

There are currently 7 corpora in this folder:

frankenstein_fanfic.zip, 463 fanfics from Archive of Our Own (AO3) tagged as "Frankenstein - Mary Shelley", downloaded Jan 2021.
emma_fanfic, 214 fanfics from AO3 tagged as "Emma - Jane Austen", downloaded Jan 2021.
emma_fanfic, 1614 fanfics from AO3 tagged as "Pride and Prejudice - Jane Austen", downloaded Jan 2021.
littlewomen_fanfic, 445 fanfics from AO3 tagged as "Little Women - Louisa May Alcott", downloaded April 2021.
dracula_fanfic, 719 fanfics from AO3 tagged as "Dracula - Bram Stoker", downloaded April 2021.
canon, which includes Emma, Pride and Prejudice, Frankenstein, Dracula, and Little Women, downloaded and sanitized from Gutenberg.
juno_fanfic, 1863 fanfics from AO3 tagged as "The Penumbra Podcast", a narrative podcast with many nonbinary characters.

All of them have appropriately formed metadata for the Gender Analysis Toolkit.

I've edited and added this link above.

erica02139 commented 3 years ago

Under "2. Practice Datasets," the event of relevance is the accession of Elizabeth I (17 Nov 1558: so likely, before/after 1559). Before and after the accession of Mary I (July 1553) would also be relevant.

Also, we've changed the early modern dataset: it's now "Core Drama 1660," which is linked here. I've updated the "Overview" above.

That's the one with which @fyang3 has been working, and for which she's updated metadata.

fyang3 commented 3 years ago

Testing Journal with observations and bugs: https://docs.google.com/document/d/1-Uz__nxeRn2OHjPd8Q6Gz3Z4cFOa6hA8wabczF0YKQ4/edit?usp=sharing

MBJean commented 3 years ago

First round of my notes:

Ideas

Why does importing the Corpus class seem to take an inordinate amount of time for Sam?
Could we create a .summary on Corpus? What would it print to the user?
In the API docs, Document doesn't have fields described.
Should help() on a particular class or method be more useful?
Document.word_count could use a remove_swords flag.
Document.get_wordcount_counter() could use a remove_swords flag.
Could we consider making Document.words_associated() and Document.get_word_windows() more sophisticated? These methods measure only the occurrence of a word immediately after, or within a window of, the target word in the tokenized text. As a result, we're definitely picking up associations across syntactic breaks. While these words undoubtedly have an association with the target word, they're 'less' associated than, say, words that are in an object relationship with the target word.
For many of the Document methods, could we pull those out to the Corpus class and organize the return by Document label? That way the interface would always be the Corpus.
Document.update_metadata() would require us to update any cached returns in the analysis modules.
Corpus.count_authors_by_gender() takes a string argument to represent gender. Is that standard throughout the analysis modules?
Corpus.get_wordcount_counter() could use a remove_swords flag.
Corpus.get_field_vals() could be supplanted by a .summary vel sim.
Corpus.get_sample_text_passages() returns a list of tuples of the shape Tuple[str(document filename), str(sentence)]. It might be more useful to return a dictionary with the document filename/label (filename minus extension) as the top-level keys. This data structure is very common throughout the analysis modules.
dunning.dunn_individual_word_by_corpus() throws ZeroDivisionError if the target word doesn't exist in a Corpus.
gender_frequency.get_count_words could probably be moved to Document and/or Corpus, and probably should return a Counter.
gender_frequency.run_gender_freq() fails without first creating a visualizations directory.
Any reason metadata_visualizations shouldn't be methods on the Corpus class?
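
To make the window concern in the words_associated() / get_word_windows() idea above concrete, here is a minimal sketch of a purely window-based counter with an optional sentence-boundary check. It is a standalone illustration, not the toolkit's implementation: the function name, the regex tokenizer, and the respect_sentences flag are all assumptions. Bounding the window by sentence is only a cheap halfway step; it does not capture the object-style relationships mentioned above, which would need a syntactic parser.

```python
import re
from collections import Counter


def window_associations(text, target, window=2, respect_sentences=False):
    """Count words appearing within `window` tokens of `target`.

    With respect_sentences=True, the window never crosses a sentence
    boundary (approximated here by '.', '!' and '?').
    """
    if respect_sentences:
        segments = re.split(r"[.!?]+", text)
    else:
        segments = [text]

    counts = Counter()
    for segment in segments:
        tokens = re.findall(r"[a-z']+", segment.lower())
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            # Tokens before the target plus tokens after it, excluding the target.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


text = "The storm was terrible. She smiled."
loose = window_associations(text, "she")
strict = window_associations(text, "she", respect_sentences=True)
```

In this example the unbounded window associates "terrible" with "she" even though the two words sit in different sentences, while the sentence-bounded version keeps only "smiled".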

Completed

[x] How about a .label field on Document?
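
For reference, the label-keyed shape suggested in the Corpus.get_sample_text_passages() idea above could look roughly like the following. This is a sketch under stated assumptions: the helper names are hypothetical, the input is assumed to be the list of (filename, sentence) tuples described in that idea, and the label is taken to be the filename minus its extension.

```python
from collections import defaultdict
from pathlib import Path


def label_from_filename(filename):
    """Derive a document label as the filename minus its extension."""
    return Path(filename).stem


def group_passages_by_label(passages):
    """Regroup (filename, sentence) tuples into {label: [sentences]}."""
    grouped = defaultdict(list)
    for filename, sentence in passages:
        grouped[label_from_filename(filename)].append(sentence)
    # Return a plain dict so a missing label raises KeyError instead of
    # silently creating an empty list.
    return dict(grouped)
```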

MBJean commented 3 years ago

Closing in favor of #170

dhmit / gender_analysis

Pre-release testing guide #156
