Overview

This PR introduces a new analysis module, proximity, that is intended to replace the existing gender_adjective module. The new module reproduces all of the functionality of gender_adjective, with the following changes:

Introduces a new class- and method-based interface for performing the various analyses in the module.
Allows the user to input any NLTK tags rather than forcing the use of adjective tags "JJ", "JJR", "JJS".
Fixes a few bugs found in the existing implementation.
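
To illustrate what accepting arbitrary tags means in practice, here is a minimal sketch of tag-based filtering. It is not the code in proximity.py: words_with_tags and the hand-tagged sentence are hypothetical, and the tokens are tagged by hand with Penn Treebank tags so the example runs without NLTK installed.

```python
# Illustrative sketch only; this is not the implementation in proximity.py.
# Tokens are hand-tagged with Penn Treebank tags, so no NLTK install or data is needed.

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # the tags gender_adjective was limited to


def words_with_tags(tagged_tokens, tags):
    """Return the lowercased words whose part-of-speech tag is in `tags`."""
    wanted = set(tags)
    return [word.lower() for word, tag in tagged_tokens if tag in wanted]


# Hypothetical hand-tagged sentence: (word, tag) pairs.
tagged = [
    ("She", "PRP"), ("walked", "VBD"), ("quickly", "RB"),
    ("through", "IN"), ("the", "DT"), ("old", "JJ"), ("garden", "NN"),
]

# The old behaviour: adjectives only.
adjectives = words_with_tags(tagged, ADJECTIVE_TAGS)

# The generalisation: any set of tags the caller chooses.
verbs_and_adverbs = words_with_tags(tagged, ["VBD", "RB"])
```

The only point here is that the adjective tags become one possible argument rather than a hard-coded constant.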

Notes

I've left the gender_analysis/analysis/gender_adjective.py module in place. This is temporary, so that those of us who start using the new module can compare the results of the two and confirm that all functionality is accounted for. I'll remove the gender_adjective module once the results match.

I've introduced a few things in addition to the new module to assist the user.

The user can call the class method GenderProximityAnalyzer.list_nltk_tags() to retrieve a human-readable list of possible NLTK tags.
The user can import find_in_document_gender, find_in_document_male, or find_in_document_female from proximity, all of which accept a single Document instance, if they don't need the full analytical capabilities of the GenderProximityAnalyzer class.
I've introduced the label attribute on Document to provide a cleaner, more readable interface for the user (e.g., they can now traverse the results of some of the GenderProximityAnalyzer instance methods by using .get(document.label)).
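
As a rough sketch of the .get(document.label) pattern, the snippet below uses a stand-in class and a made-up result dictionary. FakeDocument, the labels, the counts, and the nested result shape are all assumptions for illustration; the real Document class and the actual return values of the analyzer methods may differ.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for Document; only the `label` attribute is modelled."""
    label: str


# Assumed result shape: {document label: {gender: {word: count}}}.
# Labels and counts are invented for this example.
results = {
    "doc_a": {"Female": {"little": 4}, "Male": {"own": 7}},
    "doc_b": {"Female": {"long": 2}, "Male": {"first": 3}},
}

document = FakeDocument(label="doc_a")
unknown = FakeDocument(label="doc_z")

# Look up one document's results by its label.
per_document = results.get(document.label)

# .get() with a default avoids a KeyError for documents that were not analysed.
absent = results.get(unknown.label, {})
```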

Usage code

Given the following setup:

corpus = Corpus(TEST_CORPUS_PATH, csv_path=SMALL_TEST_CORPUS_CSV)
proximity_analyzer = GenderProximityAnalyzer(corpus)

>>> proximity_analyzer.by_metadata('author_gender')
{'male': {'Female': {'eighth': 1, 'doubtless': 1, 'long': 36, 'little': 40, 'hoel': 1, 'disguised': 1, 'first': 30, ...

>>> proximity_analyzer.by_metadata('author_gender', sort=True)
{'male': {'Female': [('little', 40), ('long', 36), ('first', 30), ('own', 22), ('last', 21), ('good', 20), ...

>>> proximity_analyzer.by_metadata('author_gender', sort=True, limit=3)
{'male': {'Female': [('little', 40), ('long', 36), ('first', 30)], 'Male': [('own', 57), ('first', 41), ...

>>> proximity_analyzer.by_metadata('author_gender', sort=True, limit=3, remove_swords=True)
{'male': {'Female': [('little', 40), ('long', 36), ('first', 30)], 'Male': [('first', 41), ...

>>> proximity_analyzer.by_metadata('author_gender', diff=True, sort=True)
{'male': {'Female': [('long', 22), ('little', 19), ('lisbeth', 18), ('big', 14), ('last', 8), ...
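
The effect of the sort, limit, remove_swords, and diff options can be approximated with collections.Counter. This is only one plausible reading of the output above, not the module's implementation: summarize, difference, STOPWORDS, and the counts are hypothetical, and treating diff as a subtraction of the other group's counts is an assumption.

```python
from collections import Counter

# Made-up counts, not taken from the test corpus.
female = Counter({"little": 5, "long": 4, "own": 3, "kind": 1})
male = Counter({"own": 6, "little": 2, "first": 2})

STOPWORDS = {"own"}  # hypothetical stand-in for a real stopword list


def summarize(counts, sort=False, limit=None, remove_stopwords=False):
    """Mimic the assumed effect of the sort, limit and remove_swords options."""
    if remove_stopwords:
        counts = Counter({w: c for w, c in counts.items() if w not in STOPWORDS})
    if not sort:
        return dict(counts)
    # most_common(None) returns every entry, ordered by descending count.
    return counts.most_common(limit)


def difference(counts, other):
    """Assumed reading of diff=True: subtract the other group's counts, keeping negatives."""
    result = Counter(counts)
    result.subtract(other)
    return result


top_three = summarize(female, sort=True, limit=3)
no_stopwords = summarize(female, sort=True, limit=3, remove_stopwords=True)
female_minus_male = summarize(difference(female, male), sort=True)
```

In this sketch, sorting turns each word-count mapping into a list of (word, count) tuples, which matches the change in shape between the first and second calls above.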

Codecov Report

Merging #159 (2032371) into master (6850583) will increase coverage by 2.83%. The diff coverage is 78.47%.

@@            Coverage Diff             @@
##           master     #159      +/-   ##
==========================================
+ Coverage   57.59%   60.42%   +2.83%     
==========================================
  Files          11       12       +1     
  Lines        1408     1630     +222     
  Branches      355      434      +79     
==========================================
+ Hits          811      985     +174     
- Misses        543      573      +30     
- Partials       54       72      +18

Impacted Files	Coverage Δ
corpus_analysis/document.py	84.48% <50.00%> (-0.41%)
gender_analysis/analysis/proximity.py	78.63% <78.63%> (ø)
gender_analysis/analysis/__init__.py	100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6850583...2032371.

dhmit / gender_analysis

Introduces gender_analysis/analysis/proximity.py #159
