dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Introduces gender_analysis/analysis/proximity.py #159

Closed: MBJean closed 3 years ago

MBJean commented 3 years ago

Overview

This PR introduces a new analysis module, proximity, that replaces the existing gender_adjective module. The new module reproduces the functionality of gender_adjective, with the following changes:

Notes

I've left the gender_analysis/analysis/gender_adjective.py module in place. This is temporary, so that those of us who start using the new module can compare results and confirm that all functionality is accounted for. I'll remove the gender_adjective module once everything checks out.

I've introduced a few things in addition to the new module to assist the user.

Usage code

Given the following setup:

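# Import path follows the module this PR introduces; Corpus and the
# test-corpus constants are assumed to be imported already.
from gender_analysis.analysis.proximity import GenderProximityAnalyzer

corpus = Corpus(TEST_CORPUS_PATH, csv_path=SMALL_TEST_CORPUS_CSV)
proximity_analyzer = GenderProximityAnalyzer(corpus)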
>>> proximity_analyzer.by_metadata('author_gender')
{'male': {'Female': {'eighth': 1, 'doubtless': 1, 'long': 36, 'little': 40, 'hoel': 1, 'disguised': 1, 'first': 30, ...

>>> proximity_analyzer.by_metadata('author_gender', sort=True)
{'male': {'Female': [('little', 40), ('long', 36), ('first', 30), ('own', 22), ('last', 21), ('good', 20), ...

>>> proximity_analyzer.by_metadata('author_gender', sort=True, limit=3)
{'male': {'Female': [('little', 40), ('long', 36), ('first', 30)], 'Male': [('own', 57), ('first', 41), ...

>>> proximity_analyzer.by_metadata('author_gender', sort=True, limit=3, remove_swords=True)
{'male': {'Female': [('little', 40), ('long', 36), ('first', 30)], 'Male': [('first', 41), ...

>>> proximity_analyzer.by_metadata('author_gender', diff=True, sort=True)
{'male': {'Female': [('long', 22), ('little', 19), ('lisbeth', 18), ('big', 14), ('last', 8), ...
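
To make the sort, limit, and diff options easier to reason about, here is a minimal standalone sketch of how they could transform a {gender: {word: count}} mapping. It does not use the proximity module and is not its implementation: the helper names sort_counts and diff_counts are hypothetical, the diff rule (a word's count minus its count under the other gender labels) is an assumption inferred from the outputs above, and the Male counts for 'little' and 'long' are made up for illustration.

```python
from collections import Counter

# Sample data in the same shape as the unsorted output above.
# Female counts and the Male counts for 'own' and 'first' come from the
# outputs shown; the Male counts for 'little' and 'long' are illustrative.
counts = {
    "Female": {"little": 40, "long": 36, "first": 30, "own": 22},
    "Male": {"own": 57, "first": 41, "little": 21, "long": 14},
}


def sort_counts(word_counts, limit=None):
    """Return (word, count) pairs, highest count first, optionally truncated."""
    return Counter(word_counts).most_common(limit)


def diff_counts(by_gender):
    """Assumed diff rule: each count minus the summed counts under other labels."""
    result = {}
    for label, word_counts in by_gender.items():
        others = Counter()
        for other_label, other_counts in by_gender.items():
            if other_label != label:
                others.update(other_counts)
        # Counter returns 0 for words the other labels never used.
        result[label] = {word: n - others[word] for word, n in word_counts.items()}
    return result


top_female = sort_counts(counts["Female"], limit=3)
top_female_diff = sort_counts(diff_counts(counts)["Female"], limit=2)
print(top_female)
print(top_female_diff)
```

With this sample, the sorted and limited Female list matches the sort=True, limit=3 output above, and the diffed list starts with 'long' and 'little', as in the diff=True output. Whether the real module keeps negative differences is not stated in this PR; the sketch keeps them.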
codecov-commenter commented 3 years ago

Codecov Report

Merging #159 (2032371) into master (6850583) will increase coverage by 2.83%. The diff coverage is 78.47%.


@@            Coverage Diff             @@
##           master     #159      +/-   ##
==========================================
+ Coverage   57.59%   60.42%   +2.83%     
==========================================
  Files          11       12       +1     
  Lines        1408     1630     +222     
  Branches      355      434      +79     
==========================================
+ Hits          811      985     +174     
- Misses        543      573      +30     
- Partials       54       72      +18     
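
The percentages in the diff above can be checked by hand from the Hits and Lines rows. The short sketch below does that arithmetic; the helper name coverage_pct is hypothetical, and the choice to truncate (rather than round) to two decimals is an inference from the reported figures, not something taken from Codecov's documentation.

```python
def coverage_pct(hits, lines):
    """Coverage as a percentage, truncated to two decimals (assumed behavior)."""
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (hits * 10000 // lines) / 100


# Figures copied from the coverage diff above.
master = {"hits": 811, "misses": 543, "partials": 54, "lines": 1408}
pr = {"hits": 985, "misses": 573, "partials": 72, "lines": 1630}

# Hits, misses, and partials should account for every line.
assert master["hits"] + master["misses"] + master["partials"] == master["lines"]
assert pr["hits"] + pr["misses"] + pr["partials"] == pr["lines"]

master_cov = coverage_pct(master["hits"], master["lines"])
pr_cov = coverage_pct(pr["hits"], pr["lines"])
delta = round(pr_cov - master_cov, 2)
print(master_cov, pr_cov, delta)
```

Under these assumptions the sketch reproduces the reported 57.59%, 60.42%, and +2.83%.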
Impacted Files                          Coverage Δ
corpus_analysis/document.py             84.48% <50.00%> (-0.41%)
gender_analysis/analysis/proximity.py   78.63% <78.63%> (ø)
gender_analysis/analysis/__init__.py    100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6850583...2032371.