Aggregator Functions for gender_frequency.py

kenalba commented 3 years ago

Performing analysis on a corpus with gender_frequency is tricky. We get document-level results, which is great, but I had to use statistics.mean to hack together averages from the output dictionaries. I would suggest making these aggregated, corpus-level results easy to access! The same problem affects pronoun_frequency and corpus_subject_object_freq.

We might also think about graphing these aggregated results, since this data pops pretty well!
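A minimal sketch of the kind of corpus-level aggregation requested above. The sample data and the helper name `aggregate_by_gender` are hypothetical; the only assumption is that the document-level results are nested `{document: {gender: {identifier: count}}}` dictionaries. This is not the library's implementation.

```python
from collections import Counter

# Hypothetical document-level results (made-up counts), shaped as
# {document: {gender: {identifier: count}}}.
by_document = {
    "doc_a": {"Female": {"she": 4, "her": 6}, "Male": {"he": 10, "his": 5}},
    "doc_b": {"Female": {"she": 1, "her": 2}, "Male": {"he": 3, "him": 2}},
}


def aggregate_by_gender(results):
    """Sum identifier counts across all documents, keyed by gender."""
    totals = {}
    for genders in results.values():
        for gender, counts in genders.items():
            # Counter.update adds counts instead of overwriting them.
            totals.setdefault(gender, Counter()).update(counts)
    return totals


corpus_totals = aggregate_by_gender(by_document)
print(corpus_totals)
```

Summing raw counts before normalizing weights each document by its length, whereas averaging per-document ratios with statistics.mean weights every document equally, so the two approaches can give different corpus-level numbers.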

MBJean commented 3 years ago

I've been hacking at this a bit today. Here's some usage code:

Setup:

>>> from corpus_analysis.testing import common
>>> from corpus_analysis import corpus
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> corpus = corpus.Corpus(common.TEST_CORPUS_PATH, csv_path=common.LARGE_TEST_CORPUS_CSV)
>>> analyzer = GenderFrequencyAnalyzer(corpus)

Organizing by document:

>>> analyzer.by_document()
{'aanrud_longfrock': {'Female': {'her': 528, 'hers': 3, 'she': 614, 'herself': 55}, 'Male': {'himself': 10, 'him': 55, ...

>>> analyzer.by_document(display='frequency')
{'aanrud_longfrock': {'Female': {'her': 0.015796559461480928, 'hers': 8.975317875841435e-05, ...

>>> analyzer.by_document(display='relative')
{'aanrud_longfrock': {'Female': {'her': 0.3369495851946394, 'hers': 0.0019144862795149964, ...

>>> analyzer.by_document(display='relative', labeled=True)
{'aanrud_longfrock': {'Female': {'subject': 0.3918, 'object': 0.3369}, 'Male': {'subject': 0.1346, ...

Organizing by gender:

>>> analyzer.by_gender()
{'Female': Counter({'her': 92697, 'she': 75967, 'herself': 5690, 'hers': 750}), 'Male': Counter({'he': 132732, ...

>>> analyzer.by_gender(display='relative')
{'Female': {'her': 0.1957739339795983, 'hers': 0.001583982766267503, 'she': 0.16044055840672453, ...

>>> analyzer.by_gender(display='relative', labeled=True)
{'Female': {'subject': 0.1604, 'object': 0.1957}, 'Male': {'subject': 0.2803, ...

Organizing by identifiers:

>>> analyzer.by_identifier()
{'her': 92697, 'hers': 750, 'she': 75967, 'herself': 5690, 'himself': 10329, 'him': 53495, 'he': 132732, 'his': 101830}

>>> analyzer.by_identifier(display='relative', labeled=True)
{'subject': 0.44076749244968216, 'object': 0.30875414475490504}
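For readers wondering what 'relative' and labeled=True mean here, the figures above can be reproduced from the raw by_identifier() counts. This is an inference from the printed numbers, not a reading of the library code: 'relative' appears to divide each count by the total of all counted identifiers, and the labels appear to map 'she'/'he' to subject and 'her'/'him' to object. The `LABELS` mapping below is that assumption.

```python
# Corpus-level counts copied from the by_identifier() output above.
counts = {
    "her": 92697, "hers": 750, "she": 75967, "herself": 5690,
    "himself": 10329, "him": 53495, "he": 132732, "his": 101830,
}

# Assumption: label membership is inferred from the printed results.
LABELS = {"subject": ("she", "he"), "object": ("her", "him")}

total = sum(counts.values())

# Share of each identifier among all counted identifiers.
relative = {identifier: n / total for identifier, n in counts.items()}

# Share of each label among all counted identifiers.
labeled = {
    label: sum(counts[i] for i in identifiers) / total
    for label, identifiers in LABELS.items()
}

print(relative["her"])
print(labeled)
```

Under these assumptions the labeled shares do not sum to 1, because 'hers', 'herself', 'himself', and 'his' carry neither label.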

Organizing by arbitrary metadata:

>>> analyzer.by_metadata('author_gender', display='relative', labeled=True)
{'male': {'Female': {'subject': 0.1202, 'object': 0.1547}, 'Male': {'subject': 0.3109, ...

MBJean commented 3 years ago

After some more hacking, I changed the optional kwargs (now format_by and group_by) and added an 'aggregate' option. Some more usage code:

Setup:

>>> from corpus_analysis.testing import common
>>> from corpus_analysis import corpus
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> corpus = corpus.Corpus(common.TEST_CORPUS_PATH, csv_path=common.LARGE_TEST_CORPUS_CSV)
>>> analyzer = GenderFrequencyAnalyzer(corpus)

Grouping by metadata:

>>> analyzer.by_metadata('author_gender', format_by='relative', group_by='aggregate')
{'male': {'Female': 0.28459318058812916, 'Male': 0.7154068194118709}, 'female': {'Female': 0.4606421117631641, 'Male': 0.539357888236836}, 'both': {'Female': 0.4454080498109851, 'Male': 0.5545919501890149}}

Visualizing:

>>> df = pd.DataFrame(_)
>>> df.plot(kind='pie', subplots=True)
>>> plt.show()

Output: (pie chart image not preserved)

ryaanahmed commented 3 years ago

This interface looks great to me. Let's see if we can move towards having the visualized output be part of the individual analyzer classes as well, rather than requiring the user to pipe the data around.

MBJean commented 3 years ago

Adding some easy visualization functionality seems legit to me, though I think it warrants some thought. In general, I think there's room to improve the output of these functions. These dictionaries (generally nested three levels deep) are useful data structures, but they require transformation for the contexts we can imagine our users working in (DataFrames, visualizations, etc.). My plan is to create an issue on this topic and tackle it during the first of our summer sprints.
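One possible shape for that transformation, as a rough sketch: flatten the nested dictionary into one record per (document, gender, identifier). The helper name `flatten_results` is hypothetical and not part of the library; the sample uses only the complete 'Female' entry from the by_document() output earlier in the thread.

```python
def flatten_results(results):
    """Turn {document: {gender: {identifier: count}}} into flat records."""
    rows = []
    for document, genders in results.items():
        for gender, identifiers in genders.items():
            for identifier, count in identifiers.items():
                rows.append({
                    "document": document,
                    "gender": gender,
                    "identifier": identifier,
                    "count": count,
                })
    return rows


# Partial sample taken from the by_document() output shown earlier.
sample = {
    "aanrud_longfrock": {
        "Female": {"her": 528, "hers": 3, "she": 614, "herself": 55},
    },
}

rows = flatten_results(sample)
for row in rows:
    print(row)
```

A list of records like this can be passed straight to pd.DataFrame(rows), which yields one column per key and makes grouping and plotting simpler than working with the nested form.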

MBJean commented 3 years ago

Closing in favor of #170.

dhmit / gender_analysis

Aggregator Functions for gender_frequency.py #165