Closed kenalba closed 3 years ago
I've been hacking at this a bit today. Here's some usage code:
Setup:
>>> from corpus_analysis.testing import common
>>> from corpus_analysis import corpus
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> corpus = corpus.Corpus(common.TEST_CORPUS_PATH, csv_path=common.LARGE_TEST_CORPUS_CSV)
>>> analyzer = GenderFrequencyAnalyzer(corpus)
Organizing by document:
>>> analyzer.by_document()
{'aanrud_longfrock': {'Female': {'her': 528, 'hers': 3, 'she': 614, 'herself': 55}, 'Male': {'himself': 10, 'him': 55, ...
>>> analyzer.by_document(display='frequency')
{'aanrud_longfrock': {'Female': {'her': 0.015796559461480928, 'hers': 8.975317875841435e-05, ...
>>> analyzer.by_document(display='relative')
{'aanrud_longfrock': {'Female': {'her': 0.3369495851946394, 'hers': 0.0019144862795149964, ...
>>> analyzer.by_document(display='relative', labeled=True)
{'aanrud_longfrock': {'Female': {'subject': 0.3918, 'object': 0.3369}, 'Male': {'subject': 0.1346, ...
Organizing by gender:
>>> analyzer.by_gender()
{'Female': Counter({'her': 92697, 'she': 75967, 'herself': 5690, 'hers': 750}), 'Male': Counter({'he': 132732, ...
>>> analyzer.by_gender(display='relative')
{'Female': {'her': 0.1957739339795983, 'hers': 0.001583982766267503, 'she': 0.16044055840672453, ...
>>> analyzer.by_gender(display='relative', labeled=True)
{'Female': {'subject': 0.1604, 'object': 0.1957}, 'Male': {'subject': 0.2803, ...
Organizing by identifiers:
>>> analyzer.by_identifier()
{'her': 92697, 'hers': 750, 'she': 75967, 'herself': 5690, 'himself': 10329, 'him': 53495, 'he': 132732, 'his': 101830}
>>> analyzer.by_identifier(display='relative', labeled=True)
{'subject': 0.44076749244968216, 'object': 0.30875414475490504}
Organizing by arbitrary metadata:
>>> analyzer.by_metadata('author_gender', display='relative', labeled=True)
{'male': {'Female': {'subject': 0.1202, 'object': 0.1547}, 'Male': {'subject': 0.3109, ...
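For anyone checking the numbers: the 'relative' values above look consistent with each count divided by the total of all gendered identifiers, and the labeled values with summing the shares of she/he (subject) and her/him (object). Here is a minimal sketch of that arithmetic. The helper names and the LABELS mapping are my own assumptions inferred from the output, not the analyzer's actual implementation.

```python
# Counts copied from the by_identifier() output above.
counts = {
    'her': 92697, 'hers': 750, 'she': 75967, 'herself': 5690,
    'himself': 10329, 'him': 53495, 'he': 132732, 'his': 101830,
}

# Hypothetical label mapping, inferred from the numbers in this thread
# rather than taken from the library source.
LABELS = {'she': 'subject', 'he': 'subject', 'her': 'object', 'him': 'object'}


def relative(counts):
    """Divide each count by the total of all counts."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def labeled(shares):
    """Sum the shares of words that carry a label; unlabeled words are skipped."""
    out = {}
    for word, share in shares.items():
        label = LABELS.get(word)
        if label is not None:
            out[label] = out.get(label, 0.0) + share
    return out


rel = relative(counts)
lab = labeled(rel)
print(rel['her'], lab['subject'], lab['object'])
```

Under these assumptions, rel['her'] agrees with the by_gender(display='relative') value for 'her', and lab agrees with the by_identifier(display='relative', labeled=True) output.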
Some more hacking: I changed the optional kwargs and added an 'aggregate' option. Some more usage code:
Setup:
>>> from corpus_analysis.testing import common
>>> from corpus_analysis import corpus
>>> from gender_analysis.analysis.frequency import GenderFrequencyAnalyzer
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> corpus = corpus.Corpus(common.TEST_CORPUS_PATH, csv_path=common.LARGE_TEST_CORPUS_CSV)
>>> analyzer = GenderFrequencyAnalyzer(corpus)
Grouping by metadata:
>>> analyzer.by_metadata('author_gender', format_by='relative', group_by='aggregate')
{'male': {'Female': 0.28459318058812916, 'Male': 0.7154068194118709}, 'female': {'Female': 0.4606421117631641, 'Male': 0.539357888236836}, 'both': {'Female': 0.4454080498109851, 'Male': 0.5545919501890149}}
Visualizing:
>>> df = pd.DataFrame(_)
>>> df.plot(kind='pie', subplots=True)
>>> plt.show()
Output: (image of the resulting pie charts, one per author_gender group, not preserved)
This interface looks great to me. Let's see if we can move towards having the visualized output be part of the individual analyzer classes as well, rather than requiring the user to pipe the data around.
Adding some easy visualization functionality seems legit to me, though I think it warrants some thought. In general, I think there's room to improve the output of these functions. These (generally three-level nested) dictionaries are useful data structures, but they require transformation for the contexts we can imagine our users working in (DataFrames, visualizations, etc.). My plan is to create an issue on this topic and tackle it during the first of our summer sprints.
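To make the transformation concrete, here is a rough sketch of flattening a document/gender/identifier dictionary into one record per value. The function name and column names are hypothetical, and the sample data uses only the 'Female' counts shown for 'aanrud_longfrock' earlier in this thread, since the 'Male' entry is truncated there.

```python
def flatten(nested, names=('document', 'gender', 'identifier')):
    """Turn {a: {b: {c: value}}} into a list of flat records.

    The column names are illustrative; nothing here is part of the
    analyzer's API.
    """
    rows = []
    for outer, middle_map in nested.items():
        for middle, inner_map in middle_map.items():
            for inner, value in inner_map.items():
                rows.append({
                    names[0]: outer,
                    names[1]: middle,
                    names[2]: inner,
                    'value': value,
                })
    return rows


# Partial sample taken from the by_document() output in this thread.
sample = {
    'aanrud_longfrock': {
        'Female': {'her': 528, 'hers': 3, 'she': 614, 'herself': 55},
    },
}

rows = flatten(sample)
for row in rows:
    print(row)
```

A list of records like this can be handed straight to pandas.DataFrame(rows), which gives a long-format table that is easier to group, pivot, and plot than the nested dictionary.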
Closing in favor of #170.
Performing analysis on a corpus with gender_frequency is tricky. We get document-level results, which is great, but I had to use statistics.mean to hack together averages out of the dictionaries that get outputted. I would suggest making these aggregated, corpus-level results easy to access! This is a problem for pronoun_frequency and for corpus_subject_object_freq. We might also think about graphing these aggregated results, since this data pops pretty well!
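A minimal sketch of what corpus-level aggregation could look like, assuming the document-level results are nested dictionaries of the form {document: {gender: {word: count}}}. The function name and the toy counts are made up for illustration and are not taken from the library or the test corpus.

```python
from collections import Counter


def aggregate_by_gender(by_document):
    """Pool per-document word counts into one Counter per gender."""
    totals = {}
    for genders in by_document.values():
        for gender, word_counts in genders.items():
            totals.setdefault(gender, Counter()).update(word_counts)
    return totals


# Toy document-level counts (hypothetical, not real corpus data).
doc_level = {
    'doc_a': {'Female': {'she': 3, 'her': 2}, 'Male': {'he': 5}},
    'doc_b': {'Female': {'she': 1}, 'Male': {'he': 2, 'him': 4}},
}

corpus_level = aggregate_by_gender(doc_level)
print(corpus_level)
```

Note that pooling raw counts weights long documents more heavily than averaging per-document frequencies with statistics.mean does, so the two approaches will generally give different corpus-level numbers.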