dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents

BSD 3-Clause "New" or "Revised" License

This PR is kind of a mess, so consider it more of a review request than an actual pull request, but I spent some time honing our testing corpora and the way we treat them.

Broadly speaking, I removed the test_data folder because every document in it was already in sample_novels. Instead, I've created three .csv files (large_test_corpus, small_test_corpus, and tiny_test_corpus) that select only the texts we want to test with from sample_novels. Thinking about test corpora as metadata-first rather than file-first is, I think, more flexible in the long run. The addition of tiny_test_corpus also gives us a four-document corpus on which to test our most computationally hungry functions, which is what motivated this shift in the first place.

This did mean adding an ignore_warnings flag to the corpus generator, since we now expect the generator not to load every text file in a directory.
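A minimal sketch of the metadata-first idea, to make the trade-off concrete. This is not the gender_analysis corpus generator: the function name `select_test_documents`, the `filename` column, and the exact meaning given to `ignore_warnings` here (silencing the warning about text files on disk that the CSV does not list) are all assumptions made for illustration.

```python
import csv
import warnings
from pathlib import Path


def select_test_documents(metadata_csv, texts_dir, ignore_warnings=False):
    """Return one dict per CSV row, with the path of its text file attached.

    Hypothetical helper, not part of gender_analysis. Assumes the CSV has a
    'filename' column naming a .txt file inside texts_dir.
    """
    texts_dir = Path(texts_dir)
    with open(metadata_csv, newline="", encoding="utf-8") as csv_file:
        rows = list(csv.DictReader(csv_file))

    listed = {row["filename"] for row in rows}
    on_disk = {path.name for path in texts_dir.glob("*.txt")}

    # A metadata-first corpus deliberately skips some files in the directory,
    # so the "file not in metadata" warning is only useful when unexpected.
    if not ignore_warnings:
        for name in sorted(on_disk - listed):
            warnings.warn(f"{name} is in {texts_dir} but not listed in {metadata_csv}")

    # A listed file that does not exist is always an error.
    missing = sorted(listed - on_disk)
    if missing:
        raise FileNotFoundError(f"listed in metadata but not found: {missing}")

    return [dict(row, path=texts_dir / row["filename"]) for row in rows]
```

Under these assumptions, pointing the same directory at a different CSV is all it takes to switch between a large, small, or tiny corpus, and a test that intentionally loads a subset passes `ignore_warnings=True`.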

After doing this, I looked at which functions take a long time to test and, where possible, rewrote their tests to work on more compact corpora. That cut the time it takes to run coverage on my machine in half.

Codecov Report

Merging #136 into master will decrease coverage by 1.01%. The diff coverage is 32.38%.

@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
- Coverage   44.54%   43.53%   -1.01%     
==========================================
  Files          12       12              
  Lines        1623     1656      +33     
  Branches      353      365      +12     
==========================================
- Hits          723      721       -2     
- Misses        847      882      +35     
  Partials       53       53

| Impacted Files | Coverage Δ | |
|---|---|---|
| gender_analysis/analysis/dunning.py | `30.29% <0.00%> (ø)` | |
| gender_analysis/document.py | `82.14% <ø> (ø)` | |
| gender_analysis/analysis/gender_adjective.py | `30.37% <19.23%> (-9.63%)` | :arrow_down: |
| gender_analysis/analysis/gender_frequency.py | `49.63% <50.00%> (ø)` | |
| gender_analysis/corpus.py | `66.66% <50.00%> (ø)` | |
| gender_analysis/gender.py | `96.15% <93.33%> (ø)` | |
| gender_analysis/analysis/instance_distance.py | `34.58% <100.00%> (ø)` | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5d9882f...233ed1b.

Tightened and more flexible testing #136