dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Tightened and more flexible testing #136

Closed: kenalba closed this 4 years ago

kenalba commented 4 years ago

This PR is kind of a mess, so consider it more of a review request than an actual pull request. I spent some time honing our test corpora and the way we handle them.

Broadly speaking, I removed the test_data folder because every document in it was already in sample_novels. In its place, I've created three .csv files (large_test_corpus, small_test_corpus, and tiny_test_corpus), each of which selects only the texts we want to test with from sample_novels. Treating test corpora as metadata-first rather than file-first is, I think, more flexible in the long run. The new tiny_test_corpus also gives us a 4-document corpus for testing our most computationally hungry functions, which is what motivated this shift in the first place.

This did mean adding an ignore_warnings flag to the corpus generator, since we now expect that the generator won't load every text file in a directory.
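To make the metadata-first idea concrete, here is a minimal sketch rather than the code in this PR. It assumes each corpus .csv has a `filename` column and that texts are `.txt` files; `load_selected_texts` is a hypothetical name, and the `ignore_warnings` parameter only mirrors the flag described above, not its actual implementation.

```python
import csv
import warnings
from pathlib import Path


def load_selected_texts(metadata_csv, text_dir, ignore_warnings=False):
    """Load only the texts named in the metadata CSV (hypothetical helper).

    Assumes the CSV has a 'filename' column. Returns {filename: text}.
    """
    text_dir = Path(text_dir)

    # The metadata file, not the directory listing, decides what gets loaded.
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        wanted = [row["filename"] for row in csv.DictReader(f)]

    documents = {}
    for name in wanted:
        documents[name] = (text_dir / name).read_text(encoding="utf-8")

    # Files on disk that the metadata does not mention are skipped on purpose.
    # Warn about them unless the caller says this is expected.
    skipped = sorted(
        p.name for p in text_dir.glob("*.txt") if p.name not in documents
    )
    if skipped and not ignore_warnings:
        warnings.warn(
            f"{len(skipped)} text file(s) in {text_dir} are not listed in "
            f"{metadata_csv} and were not loaded"
        )
    return documents
```

With a layout like this, a small test corpus is just a shorter CSV pointing at the same directory, and tests that deliberately load a subset would pass `ignore_warnings=True`.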

After doing this, I looked at which functions take a long time to test and, where possible, rewrote them to work on more compact corpora. That cut the time it takes to run coverage on my machine in half.
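One rough way to find the slow functions, sketched here as an illustration and not as the method used for this PR, is to time each candidate on the same input and sort the results. `rank_by_runtime` is a hypothetical helper built only on the standard library.

```python
import time


def rank_by_runtime(funcs, *args):
    """Call each function once with the same arguments and
    return (name, seconds) pairs sorted slowest first."""
    timings = []
    for func in funcs:
        start = time.perf_counter()
        func(*args)
        timings.append((func.__name__, time.perf_counter() - start))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)
```

Functions that land at the top of the list are the natural candidates for a smaller test corpus.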

codecov-io commented 4 years ago

Codecov Report

Merging #136 into master will decrease coverage by 1.00%. The diff coverage is 32.38%.


@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
- Coverage   44.54%   43.53%   -1.01%     
==========================================
  Files          12       12              
  Lines        1623     1656      +33     
  Branches      353      365      +12     
==========================================
- Hits          723      721       -2     
- Misses        847      882      +35     
  Partials       53       53              
| Impacted Files | Coverage Δ |
|---|---|
| gender_analysis/analysis/dunning.py | 30.29% <0.00%> (ø) |
| gender_analysis/document.py | 82.14% <ø> (ø) |
| gender_analysis/analysis/gender_adjective.py | 30.37% <19.23%> (-9.63%) :arrow_down: |
| gender_analysis/analysis/gender_frequency.py | 49.63% <50.00%> (ø) |
| gender_analysis/corpus.py | 66.66% <50.00%> (ø) |
| gender_analysis/gender.py | 96.15% <93.33%> (ø) |
| gender_analysis/analysis/instance_distance.py | 34.58% <100.00%> (ø) |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5d9882f...233ed1b.