Closed — kenalba, 3 years ago
Merging #147 (b9a5a23) into master (9bafdee) will increase coverage by 5.23%. The diff coverage is 42.85%.
```diff
@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
+ Coverage   51.16%   56.40%   +5.23%
==========================================
  Files          12       12
  Lines        1675     1468     -207
  Branches      364      362       -2
==========================================
- Hits          857      828      -29
+ Misses        757      586     -171
+ Partials       61       54       -7
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| gender_analysis/analysis/dunning.py | 30.00% <0.00%> (ø) | |
| gender_analysis/analysis/dependency_parsing.py | 12.14% <6.25%> (-1.57%) | :arrow_down: |
| gender_analysis/analysis/instance_distance.py | 32.60% <8.69%> (-1.98%) | :arrow_down: |
| gender_analysis/analysis/gender_adjective.py | 45.27% <44.89%> (+14.89%) | :arrow_up: |
| gender_analysis/analysis/gender_frequency.py | 59.33% <57.46%> (+9.69%) | :arrow_up: |
| gender_analysis/document.py | 84.88% <100.00%> (+0.35%) | :arrow_up: |
| gender_analysis/gender.py | 100.00% <0.00%> (+2.63%) | :arrow_up: |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 9bafdee...b9a5a23.
This adds the most basic memoization we can use. There's a thornier problem here to tackle: the mismatch between our ad hoc tokenizer, which strips out punctuation and capitalization, and the NLTK tokenizer. The NLTK tokenizer is a good deal slower, but it's more accurate, and we need it whenever we do any kind of POS analysis. Ideally I'd like to move us over to a system where we use the NLTK tokenizer for the initial tokenization and store a punctuation-stripped, all-lowercase version of that for doing operations on. We might also think about memoizing the POS-tagged corpus.
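To illustrate, here's a minimal sketch of that scheme: memoize the full tokenization once per text, then derive the punctuation-stripped, lowercased view from it. Names here are hypothetical (not the actual `gender_analysis` API), and a regex tokenizer stands in for `nltk.word_tokenize` to keep the example self-contained.

```python
from functools import lru_cache
import re
import string

@lru_cache(maxsize=None)
def tokenize(text):
    # Stand-in for nltk.word_tokenize: words and punctuation as separate
    # tokens. Memoized so repeated analyses of the same text are free.
    return tuple(re.findall(r"\w+|[^\w\s]", text))

@lru_cache(maxsize=None)
def clean_tokens(text):
    # Punctuation-stripped, all-lowercase view derived from the full
    # tokenization, for analyses that don't need punctuation or case.
    return tuple(
        t.lower() for t in tokenize(text) if t not in string.punctuation
    )

text = "The Professor said, 'Hello!'"
tokenize(text)      # ('The', 'Professor', 'said', ',', "'", 'Hello', '!', "'")
clean_tokens(text)  # ('the', 'professor', 'said', 'hello')
```

The same idea works as a cached property on a `Document` object; the point is just that the slow tokenizer runs once and every derived view reuses its output.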
Anyway, this is fast and easy and it doesn't break anything.
Note that this is branched off of ExpandGenderSupport and should therefore be merged after it.