Closed: kenalba closed this pull request 3 years ago
Hold off on merging this, I suppose - this code passes when I run it on my machine, and I'm not sure what the problem is.
The larger issue is that adding the gutenberg_stripper appears to double the time it takes to run our tests, which might mean this is a bad solution. I'll spend some time on this over the weekend.
cf.
>>> from gender_analysis import document
>>> from pathlib import Path
>>> from gender_analysis import common
>>> document_metadata = {'author': 'Austen, Jane', 'title': 'Persuasion', 'date': '1818', 'filename': 'austen_persuasion.txt', 'filepath': Path(common.TEST_DATA_PATH, 'sample_novels', 'texts', 'austen_persuasion.txt')}
>>> austen2 = document.Document(document_metadata)
>>> type(austen2.text)
<class 'str'>
>>> len(austen2.text)
475233
I closed this PR because I think there are too many open questions, between the intermittent test failures and some licensing questions. I'll post more in #112.
Added functionality to strip Project Gutenberg headers and footers from any text with "project gutenberg" in its body. Texts without those headers and footers are left unchanged, so it should be okay that the conditional is hand-wavy. Also updated all tests to account for the changed word frequencies.
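For anyone following along, here is a rough sketch of the kind of conditional described above. It is not the code in this PR: the function name `strip_gutenberg_boilerplate` and the two marker patterns are hypothetical, and it assumes the text uses the common `*** START OF ... PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines, which not every Gutenberg file does.

```python
import re

# Assumed marker lines; actual Gutenberg files vary in wording, so these
# patterns are illustrative rather than exhaustive.
_START_MARKER = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*$",
    re.IGNORECASE | re.MULTILINE,
)
_END_MARKER = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Return text without Gutenberg header/footer (hypothetical helper)."""
    # Cheap pre-check: texts that never mention Project Gutenberg are untouched.
    if "project gutenberg" not in text.lower():
        return text

    start = _START_MARKER.search(text)
    end = _END_MARKER.search(text)

    # If neither marker line is present, leave the text exactly as it was.
    if start is None and end is None:
        return text

    body_start = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    body_end = end.start() if end and end.start() >= body_start else len(text)
    return text[body_start:body_end].strip()
```

Because the pre-check and the marker search each scan the full text, calling something like this on every document load adds a cost that scales with corpus size, which may be related to the slower test run mentioned earlier in the thread.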