dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Adding in gutenberg_stripper #127

Closed kenalba closed 3 years ago

kenalba commented 3 years ago

Added functionality to strip Project Gutenberg headers and footers from any text with "project gutenberg" in its body. Texts without those headers and footers are left untouched, so it should be okay that the conditional is handwavey. Also updated all tests to account for the changed frequencies.
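
For illustration, here is a minimal sketch of the kind of stripping described above. It is not the code from this PR: the function name `strip_gutenberg_boilerplate` and the two marker patterns are assumptions, and real Project Gutenberg files vary in how their start and end markers are worded.

```python
import re

# Assumed marker patterns; actual Project Gutenberg files use several variants.
_START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Hypothetical helper: drop the Gutenberg header and footer if present."""
    # Loose guard, mirroring the conditional described in the comment above:
    # texts that never mention Project Gutenberg are returned unchanged.
    if "project gutenberg" not in text.lower():
        return text

    # Keep only what follows the start-marker line, if one is found.
    start = _START_RE.search(text)
    if start:
        text = text[start.end():]

    # Keep only what precedes the end-marker line, if one is found.
    end = _END_RE.search(text)
    if end:
        text = text[:end.start()]

    # Trim the blank lines left around the body.
    return text.strip()
```

A text that mentions Project Gutenberg but has neither marker line only has its surrounding whitespace trimmed, which is one reason a loose guard like this may be acceptable.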

kenalba commented 3 years ago

Hold off on merging this, I suppose. The tests pass when I run them on my machine, so I'm not sure what the problem is.

The larger issue is that adding the gutenberg_stripper appears to double the time it takes to run our tests, which might mean this solution is a bad one. I'll spend some time on this over the weekend.

cf.:

>>> from gender_analysis import document
>>> from pathlib import Path
>>> from gender_analysis import common
>>> document_metadata = {
...     'author': 'Austen, Jane',
...     'title': 'Persuasion',
...     'date': '1818',
...     'filename': 'austen_persuasion.txt',
...     'filepath': Path(common.TEST_DATA_PATH, 'sample_novels', 'texts', 'austen_persuasion.txt'),
... }
>>> austen2 = document.Document(document_metadata)
>>> type(austen2.text)
<class 'str'>
>>> len(austen2.text)
475233

kenalba commented 3 years ago

I closed this PR because I think there are too many open questions, between the intermittent test failures and some licensing questions. I'll post more in #112.