dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Project Gutenberg headers and footers remain in our test corpus #109

Status: Closed (kenalba closed 4 years ago)

kenalba commented 4 years ago

Right now, the quickstart guide (and a lot of our initial use cases) uses texts taken from Project Gutenberg. To distribute these texts, we legally have to keep the headers and footers on the files.

However, we should strip the headers and footers out before performing any actual analysis, perhaps when loading Documents. There's a function in gutenberg_loader that does this, but that file isn't in the master branch at the moment.
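
For illustration, here is a rough sketch of what stripping at load time could look like. It is not the function from gutenberg_loader; the name `strip_gutenberg_boilerplate` is hypothetical, and it assumes the files use the common `*** START OF THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THIS PROJECT GUTENBERG EBOOK ... ***` marker lines (some files say `THE` instead of `THIS`). Older Gutenberg files with other header formats would need extra handling.

```python
import re

# Assumed marker format: "*** START OF THIS PROJECT GUTENBERG EBOOK <TITLE> ***"
# and the matching "*** END OF ..." line. "THE" is accepted in place of "THIS".
_START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the text between the START and END marker lines.

    Hypothetical helper: if a marker is missing, that side of the text
    is left untouched, so files without markers pass through unchanged
    apart from surrounding whitespace.
    """
    start = _START_RE.search(text)
    body_start = start.end() if start else 0

    # Only look for the END marker after the START marker.
    end = _END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)

    return text[body_start:body_end].strip()
```

Something like this could run on the raw file contents when a Document is loaded, so the files on disk keep their headers and footers for distribution while the analysis only sees the body text.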