Build text cleaner for document initialization

kenalba commented 3 years ago

We need a function, preferably in common, that handles a handful of cleaning tasks on the text that becomes Document.text. This issue contains and supersedes issues #108 and #109.

This function should probably be called between _load_document_text() and the document.text assignment.
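
A minimal sketch of what such a cleaner could look like, assuming it is a plain function that runs a sequence of small cleaning steps over the loaded string. The names `clean_text` and `normalize_quotes` are hypothetical and only illustrate the shape; they are not existing functions in the repo.

```python
from typing import Callable, Iterable

# Map curly ("smart") quotes to their straight ASCII equivalents.
_QUOTE_TABLE = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def normalize_quotes(text: str) -> str:
    """Hypothetical cleaning step: replace smart quotes with straight ones."""
    return text.translate(_QUOTE_TABLE)


def clean_text(
    text: str,
    steps: Iterable[Callable[[str], str]] = (normalize_quotes,),
) -> str:
    """Hypothetical cleaner: apply each cleaning step in order."""
    for step in steps:
        text = step(text)
    return text


raw = "\u201cHi,\u201d she said. It\u2019s fine."
cleaned = clean_text(raw)
```

Under this assumption, the loader would pass the result of _load_document_text() through `clean_text` before assigning it to document.text, and further steps (such as a header and footer stripper) could be appended to `steps`.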

kenalba commented 3 years ago

@samimak37, who rules, has fixed issue #108, which leaves just #109 - the Gutenberg headers and footers.

We have 3 options here:

1) Write our own header+footer stripper. This seems like it'll take more time than we want it to, given that Gutenberg isn't the main thing we're working with here.
2) Premade solution 1: https://github.com/kiasar/gutenberg_cleaner . Simple, lightweight package we can include that does one thing - removes headers and footers. This seems best for Alpha.
3) Premade solution 2: https://github.com/c-w/gutenberg . This is what gutenberg_loader uses, which is nice. On the other hand, it requires BSD-DB, which isn't in the Python standard library for Python 3 and up.
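
For a sense of what option 1 involves, here is a rough sketch of a hand-rolled stripper. It assumes each file carries the common `*** START OF THIS PROJECT GUTENBERG EBOOK ...` and `*** END OF THIS PROJECT GUTENBERG EBOOK ...` marker lines; the function name is hypothetical, and this is not how either premade package works. Marker wording varies between files, and some boilerplate (transcriber notes, for example) sits inside the markers, which is a large part of why a premade solution looks more attractive.

```python
import re

# Marker lines as they commonly appear in Gutenberg plain-text files.
# Assumption: wording is "THE" or "THIS"; other variants are not handled.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Hypothetical helper: keep only the text between the two markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


sample = (
    "Header and license preamble\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EMMA ***\n"
    "\n"
    "Chapter 1\n"
    "Text.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EMMA ***\n"
    "Full license text\n"
)
body = strip_gutenberg_boilerplate(sample)
```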

I've added a branch that implements option 2 (https://github.com/dhmit/gender_analysis/tree/gutenberg_stripper), but it breaks some of our Coverage tests. Those test values can't really be updated until smartquote_handler and this update are merged.

kenalba commented 3 years ago

So yes! I'm looking in particular at section 1.C. of the Gutenberg license, which states:

1.C.  The Project Gutenberg Literary Archive Foundation ("the Foundation"
or PGLAF), owns a compilation copyright in the collection of Project
Gutenberg-tm electronic works.  Nearly all the individual works in the
collection are in the public domain in the United States.  **If an
individual work is in the public domain in the United States and you are
located in the United States, we do not claim a right to prevent you from
copying, distributing, performing, displaying or creating derivative
works based on the work as long as all references to Project Gutenberg
are removed.**  Of course, we hope that you will support the Project
Gutenberg-tm mission of promoting free access to electronic works by
freely sharing Project Gutenberg-tm works in compliance with the terms of
this agreement for keeping the Project Gutenberg-tm name associated with
the work.  You can easily comply with the terms of this agreement by
keeping this work in the same format with its attached full Project
Gutenberg-tm License when you share it without charge with others.

All of the texts in our sample corpora are in the public domain and we're located in the United States. I'd like to credit Project Gutenberg in the docs somewhere, but from a performance perspective there's a lot to be said for running this cleaner once on our corpora and distributing the stripped versions. Thoughts?
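
If we go the run-it-once route, the one-off pass could be as small as the sketch below. It assumes the corpus is a flat directory of UTF-8 `.txt` files and that the cleaner is any `str -> str` callable; `clean_corpus` and the directory layout are hypothetical, not existing code in the repo.

```python
from pathlib import Path
from typing import Callable, Union


def clean_corpus(
    src_dir: Union[str, Path],
    dst_dir: Union[str, Path],
    cleaner: Callable[[str], str],
) -> int:
    """Hypothetical one-off script: clean every .txt file in src_dir.

    Cleaned copies are written to dst_dir under the same file names, so the
    originals stay untouched. Returns the number of files processed.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    count = 0
    for path in sorted(src.glob("*.txt")):
        cleaned = cleaner(path.read_text(encoding="utf-8"))
        (dst / path.name).write_text(cleaned, encoding="utf-8")
        count += 1
    return count
```

Writing to a separate output directory keeps the unstripped files around, which makes it easy to re-run the pass if the cleaner changes.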

dhmit / gender_analysis

Build text cleaner for document initialization #112