strip_headers strips too much of the complete works of Shakespeare

hugovk commented 8 years ago

https://www.gutenberg.org/cache/epub/100/pg100.txt

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by William Shakespeare

Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from gutenberg.acquire import load_etext
INFO:rdflib:RDFLib Version: 4.2.1
>>> from gutenberg.cleanup import strip_headers
>>> etext = load_etext(100)
>>> text = strip_headers(etext).strip()
>>> len(text)
5373
>>> len(etext)
5589915

etext contains the full text from The Project Gutenberg EBook of The Complete Works of William Shakespeare up to and including *** END: FULL LICENSE ***.

But text only contains the text from *Project Gutenberg is proud to cooperate with The World Library* up to and including ["Small Print" V.12.08.93].

MasterOdin commented 8 years ago

Here's a better link to the txt file (the one posted gives me a 403 error): http://www.gutenberg.lib.md.us/1/0/100/100.txt

c-w commented 8 years ago

That's unfortunate. I got the code for strip_headers from the PAPI project. Looks like their approach doesn't handle the Shakespeare texts very well because the licence intersperses the following blurb into the text:

<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE
DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>
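
Since the blurb is wrapped in << and >>, each occurrence can be located with a regular expression. The following is only a rough sketch, assuming every copy opens with the same words and runs to the next >>. The names BLURB and remove_blurbs are hypothetical and not part of the gutenberg library, and the sample text is made up for illustration.

```python
import re

# Assumption: every interspersed blurb starts with this phrase and ends at
# the next ">>". DOTALL lets the match span multiple lines.
BLURB = re.compile(r"<<THIS ELECTRONIC VERSION OF.*?>>", re.DOTALL)


def remove_blurbs(text):
    """Return text with every matched blurb removed (hypothetical helper)."""
    return BLURB.sub("", text)


# Made-up sample standing in for a fragment of the Shakespeare file.
sample = (
    "ACT I\n"
    "<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM\n"
    "SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC.\n"
    "SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>\n"
    "ACT II\n"
)

blurb_count = len(BLURB.findall(sample))
cleaned = remove_blurbs(sample)
print(blurb_count)
print(cleaned)
```

Running the blurbs through a helper like this before calling strip_headers might sidestep the problem, though I haven't verified that against the full file.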

c-w commented 8 years ago

The problem here is the following two entries in _domain_model.text:

Line 41: "SERVICE THAT CHARGES FOR DOWNLOAD"
Line 72: "<<THIS ELECTRONIC VERSION OF"

The implementation of strip_headers clears the output text whenever it encounters a line indicating the end of the licence header (such as "SERVICE THAT CHARGES FOR DOWNLOAD"). Because the Shakespeare file repeats the licence blurb throughout the text, the output is cleared again at every occurrence, which is why only a small fragment of the text survives.
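
To illustrate the clearing behaviour, here is a simplified model. It is a sketch only, not the library's actual code: naive_strip, START_MARKERS, and the sample lines are hypothetical, and the real function does more than this.

```python
# Simplified model of marker-based header stripping (NOT the real
# strip_headers). Only the second marker is taken from the thread above;
# the first is an illustrative stand-in.
START_MARKERS = (
    "*** START OF THIS PROJECT GUTENBERG",
    "SERVICE THAT CHARGES FOR DOWNLOAD",
)


def naive_strip(lines, start_markers=START_MARKERS):
    """Keep only the lines that follow the last line matching a start marker."""
    out = []
    for line in lines:
        if any(line.startswith(marker) for marker in start_markers):
            out = []  # treat everything collected so far as header
            continue
        out.append(line)
    return out


# Made-up sample: a header, some text, an interspersed blurb, more text.
sample = [
    "Licence header",
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***",
    "THE SONNETS",
    "<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM",
    "SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>",
    "ALL'S WELL THAT ENDS WELL",
]

with_blurb_marker = naive_strip(sample)
without_blurb_marker = naive_strip(
    sample, start_markers=("*** START OF THIS PROJECT GUTENBERG",)
)
print(with_blurb_marker)
print(without_blurb_marker)
```

With the blurb line registered as a start marker, "THE SONNETS" is thrown away along with the header. Without it, the text is kept but the blurb lines remain in the output.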

While I'm working on a full fix, please find below a partial fix that correctly removes the legalese header and footer (but doesn't handle the interspersed legal disclaimers).

diff --git a/gutenberg/_domain_model/text.py b/gutenberg/_domain_model/text.py
index 9439c93..f2db8b8 100644
--- a/gutenberg/_domain_model/text.py
+++ b/gutenberg/_domain_model/text.py
@@ -38,10 +38,10 @@ TEXT_START_MARKERS = frozenset((u(_) for _ in (
     "l'authorization à les utilizer pour preparer ce texte.",
     "of the etext through OCR.",
     "*****These eBooks Were Prepared By Thousands of Volunteers!*****",
-    "SERVICE THAT CHARGES FOR DOWNLOAD",
     "We need your donations more than ever!",
     " *** START OF THIS PROJECT GUTENBERG",
     "****     SMALL PRINT!",
+    '["Small Print" V.',
 )))

@@ -69,7 +69,6 @@ TEXT_END_MARKERS = frozenset((u(_) for _ in (
     "Ce document fut présenté en lecture",
     "More information about this book is at the top of this file.",
     "We need your donations more than ever!",
-    "<<THIS ELECTRONIC VERSION OF",
     "END OF PROJECT GUTENBERG",
     " End of the Project Gutenberg",
     " *** END OF THIS PROJECT GUTENBERG",

c-w commented 8 years ago

NB: if you find the strip_headers function to work sub-optimally for some texts, please do let me know so that I can add more test-cases to the library. Thanks in advance.

hugovk commented 8 years ago

Thanks! And will do!

c-w / gutenberg

strip_headers strips too much of the complete works of Shakespeare #25