Here's a better link to the txt file (the one posted gives me a 403 error): http://www.gutenberg.lib.md.us/1/0/100/100.txt
That's unfortunate. I got the code for strip_headers from the PAPI project. It looks like their approach doesn't handle the Shakespeare texts very well, because the following licence blurb is interspersed throughout the text:
<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE
DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY. PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>
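The problem is caused by two entries in _domain_model.text: "SERVICE THAT CHARGES FOR DOWNLOAD" (in TEXT_START_MARKERS) and "<<THIS ELECTRONIC VERSION OF" (in TEXT_END_MARKERS).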
The implementation of strip_headers clears the output text whenever it encounters a line indicating the end of the licence header (such as "SERVICE THAT CHARGES FOR DOWNLOAD"). Because the licence blurb is interspersed throughout the Shakespeare file, the output is cleared far too often and most of the actual text is lost.
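To make the failure mode concrete, here is a minimal sketch of marker-based stripping. It is a simplified illustration, not the library's actual strip_headers code: the function name naive_strip_headers and the tiny marker tuples are assumptions made for this example only.

```python
# Simplified illustration only; NOT the real gutenberg strip_headers code.
# The marker tuples are a tiny, hypothetical subset of the real lists.
START_MARKERS = (
    "*** START OF THIS PROJECT GUTENBERG",
    "SERVICE THAT CHARGES FOR DOWNLOAD",  # the problematic entry
)
END_MARKERS = ("END OF PROJECT GUTENBERG",)


def naive_strip_headers(text):
    """Drop everything up to the last start marker seen, stop at an end marker."""
    out = []
    for line in text.splitlines():
        if any(line.startswith(marker) for marker in START_MARKERS):
            out = []  # header assumed to end here, so discard what was collected
            continue
        if any(line.startswith(marker) for marker in END_MARKERS):
            break  # footer assumed to start here
        out.append(line)
    return "\n".join(out)


sample = "\n".join([
    "Licence header",
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***",
    "THE SONNETS",
    "COMMERCIALLY. PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY",
    "SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>",
    "ALL'S WELL THAT ENDS WELL",
    "END OF PROJECT GUTENBERG",
    "Licence footer",
])

result = naive_strip_headers(sample)
print(result)
```

In this sample, "THE SONNETS" is collected and then thrown away again when the interspersed blurb line matches a start marker, so only the text after the last blurb survives.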
While I'm working on a full fix, please find below a partial fix that correctly removes the legalese header and footer (but doesn't handle the interspersed legal disclaimers).
diff --git a/gutenberg/_domain_model/text.py b/gutenberg/_domain_model/text.py
index 9439c93..f2db8b8 100644
--- a/gutenberg/_domain_model/text.py
+++ b/gutenberg/_domain_model/text.py
@@ -38,10 +38,10 @@ TEXT_START_MARKERS = frozenset((u(_) for _ in (
"l'authorization à les utilizer pour preparer ce texte.",
"of the etext through OCR.",
"*****These eBooks Were Prepared By Thousands of Volunteers!*****",
- "SERVICE THAT CHARGES FOR DOWNLOAD",
"We need your donations more than ever!",
" *** START OF THIS PROJECT GUTENBERG",
"**** SMALL PRINT!",
+ '["Small Print" V.',
)))
@@ -69,7 +69,6 @@ TEXT_END_MARKERS = frozenset((u(_) for _ in (
"Ce document fut présenté en lecture",
"More information about this book is at the top of this file.",
"We need your donations more than ever!",
- "<<THIS ELECTRONIC VERSION OF",
"END OF PROJECT GUTENBERG",
" End of the Project Gutenberg",
" *** END OF THIS PROJECT GUTENBERG",
NB: if you find that the strip_headers function works sub-optimally for some texts, please do let me know so that I can add more test cases to the library. Thanks in advance.
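Until the interspersed disclaimers are handled by the library, one possible workaround is to remove them in a separate post-processing step. The sketch below is an assumption, not part of the gutenberg package: the helper name remove_interspersed_disclaimers is hypothetical, and it relies on every blurb starting with "<<THIS ELECTRONIC VERSION OF" and ending at the first following ">>".

```python
import re

# Matches one multi-line "<<THIS ELECTRONIC VERSION OF ... >>" blurb,
# plus the newline that follows it, if any. Assumes ">>" does not occur
# inside the blurb itself.
DISCLAIMER_RE = re.compile(r"<<THIS ELECTRONIC VERSION OF.*?>>\n?", re.DOTALL)


def remove_interspersed_disclaimers(text):
    """Return text with every interspersed licence blurb removed (hypothetical helper)."""
    return DISCLAIMER_RE.sub("", text)


sample = (
    "THE END\n"
    "<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM\n"
    "SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC.\n"
    "SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>\n"
    "1603\n"
)

cleaned = remove_interspersed_disclaimers(sample)
print(cleaned)
```

Running this on the raw text before any marker-based stripping would also keep the blurb lines from being mistaken for header or footer markers.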
Thanks! And will do!
https://www.gutenberg.org/cache/epub/100/pg100.txt
The Project Gutenberg EBook of The Complete Works of William Shakespeare, by William Shakespeare
etext contains the full text from "The Project Gutenberg EBook of The Complete Works of William Shakespeare" up to and including "*** END: FULL LICENSE ***".
But text only contains the text from "*Project Gutenberg is proud to cooperate with The World Library*" up to and including '["Small Print" V.12.08.93]'.