Closed by ikarth 8 years ago
Hi Isaac. Thanks for compiling this list. That is very helpful.
I'll look into these cases and see if the strip_headers function can be adapted to better cover the texts. Apparently the Project Gutenberg world has changed quite a bit since the original strip_headers algorithm was written!
Attempting to find some common patterns in the failure cases:
$ cat strip_headers_failure_cases.txt \
| cut -d':' -f2- \
| sed 's/  */ /g' | sed 's/^ *//' \
| cut -d' ' -f1-6 \
| tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -n \
| sed 's/^ *//' \
| grep '^[0-9][0-9][0-9]*'
3510 note: project gutenberg also has an
48 and the project gutenberg online distributed
41 project gutenberg editions
32 mary meehan, and the project gutenberg
21 http://pglaf.org/fundraising. contributions to the project gutenberg
18 [where available, project gutenberg e-text numbers
17 project gutenberg also has an html
15 *** end of this project gutenberg
13 project gutenberg distributed proofreaders
12 note: project gutenberg also has the
10 project gutenberg has the other two
10 novels before her death in 1942.
10 note: project gutenberg also has volume
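For anyone who would rather not run the shell pipeline, roughly the same tally can be sketched in Python. This is an illustrative equivalent, not part of the project: the name `common_prefixes` and its parameters are made up here, and it assumes each input line has the `filename: matched text` shape that the `cut -d':' -f2-` step implies. Unlike the pipeline, it splits on any whitespace (not just spaces) and skips lines that end up empty.

```python
from collections import Counter


def common_prefixes(lines, words=6, min_count=10):
    """Tally the first `words` words of each line, as the pipeline above does.

    Returns (count, prefix) pairs sorted by ascending count, keeping only
    prefixes seen at least `min_count` times (the grep step keeps counts
    with two or more digits, i.e. 10 and up).
    """
    counts = Counter()
    for line in lines:
        # Mimic `cut -d':' -f2-`: drop the text before the first colon,
        # or keep the whole line when there is no colon at all.
        head, sep, tail = line.partition(":")
        text = tail if sep else head
        # Collapse whitespace, keep the leading words, lowercase them.
        prefix = " ".join(text.split()[:words]).lower()
        if prefix:  # skip empty lines (a deviation from the pipeline)
            counts[prefix] += 1
    return sorted((n, p) for p, n in counts.items() if n >= min_count)
```

Feeding it the lines of strip_headers_failure_cases.txt (for example via `open(...)`) should give a listing comparable to the one above, though small differences are possible because of the whitespace handling.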
On that note, these are files where strip_headers removes the entire file:
12538.txt 13882-0.txt 15700-0.txt 16350-0.txt 22261.txt 22515-0.txt 22535-0.txt 26535-0.txt 33525.txt 33956.txt 34186-0.txt 34572.txt 35159.txt 35190.txt 35242.txt 35589-0.txt 36127.txt 36203.txt 36261.txt 36285-0.txt 36530.txt 36598.txt 3704.txt 37713.txt 37804.txt 37916.txt 37959.txt 37960-0.txt 37985.txt 38003.txt 38077.txt 38507-0.txt 38507.txt 39338.txt 39397-0.txt 39526-0.txt 39961.txt 40161-0.txt 40815-0.txt 42251-0.txt
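A check along these lines could flag such files automatically. It is only a sketch: `find_fully_stripped` is a hypothetical helper, and `strip` stands in for whatever header-stripping callable is being tested. It is assumed here to take a string and return a string, which may not match the real strip_headers signature.

```python
def find_fully_stripped(texts, strip):
    """Return the names of texts for which strip() leaves nothing behind.

    texts: mapping of filename -> original text
    strip: callable taking a str and returning the stripped str (assumed)
    """
    emptied = []
    for name, original in texts.items():
        stripped = strip(original)
        # Only flag files that had some content before stripping.
        if original.strip() and not stripped.strip():
            emptied.append(name)
    return sorted(emptied)
```

Passing the function in as an argument keeps the check independent of any particular version of the stripping code, so the same helper can be run before and after a fix.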
Some more key phrases that mark the start of texts extracted from the last batch of reports:
38003: http://www.pgdp.net
33956: Internet Archive/American Libraries.)
40815: Internet Archive/Canadian Libraries.)
35190: material from the Google Print project.)
36261: by the Internet Archive)
35589: public domain material from the Internet Archive.)
36530: The Internet Library of Early Journals
38507: http://gallica.bnf.fr)
39397: http://archive.org
Transcriber's notes are generally too diverse to safely remove; examples include:
Here's a list of files that have some degree of failure in strip_headers(): https://gist.github.com/ikarth/49e10e7b5a66fe8d6732
This was made by grepping over the English-language, public domain text files from the subset corpus I've been assembling (April 2010 DVD + another archive from 2013) for the first mention of "Project Gutenberg".
There are a few false positives like "THE COMPLETE PROJECT GUTENBERG WORKS OF GEORGE MEREDITH" and a few transcriber notes, but the most common leftovers are in illustrated books, which carry a note that an HTML version with pictures is available.
Removing or not removing some of these may be an aesthetic call, but I figured the data might be useful.
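For anyone wanting to reproduce the list, the grep step can be approximated in Python. This is a minimal sketch, assuming the text has already been run through the stripping function; `first_mention` is a name invented here, and the case-insensitive match is an assumption, since the exact grep options used are not given.

```python
def first_mention(text, phrase="Project Gutenberg"):
    """Return (line_number, line) for the first line containing phrase.

    Line numbers start at 1. Returns None when the phrase never appears,
    which would mean the stripped text is clean by this measure.
    """
    needle = phrase.lower()
    for number, line in enumerate(text.splitlines(), start=1):
        if needle in line.lower():
            return number, line.strip()
    return None
```

Any non-None result is only a candidate failure: as noted above, titles and transcriber notes can mention the phrase legitimately.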