strip_headers failure cases

c-w / gutenberg

A simple interface to the Project Gutenberg corpus.

Apache License 2.0


strip_headers failure cases #27

Closed (ikarth closed 8 years ago)

ikarth commented 8 years ago

Here's a list of files that have some degree of failure in strip_headers(): https://gist.github.com/ikarth/49e10e7b5a66fe8d6732

This was made by grepping over the English-language, public domain text files from the subset corpus I've been assembling (April 2010 DVD + another archive from 2013) for the first mention of "Project Gutenberg".

There are a few false positives, such as "THE COMPLETE PROJECT GUTENBERG WORKS OF GEORGE MEREDITH" and a few transcriber's notes, but the most common leftovers are in illustrated books, which carry a note that an HTML version with pictures is available.

Removing or not removing some of these may be an aesthetic call, but I figured the data might be useful.
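For anyone who wants to reproduce this kind of check without grep, here is a minimal Python sketch of the idea: scan already-stripped text for the first line that still mentions "Project Gutenberg". The helper name `first_mention` is hypothetical and not part of the gutenberg package, and the case-insensitive match is an assumption about how the original grep was run.

```python
import re
from typing import Optional, Tuple

# Assumption: match case-insensitively, so that both "Project Gutenberg"
# and "PROJECT GUTENBERG" count as mentions.
MENTION = re.compile(r"project gutenberg", re.IGNORECASE)


def first_mention(text: str) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, stripped line) of the first mention, or None.

    Hypothetical helper for illustration only.
    """
    for number, line in enumerate(text.splitlines(), start=1):
        if MENTION.search(line):
            return number, line.strip()
    return None


# Toy input standing in for the output of strip_headers() on one file.
sample = "CHAPTER I\n\nNote: Project Gutenberg also has an HTML version.\n"
print(first_mention(sample))
```

A `None` result means the stripped text no longer mentions Project Gutenberg at all. Any other result is only a candidate failure and still needs a manual look, given the false positives mentioned above.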

c-w commented 8 years ago

Hi Isaac. Thanks for compiling this list. That is very helpful.

I'll look into these cases and see if the strip_headers function can be adapted to better cover the texts. Apparently the Project Gutenberg world has changed quite a bit since the original strip_headers algorithm was written!

c-w commented 8 years ago

Attempting to find some common patterns in the failure cases:

$ cat strip_headers_failure_cases.txt \
| cut -d':' -f2- \
| sed 's/  */ /g' | sed 's/^ *//' \
| cut -d' ' -f1-6 \
| tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -rn \
| sed 's/^ *//' \
| grep '^[0-9][0-9][0-9]*'

3510 note: project gutenberg also has an
48 and the project gutenberg online distributed
41 project gutenberg editions
32 mary meehan, and the project gutenberg
21 http://pglaf.org/fundraising. contributions to the project gutenberg
18 [where available, project gutenberg e-text numbers
17 project gutenberg also has an html
15 *** end of this project gutenberg
13 project gutenberg distributed proofreaders
12 note: project gutenberg also has the
10 project gutenberg has the other two
10 novels before her death in 1942.
10 note: project gutenberg also has volume
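
The same tally can be sketched in Python, which may be easier to extend than the shell pipeline. This is an approximate equivalent, not the command that produced the numbers above: it assumes each input line has the form `filename:matched line`, and it splits on any whitespace rather than on single spaces. The function name `tally_prefixes` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_prefixes(lines: Iterable[str], words: int = 6,
                   min_count: int = 10) -> List[Tuple[int, str]]:
    """Count lowercase prefixes of `words` words, most frequent first.

    Hypothetical helper; roughly mirrors cut / sed / tr / sort / uniq -c.
    """
    counts = Counter()
    for line in lines:
        # Drop the "filename:" part; keep the whole line if there is no colon.
        _, sep, rest = line.partition(":")
        if not sep:
            rest = line
        # Collapse whitespace, keep the first few words, lowercase them.
        prefix = " ".join(rest.split()[:words]).lower()
        if prefix:
            counts[prefix] += 1
    # Keep only prefixes seen at least `min_count` times.
    return [(n, p) for p, n in counts.most_common() if n >= min_count]


# Toy input in the assumed "filename:line" format.
report = [
    "1.txt: Note: Project Gutenberg also has an HTML version",
    "2.txt:Note:  Project Gutenberg also has an HTML version of this",
    "3.txt: *** END OF THIS PROJECT GUTENBERG EBOOK",
]
for count, prefix in tally_prefixes(report, min_count=2):
    print(count, prefix)
```

The `min_count` threshold plays the role of the final `grep`, which keeps only counts with at least two digits.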

ikarth commented 8 years ago

On that note, these are files where strip_headers removes the entire file:

12538.txt 13882-0.txt 15700-0.txt 16350-0.txt 22261.txt 22515-0.txt 22535-0.txt 26535-0.txt 33525.txt 33956.txt 34186-0.txt 34572.txt 35159.txt 35190.txt 35242.txt 35589-0.txt 36127.txt 36203.txt 36261.txt 36285-0.txt 36530.txt 36598.txt 3704.txt 37713.txt 37804.txt 37916.txt 37959.txt 37960-0.txt 37985.txt 38003.txt 38077.txt 38507-0.txt 38507.txt 39338.txt 39397-0.txt 39526-0.txt 39961.txt 40161-0.txt 40815-0.txt 42251-0.txt
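
A list like this can be regenerated with a small check for texts that come back empty after stripping. The sketch below is only an illustration: `fully_stripped` is a hypothetical helper, and it takes the stripping function as a parameter so that it runs without the library installed. `toy_strip` is a stand-in for demonstration and does not reflect how strip_headers actually works.

```python
from typing import Callable, Dict, List


def fully_stripped(texts: Dict[str, str],
                   strip: Callable[[str], str]) -> List[str]:
    """Names of non-empty inputs whose stripped output is empty or whitespace.

    Hypothetical helper; `strip` is whatever header-stripping function
    is being tested.
    """
    return sorted(
        name for name, text in texts.items()
        if text.strip() and not strip(text).strip()
    )


def toy_strip(text: str) -> str:
    """Stand-in stripper: keep what follows a start marker, else nothing."""
    _, marker, body = text.partition("*** START ***\n")
    return body if marker else ""


texts = {
    "a.txt": "header\n*** START ***\nbody text\n",
    "b.txt": "header with an unrecognised marker\nbody text\n",
}
print(fully_stripped(texts, toy_strip))
```

In real use, `strip` would be the library's strip_headers function and `texts` would map file names to file contents.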

c-w commented 8 years ago

Some more key phrases marking the start of a text, extracted from the last batch of reports:

38003 http://www.pgdp.net
33956 Internet Archive/American Libraries.)
40815 Internet Archive/Canadian Libraries.)
35190 material from the Google Print project.)
36261 by the Internet Archive)
35589 public domain material from the Internet Archive.)
36530 The Internet Library of Early Journals
38507 http://gallica.bnf.fr)
39397 http://archive.org
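
One way such phrases could be used, sketched under assumptions: look for the last line near the top of the file that contains any of the phrases and treat everything after it as the body. This is not the library's implementation. The function name `split_after_markers`, the 100-line window, and the "last match wins" rule are all hypothetical choices made for illustration.

```python
from typing import Sequence

# Phrases copied from the list above.
START_MARKERS = (
    "http://www.pgdp.net",
    "Internet Archive/American Libraries.)",
    "Internet Archive/Canadian Libraries.)",
    "material from the Google Print project.)",
    "by the Internet Archive)",
    "public domain material from the Internet Archive.)",
    "The Internet Library of Early Journals",
    "http://gallica.bnf.fr)",
    "http://archive.org",
)


def split_after_markers(text: str,
                        markers: Sequence[str] = START_MARKERS,
                        window: int = 100) -> str:
    """Return the text after the last marker line within the first `window` lines.

    If no marker is found, the lines are returned unchanged (rejoined).
    """
    lines = text.splitlines()
    last = -1
    for index, line in enumerate(lines[:window]):
        if any(marker in line for marker in markers):
            last = index
    if last < 0:
        return "\n".join(lines)
    # Drop blank lines between the credits and the body.
    return "\n".join(lines[last + 1:]).lstrip("\n")


sample = (
    "Produced by Someone and the proofreading team at\n"
    "http://www.pgdp.net (from images made available by the\n"
    "Internet Archive/American Libraries.)\n"
    "\n"
    "CHAPTER I\n"
    "It began."
)
print(split_after_markers(sample))
```

Restricting the search to a window near the top is meant to avoid cutting at a URL that appears deep inside the body; the right window size would need checking against real files.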

Transcriber's notes are generally too diverse to safely remove; examples include:

[End Transcriber's notes.]
Transcriber’s Note
Transcriber's Note:
Etext transcriber's note:
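
Since removal looks unsafe, a gentler option is to flag such lines for manual review. The sketch below is a hypothetical helper, not part of the library: it matches "transcriber's note(s)" with either a straight or a typographic apostrophe, ignoring case, and only reports line numbers without changing the text.

```python
import re
from typing import List, Tuple

# Straight (') or typographic (\u2019) apostrophe, singular or plural "note".
TRANSCRIBER_NOTE = re.compile(r"transcriber['\u2019]s\s+notes?", re.IGNORECASE)


def flag_transcriber_notes(text: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, stripped line) for each line that matches."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if TRANSCRIBER_NOTE.search(line)
    ]


sample = "\n".join([
    "[End Transcriber's notes.]",
    "Transcriber\u2019s Note",
    "Transcriber's Note:",
    "Etext transcriber's note:",
    "CHAPTER I",
])
for number, line in flag_transcriber_notes(sample):
    print(number, line)
```

The pattern finds where a note is mentioned but says nothing about where the note ends, which is the part that makes automatic removal risky.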