c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

strip_headers failure cases #27

Closed: ikarth closed this issue 8 years ago

ikarth commented 8 years ago

Here's a list of files that have some degree of failure in strip_headers(): https://gist.github.com/ikarth/49e10e7b5a66fe8d6732

I made this by grepping the English-language, public-domain text files in the subset corpus I've been assembling (the April 2010 DVD plus another archive from 2013) for the first mention of "Project Gutenberg".

There are a few false positives, such as "THE COMPLETE PROJECT GUTENBERG WORKS OF GEORGE MEREDITH" and a few transcriber notes, but the most common leftovers are in illustrated books that carry a note saying an HTML version with pictures is available.

Removing or not removing some of these may be an aesthetic call, but I figured the data might be useful.
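
For anyone wanting to reproduce a list like this without grep, here is a minimal Python sketch of the same idea. It assumes the search ran over the output of the stripping step and that the match was case-insensitive; neither is stated above. The names `first_mention` and `report_failures` are hypothetical, and `strip` stands in for whatever header-stripping function is under test.

```python
import re
from typing import Callable, Dict, Optional

# Assumption: case-insensitive match, so all-caps title lines also count.
PHRASE = re.compile(r"project gutenberg", re.IGNORECASE)


def first_mention(text: str) -> Optional[str]:
    """Return the first line that mentions Project Gutenberg, or None."""
    for line in text.splitlines():
        if PHRASE.search(line):
            return line.strip()
    return None


def report_failures(texts: Dict[str, str],
                    strip: Callable[[str], str]) -> Dict[str, str]:
    """Map each file name to the first mention left over after stripping."""
    failures = {}
    for name, raw in texts.items():
        leftover = first_mention(strip(raw))
        if leftover is not None:
            failures[name] = leftover
    return failures
```

Passing the stripping function in as a parameter keeps the sketch independent of any particular library version, so the same report can be rerun after the function changes.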

c-w commented 8 years ago

Hi Isaac. Thanks for compiling this list. That is very helpful.

I'll look into these cases and see if the strip_headers function can be adapted to better cover the texts. Apparently the Project Gutenberg world has changed quite a bit since the original strip_headers algorithm was written!

c-w commented 8 years ago

Attempting to find some common patterns in the failure cases:

$ cat strip_headers_failure_cases.txt \
| cut -d':' -f2- \
| sed 's/  */ /g' | sed 's/^ *//' \
| cut -d' ' -f1-6 \
| tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -rn \
| sed 's/^ *//' \
| grep '^[0-9][0-9][0-9]*'

3510 note: project gutenberg also has an
48 and the project gutenberg online distributed
41 project gutenberg editions
32 mary meehan, and the project gutenberg
21 http://pglaf.org/fundraising. contributions to the project gutenberg
18 [where available, project gutenberg e-text numbers
17 project gutenberg also has an html
15 *** end of this project gutenberg
13 project gutenberg distributed proofreaders
12 note: project gutenberg also has the
10 project gutenberg has the other two
10 novels before her death in 1942.
10 note: project gutenberg also has volume
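
The same tally can be sketched in Python, which may be easier to extend than the shell pipeline. This is an approximation, not a drop-in replacement: it assumes each input line has the `filename:matched text` shape that grep produces, it splits on any whitespace rather than only spaces, and `common_prefixes` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def common_prefixes(lines: Iterable[str],
                    n_words: int = 6,
                    min_count: int = 10) -> List[Tuple[int, str]]:
    """Count the lowercased first n_words of each line, most frequent first."""
    counts = Counter()
    for line in lines:
        # Drop everything up to the first colon (like `cut -d':' -f2-`);
        # lines without a colon are kept whole, as cut does.
        _, sep, rest = line.partition(":")
        text = rest if sep else line
        # split() also squeezes repeated whitespace and trims the ends.
        words = text.lower().split()
        counts[" ".join(words[:n_words])] += 1
    # Keep only prefixes seen at least min_count times
    # (the grep for two or more digits in the pipeline).
    return [(count, prefix)
            for prefix, count in counts.most_common()
            if count >= min_count]
```

Prefixes with equal counts come back in first-seen order here, so ties may be ordered differently from the shell output.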

ikarth commented 8 years ago

On that note, these are files where strip_headers removes the entire file:

12538.txt 13882-0.txt 15700-0.txt 16350-0.txt 22261.txt 22515-0.txt 22535-0.txt 26535-0.txt 33525.txt 33956.txt 34186-0.txt 34572.txt 35159.txt 35190.txt 35242.txt 35589-0.txt 36127.txt 36203.txt 36261.txt 36285-0.txt 36530.txt 36598.txt 3704.txt 37713.txt 37804.txt 37916.txt 37959.txt 37960-0.txt 37985.txt 38003.txt 38077.txt 38507-0.txt 38507.txt 39338.txt 39397-0.txt 39526-0.txt 39961.txt 40161-0.txt 40815-0.txt 42251-0.txt
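
A check along these lines could be used to regenerate that list after any change to the stripping logic. It is only a sketch: `fully_stripped` is a hypothetical helper, "removes the entire file" is assumed to mean that nothing but whitespace remains, and `strip` is whichever stripping function is being tested.

```python
from typing import Callable, Dict, List


def fully_stripped(texts: Dict[str, str],
                   strip: Callable[[str], str]) -> List[str]:
    """Names of non-empty inputs whose stripped output is only whitespace."""
    return sorted(
        name
        for name, raw in texts.items()
        # Skip inputs that were already empty; they are not failures.
        if raw.strip() and not strip(raw).strip()
    )
```

With the library installed, `strip` would presumably be `strip_headers` imported from `gutenberg.cleanup`, and `texts` a mapping from file name to file contents read from disk.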

c-w commented 8 years ago

Some more key phrases that mark the start of texts extracted from the last batch of reports:

Transcriber's notes are generally too diverse to safely remove; examples include: