catseye / Guten-gutter

Strips boilerplate from Project Gutenberg text files
The Unlicense

Known failure texts #1

Open ikarth opened 9 years ago

ikarth commented 9 years ago

Just a short list of Project Gutenberg files that are known failures:

00ws110.txt (and a lot of the other early William Shakespeare texts)
zncli10.txt
zen10.txt
25019-0.txt
25012-8.txt

As a general rule, numbered files seem to be standardized enough to pass, mostly, while the earlier files with letters in the file name are more hit-or-miss. Non-English files also tend to fail for some reason. Also, books with names too long to fit on one line tend to fail.

ikarth commented 9 years ago

The non-English books failing may be because the ones I was looking at had really long German titles, because the Chinese ones I glanced at seem to be fine.

cpressey commented 9 years ago

Handling titles that extend over 2 lines should definitely be fixed, if it can be (it probably can.)
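As a rough illustration of what handling a wrapped title could look like, here is a minimal sketch. It assumes the header has a "Title:" field whose continuation lines are indented. That layout is an assumption about the files, and `read_title` is a hypothetical helper, not a function in Guten-gutter.

```python
def read_title(header):
    """Return the title from a header, joining indented continuation lines.

    Assumes a 'Title:' field whose wrapped lines start with whitespace
    (an assumption about the file layout). Returns None if no title is found.
    """
    lines = header.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("Title:"):
            parts = [line[len("Title:"):].strip()]
            # Collect following lines for as long as they are indented and non-blank.
            for cont in lines[i + 1:]:
                if cont.startswith((" ", "\t")) and cont.strip():
                    parts.append(cont.strip())
                else:
                    break
            return " ".join(parts)
    return None
```

A title wrapped over three or more lines is handled the same way, since the loop only stops at the first blank or unindented line.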

I suspect some of these listed might be "old" texts. zncli10.txt, for example, I eventually did find, but I had to use a web search instead of Gutenberg's search. It does indeed fail on it because the "produced by" regexp is inadequate.

I'll see what I can do for it, and the other ones, shortly.

Is the boilerplate on non-English books in English? If not, ... it's going to be tough going, and I may just cop out and disclaim that this tool is only suitable for English works...

ikarth commented 9 years ago

Yes, these are from the April 2010 DVD. I believe that Gutenberg has modernized some of the old texts since then, but they seem to frown on me downloading the entire site at once.

The boilerplate on non-English works appears to be universally in English, at least for the ones I looked at. If this holds true, I imagine that it would be substantially easier to detect, say, where a Chinese or Cyrillic text begins and ends.
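A minimal sketch of that idea, assuming the boilerplate is in English and the body is written in a non-Latin script. The function names and the code-point cutoff are illustrative choices, not part of Guten-gutter. It would also pick up any title or author name printed in the original script inside the header, so it is only a rough first pass.

```python
def is_non_latin_line(line):
    """True if the line contains a letter beyond the Latin Extended-B block."""
    return any(ch.isalpha() and ord(ch) > 0x024F for ch in line)


def non_latin_span(lines):
    """Return (first, last) indices of lines containing non-Latin letters.

    Returns None when no such line exists, e.g. for an all-English file.
    """
    hits = [i for i, line in enumerate(lines) if is_non_latin_line(line)]
    if not hits:
        return None
    return hits[0], hits[-1]
```

Everything before the first index and after the last would then be a candidate for boilerplate, to be confirmed by the existing strippers.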

cpressey commented 8 years ago

I realized I could run the script on every text file I've downloaded from Gutenberg like so:

cd my_gutenberg_texts
mkdir tmp
guten-gutter --output-dir=tmp/ *

and it will report which ones it fails on. So I'll add them here (I've renamed them for my own convenience but retained the original filename at the end):

ProducedByStripper failed to clean 'A_Princess_of_Mars_pg62.txt'
ProducedByStripper failed to clean 'Around_the_World_in_80_Days_pg103.txt'
ProducedByStripper failed to clean 'The_Island_of_Doctor_Moreau_pg159.txt'
ProducedByStripper failed to clean 'The_Time_Machine_pg35.txt'
ProducedByStripper failed to clean 'War_and_Peace_pg2600.txt'
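For a larger collection, a report like this can be grouped by stripper. The sketch below assumes every failure line has exactly the shape shown above ("<Name> failed to clean '<file>'"); `group_failures` is a hypothetical helper, not part of Guten-gutter.

```python
import re
from collections import defaultdict

# Matches lines of the form: ProducedByStripper failed to clean 'name.txt'
FAILURE_RE = re.compile(r"^(\w+) failed to clean '(.+)'$")


def group_failures(report):
    """Map each stripper name to the list of files it failed to clean."""
    failures = defaultdict(list)
    for line in report.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            failures[match.group(1)].append(match.group(2))
    return dict(failures)
```

Lines that do not match the pattern are skipped, so any other output mixed into the report does not affect the result.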

cpressey commented 8 years ago

zen10.txt = http://www.gutenberg.org/cache/epub/34/pg34.txt = "This is a COPYRIGHTED Project Gutenberg eBook, Details Below" = WONTFIX... or at least NOTINCLINEDTOFIX... because I think the primary purpose of this tool is to extract the public domain contents of PG texts. I'll clarify this in the README.

ikarth commented 8 years ago

Note, for those wishing to sort out only the public domain texts: copyright information is included in the metadata in Project Gutenberg's RDF catalog: https://www.gutenberg.org/wiki/Gutenberg:Feeds

cpressey commented 8 years ago

Have added some commits (https://github.com/catseye/Guten-gutter/commit/c593b4dd6cf8f28842c63a1d20c89f18b0ec58e3 , https://github.com/catseye/Guten-gutter/commit/35c6f40f88cc9b117409e70224f14ecce81a2917) that handle zncli10.txt and pg159.txt (and maybe others like them, as a side-effect.)

For pg62.txt and the others listed in https://github.com/catseye/Guten-gutter/issues/1#issuecomment-158035094, the problem is that they don't have a "Produced by" line. In fact, the script cleans them fine; it only emits this warning because it tries to remove that line and can't find it.

Not 100% sure of the best way to handle this, because there is no general way to distinguish between having no "produced by" line and having a "produced by" line in a format that we don't recognize.
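One possible heuristic, sketched below, is to pair the strict pattern with a much looser one and warn only when the loose pattern matches but the strict one does not. Both patterns and the `classify_credit_line` name are hypothetical, not what Guten-gutter uses. This narrows the problem rather than solving it, since a credit line worded in some unanticipated way would still be reported as absent.

```python
import re

# Hypothetical patterns for illustration only.
STRICT_RE = re.compile(r"^Produced by .+$", re.MULTILINE)
LOOSE_RE = re.compile(
    r"^.*\b(?:produced|prepared|transcribed|scanned)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def classify_credit_line(header):
    """Classify the header's credit line.

    'recognized'   - the strict pattern matched, so the line can be stripped
    'unrecognized' - only the loose pattern matched, so a warning is warranted
    'absent'       - neither matched, so the file probably has no such line
    """
    if STRICT_RE.search(header):
        return "recognized"
    if LOOSE_RE.search(header):
        return "unrecognized"
    return "absent"
```

This should be applied to the header portion only, because words like "produced" can easily occur in the body of a book.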