ikarth opened this issue 9 years ago
The non-English books may be failing because the ones I was looking at had really long German titles; the Chinese ones I glanced at seem to be fine.
Handling titles that extend over 2 lines should definitely be fixed, if it can be (it probably can).
I suspect some of these listed files might be "old" texts. zncli10.txt, for example, I did eventually find, but I had to use a web search instead of Gutenberg's own search. The script does indeed fail on it because the "Produced by" regexp is inadequate.
I'll see what I can do for it, and the other ones, shortly.
Is the boilerplate on non-English books in English? If not... it's going to be tough going, and I may just cop out and disclaim that this tool is only suitable for English works...
Yes, these are from the April 2010 DVD. I believe that Gutenberg has modernized some of the old texts since then, but they seem to frown on me downloading the entire site at once.
The boilerplate on non-English works appears to be universally in English, at least for the ones I looked at. If this holds true, I imagine that it would be substantially easier to detect, say, where a Chinese or Cyrillic text begins and ends.
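If the boilerplate really is always English, one rough way to locate where the non-English body begins is to scan for the first character outside the Latin script. This is just a hypothetical sketch (no such helper exists in Guten-gutter) using Unicode block ranges for Cyrillic and CJK Unified Ideographs:

```python
def first_non_latin_index(text):
    """Return the index of the first Cyrillic or CJK character in text,
    or None if none is found.

    Hypothetical illustration only: real detection would need to cover
    more blocks (Greek, Kana, extensions, etc.).
    """
    for i, ch in enumerate(text):
        cp = ord(ch)
        # U+0400-U+04FF: Cyrillic; U+4E00-U+9FFF: CJK Unified Ideographs
        if 0x0400 <= cp <= 0x04FF or 0x4E00 <= cp <= 0x9FFF:
            return i
    return None
```

Everything before that index would be candidate English boilerplate; everything from it onward would be body text.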
I realized I could run the script on every text file I've downloaded from Gutenberg like so:
```shell
cd my_gutenberg_texts
mkdir tmp
guten-gutter --output-dir=tmp/ *
```
and it will report which ones it fails on. So I'll add them here (I've renamed the files for my own convenience, but retained the original filename at the end of each):
```
ProducedByStripper failed to clean 'A_Princess_of_Mars_pg62.txt'
ProducedByStripper failed to clean 'Around_the_World_in_80_Days_pg103.txt'
ProducedByStripper failed to clean 'The_Island_of_Doctor_Moreau_pg159.txt'
ProducedByStripper failed to clean 'The_Time_Machine_pg35.txt'
ProducedByStripper failed to clean 'War_and_Peace_pg2600.txt'
```
zen10.txt = http://www.gutenberg.org/cache/epub/34/pg34.txt = "This is a COPYRIGHTED Project Gutenberg eBook, Details Below" = WONTFIX... or at least NOTINCLINEDTOFIX... because I think the primary purpose of this tool is to extract the public domain contents of PG texts. I'll clarify this in the README.
Note, for those wishing to sort out only the public domain texts: copyright information is included in the metadata in Project Gutenberg's RDF catalog: https://www.gutenberg.org/wiki/Gutenberg:Feeds
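As a sketch of how that metadata could be used: Project Gutenberg's per-book RDF files appear to carry the rights string in a `dcterms:rights` element (e.g. "Public domain in the USA." for uncopyrighted texts). The helper below is hypothetical, not part of Guten-gutter, and assumes that element name; it uses only the standard library:

```python
import xml.etree.ElementTree as ET

# Dublin Core terms namespace, as used in Project Gutenberg's RDF catalog
DCTERMS = "{http://purl.org/dc/terms/}"

def rights_of(rdf_path):
    """Return the dcterms:rights string from a per-book RDF file,
    or None if the element is absent.

    Sketch only: assumes each per-book RDF file (e.g. pg34.rdf)
    contains a dcterms:rights element.
    """
    tree = ET.parse(rdf_path)
    elem = tree.getroot().find(f".//{DCTERMS}rights")
    return elem.text if elem is not None else None
```

A caller could then skip any file whose rights string doesn't start with "Public domain".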
Have added some commits (https://github.com/catseye/Guten-gutter/commit/c593b4dd6cf8f28842c63a1d20c89f18b0ec58e3 , https://github.com/catseye/Guten-gutter/commit/35c6f40f88cc9b117409e70224f14ecce81a2917) that handle zncli10.txt and pg159.txt (and maybe others like them, as a side effect).
For pg62.txt and the others listed in https://github.com/catseye/Guten-gutter/issues/1#issuecomment-158035094, the problem is that they don't have a "Produced by" line at all. In fact, the script cleans them fine; it just emits this warning message when it tries to remove that line and can't find it.
Not 100% sure of the best way to handle this, because there is no general way to distinguish between having no "Produced by" line at all and having one in a format that we don't recognize.
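One possible heuristic (a hypothetical sketch, not Guten-gutter's actual code) is to use two patterns: a strict one for credit lines we can remove, and a looser one that merely hints a credit line is present somewhere. A file matching neither is probably genuinely missing the line and can be cleaned silently; a file matching only the loose pattern deserves the warning:

```python
import re

# Strict pattern: credit lines we know how to strip.
PRODUCED_BY = re.compile(r"^\s*Produced by .+", re.IGNORECASE)

# Loose pattern: evidence that *some* credit line exists, even if we
# can't parse it. Both patterns are illustrative, not exhaustive.
CREDIT_HINT = re.compile(r"produced by|transcribed by|prepared by",
                         re.IGNORECASE)

def classify_credit(lines):
    """Classify a text's credit line as 'recognized',
    'unrecognized-format', or 'absent'."""
    if any(PRODUCED_BY.match(line) for line in lines):
        return "recognized"
    if any(CREDIT_HINT.search(line) for line in lines):
        return "unrecognized-format"
    return "absent"
```

Only the "unrecognized-format" case would then trigger the "failed to clean" warning; "absent" files would pass quietly.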
Just a short list of Project Gutenberg files that are known failures:
- 00ws110.txt (and a lot of the other early William Shakespeare texts)
- zncli10.txt
- zen10.txt
- 25019-0.txt
- 25012-8.txt
As a general rule, numbered files seem to be standardized enough to mostly pass, while the earlier files with letters in their names are more hit-or-miss. Non-English files also tend to fail for some reason, as do books with titles too long to fit on one line.