c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

load_etext() returns accent-pruned version of text #102

Closed lissyx closed 6 years ago

lissyx commented 6 years ago

I came across this issue, and I am not sure whether it's a bug in the lib or some process on Gutenberg itself. The eBook with id 10160 can be seen at http://www.gutenberg.org/cache/epub/10160/pg10160.txt and it does contain accents. Hosted on the cache there are two versions of the text file:

Now, I am not sure whether this is expected, but the one named 10160.txt does not contain any accents, whereas the 10160-8.txt file does contain them. Changing the order of the extensions in https://github.com/c-w/gutenberg/blob/e76d96eca4b0ab4b57fc146dbd83fc64e8b4eeeb/gutenberg/acquire/text.py#L66 helps work around the issue.

c-w commented 6 years ago

Hi @lissyx. Thanks for reaching out and for reporting this.

Do you think it would be worth changing the order of the extensions by default? Are there any cases where it wouldn't be beneficial to load the -8 version of the text?

If changing the order is the right way to go: would you mind making a pull request for this? Thanks!

lissyx commented 6 years ago

Thanks @c-w for the quick reply. Part of the problem here is that I have no idea whether this is expected behavior, and I could not find any related documentation from Project Gutenberg. I extracted a random dataset of 1000 French ebooks, and so far only one has exposed this behavior. Do you know what the -8 and -0 are for? I'd be happy to make a PR once we know for sure what the proper fix is :)

hugovk commented 6 years ago

File formats other than plain text will have a format-designator appended to the filename, as well as an appropriate file extension. The following list indicates the most common formats likely to be found at Project Gutenberg:

Plain text       12345.txt          12345.zip    (encoding: us-ascii)
8-bit plain text 12345-8.txt        12345-8.zip  (encodings: iso-8859-1, windows-1252, MacRoman, ...)
Big-5            12345-5.txt        12345-5.zip  (encoding: big-5)
Unicode          12345-0.txt        12345-0.zip  (encoding: utf-8)

HTML             12345-h.htm        12345-h.zip
TeX              12345-t.tex        12345-t.zip
XML              12345-x.xml        12345-x.zip
MP3              12345-m-###.mp3    12345-m.zip
RTF              12345-r.rtf        12345-r.zip

PDF              12345-pdf.pdf      12345-pdf.zip
LIT              12345-lit.lit      12345-lit.zip
MS Word Doc      12345-doc.doc      12345-doc.zip
PDB              12345-pdb.pdb      12345-pdb.zip

https://www.gutenberg.org/files/

lissyx commented 6 years ago

Thanks @hugovk, I failed to find that. So maybe the order of extensions should be changed in favor of: -0.txt, -8.txt, .txt ?

hugovk commented 6 years ago

Yes, that sounds sensible.

MasterOdin commented 6 years ago

I'd also like to request that we at least add a flag for preferring ASCII over other sources. We may also want to document that table within this library, so users know what to expect from an ebook.
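
One way to picture the proposed ordering together with an ASCII-preference flag is the sketch below. The function name `candidate_filenames` and the `prefer_ascii` parameter are hypothetical, made up for illustration; they are not the library's API, and the actual lookup in gutenberg/acquire/text.py may be structured differently.

```python
def candidate_filenames(etextno, prefer_ascii=False):
    """Return text filenames for an ebook in the order they would be tried.

    Hypothetical helper: the name and the prefer_ascii flag are assumptions.
    """
    # Suffixes follow the Project Gutenberg naming table quoted above:
    # "-0" is UTF-8, "-8" is an 8-bit encoding, no suffix is us-ascii.
    suffixes = ["-0", "-8", ""]
    if prefer_ascii:
        # Move the plain us-ascii file to the front when the caller asks for it.
        suffixes = [""] + [s for s in suffixes if s != ""]
    return ["{}{}.txt".format(etextno, suffix) for suffix in suffixes]


print(candidate_filenames(10160))
print(candidate_filenames(10160, prefer_ascii=True))
```

Keeping the default as UTF-8 first and making ASCII opt-in matches the order suggested earlier in the thread; whether the default should change at all is still an open question here.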

lissyx commented 6 years ago

@MasterOdin Would that be what you had in mind ? https://github.com/c-w/gutenberg/pull/103

No tests there yet, I'll add them afterwards; I just want to make sure it's good enough. I'm thinking that changing the default behavior might not be the best idea in the world, but I'd like an external eye on that.

lissyx commented 6 years ago

Updated PR with tests.

lissyx commented 6 years ago

@c-w There might also be another issue hidden:

python -c 'from gutenberg.acquire import load_etext; print(load_etext(55517, refresh_cache=True)[0:1000])' 
INFO:rdflib:RDFLib Version: 4.2.2
The Project Gutenberg EBook of Correspondance, by Émile Zola

This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under the terms of
the Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in the United States, you'll have
to check the laws of the country where you are located before using this ebook.

Title: Correspondance
       Lettres de jeunesse

Author: Émile Zola

Release Date: September 10, 2017 [EBook #55517]

Language: French

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK CORRESPONDANCE ***

Advertised as UTF-8, but the content is kind of broken indeed. Digging again in load_etext, it seems requests thinks the encoding is "ISO-8859-1" when the file really is UTF-8. So far, forcing response.encoding = "utf-8" fixes it for me locally.

Given that the file is written out as UTF-8 afterwards, I'd be tempted to add response.encoding = "utf-8" to my PR.
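
The symptom can be reproduced without any network access. The sketch below uses only Python's built-in codecs and a sample string taken from the header above; it illustrates the charset mismatch, not the actual code path inside load_etext.

```python
# Bytes as a server would send them for a UTF-8 encoded file.
raw = "Émile Zola".encode("utf-8")

# Decoding with the wrong charset turns each accented letter into two characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the file was actually written in restores the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

The first value is one character longer than the second, because the two UTF-8 bytes of "É" are each read as a separate ISO-8859-1 character.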

lissyx commented 6 years ago

I can confirm that multiple mirrors do not set any charset information when downloading, while the main website does:

$ curl -L -v http://www.gutenberg.org/files/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain; charset=UTF-8
$ curl -L -v http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain
$ curl -L -v http://aleph.gutenberg.org/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain

MasterOdin commented 6 years ago

Using the apparent_encoding still sets things right though?

lissyx commented 6 years ago

@MasterOdin Yes, it seems to be okay doing it like that.
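
For reference, a minimal sketch of the idea: trust the charset from the Content-Type header when the server sends one, and otherwise guess from the bytes. The helper names `charset_from_content_type` and `decode_body` are hypothetical, and the guess is a plain "try UTF-8, then fall back to ISO-8859-1" using only the standard library. It stands in for, and is much cruder than, the detection behind apparent_encoding in requests.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw, content_type):
    """Decode response bytes (hypothetical helper, not part of the library)."""
    declared = charset_from_content_type(content_type)
    if declared is not None:
        # The main website case: "text/plain; charset=UTF-8".
        return raw.decode(declared)
    try:
        # The mirror case: bare "text/plain", so try UTF-8 first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte, so this fallback never raises.
        return raw.decode("iso-8859-1")


print(decode_body("Émile Zola".encode("utf-8"), "text/plain"))
```

Trying UTF-8 first is reasonably safe because accented text in an 8-bit encoding is rarely valid UTF-8, so a wrong guess tends to fail loudly and trigger the fallback.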