lissyx closed 6 years ago
Hi @lissyx. Thanks for reaching out and for reporting this.
Do you think it would be worth changing the default order of the extensions? Are there any cases where it wouldn't be beneficial to load the -8 version of the text?
If changing the order is the right way to go: would you mind making a pull request for this? Thanks!
Thanks @c-w for the quick reply. Part of the problem here is that I have no idea whether this is expected behavior, and I could not find any Project Gutenberg documentation about it. I extracted a random dataset of 1000 French ebooks, and so far only one has exposed this behavior. Do you know what the -8 and -0 suffixes are for? I'd be happy to make a PR once we know the proper fix for sure :)
File formats other than plain text will have a format-designator appended to the filename, as well as an appropriate file extension. The following list indicates the most common formats likely to be found at Project Gutenberg:
Plain text: 12345.txt, 12345.zip (encoding: us-ascii)
8-bit plain text: 12345-8.txt, 12345-8.zip (encodings: iso-8859-1, windows-1252, MacRoman, ...)
Big-5: 12345-5.txt, 12345-5.zip (encoding: big-5)
Unicode: 12345-0.txt, 12345-0.zip (encoding: utf-8)
HTML: 12345-h.htm, 12345-h.zip
TeX: 12345-t.tex, 12345-t.zip
XML: 12345-x.xml, 12345-x.zip
MP3: 12345-m-###.mp3, 12345-m.zip
RTF: 12345-r.rtf, 12345-r.zip
PDF: 12345-pdf.pdf, 12345-pdf.zip
LIT: 12345-lit.lit, 12345-lit.zip
MS Word Doc: 12345-doc.doc, 12345-doc.zip
PDB: 12345-pdb.pdb, 12345-pdb.zip
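To make the plain-text part of the list above concrete, here is a minimal sketch that maps a filename suffix to the encodings it may use. The names `SUFFIX_ENCODINGS` and `candidate_encodings` are hypothetical and not part of the gutenberg library; the encoding names are the Python codec spellings of the ones listed.

```python
import re

# Suffix -> candidate encodings, transcribed from the plain-text rows of the
# list above. Hypothetical helper data, not part of the gutenberg library.
SUFFIX_ENCODINGS = {
    "": ["us-ascii"],
    "-8": ["iso-8859-1", "windows-1252", "mac_roman"],
    "-5": ["big5"],
    "-0": ["utf-8"],
}


def candidate_encodings(filename):
    """Return the encodings a Gutenberg plain-text filename may use."""
    match = re.fullmatch(r"\d+(-[0-9a-z]+)?\.txt", filename)
    if match is None:
        raise ValueError("not a plain-text Gutenberg filename: %r" % filename)
    suffix = match.group(1) or ""
    # Unknown suffixes yield an empty list rather than a guess.
    return SUFFIX_ENCODINGS.get(suffix, [])
```

Note that the -8 suffix is ambiguous by design: it only says "some 8-bit encoding", so a caller would still have to pick one.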
Thanks @hugovk, I failed to find that. So maybe the order of extensions should be changed to -0.txt, -8.txt, .txt?
Yes, that sounds sensible.
I'd also like to request that we add a flag, at least for preferring ASCII over the other sources. We may also want to document that table within this library, so users know what to expect from an ebook.
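A rough sketch of what the proposed ordering plus an opt-in flag could look like. The function name `candidate_filenames` and the `prefer_ascii` parameter are assumptions made for illustration; they are not taken from the library or from the PR.

```python
def candidate_filenames(etextno, prefer_ascii=False):
    """List plain-text filenames to try for an ebook, most preferred first.

    Hypothetical helper: `prefer_ascii` is an illustrative flag, not an
    existing option of the gutenberg library.
    """
    if prefer_ascii:
        # Caller explicitly wants the us-ascii rendition first.
        suffixes = [".txt", "-0.txt", "-8.txt"]
    else:
        # Order proposed in this thread: UTF-8, then 8-bit, then ASCII.
        suffixes = ["-0.txt", "-8.txt", ".txt"]
    return ["{}{}".format(etextno, suffix) for suffix in suffixes]
```

For ebook 10160 the default call would try 10160-0.txt first, then 10160-8.txt, then 10160.txt.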
@MasterOdin Would this be what you had in mind? https://github.com/c-w/gutenberg/pull/103
No tests there yet; I'll add them afterwards, I just want to make sure the approach is good enough. I'm thinking that changing the default behavior might not be the best idea in the world, but I'd like an external eye on that.
Updated PR with tests.
@c-w There might also be another issue hidden:
python -c 'from gutenberg.acquire import load_etext; print(load_etext(55517, refresh_cache=True)[0:1000])'
INFO:rdflib:RDFLib Version: 4.2.2
The Project Gutenberg EBook of Correspondance, by Ãmile Zola
This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms of
the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you'll have
to check the laws of the country where you are located before using this ebook.
Title: Correspondance
Lettres de jeunesse
Author: Ãmile Zola
Release Date: September 10, 2017 [EBook #55517]
Language: French
Character set encoding: UTF-8
*** START OF THIS PROJECT GUTENBERG EBOOK CORRESPONDANCE ***
The file is advertised as UTF-8, but the content is indeed broken. Digging again into load_etext, it seems requests thinks the encoding is "ISO-8859-1" when the file really is UTF-8. So far, forcing response.encoding = "utf-8" fixes it for me locally.
Given that the file is written out as UTF-8 afterwards, I'd be tempted to add response.encoding = "utf-8" to my PR.
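The garbled author name in the output above is consistent with UTF-8 bytes being decoded as ISO-8859-1, which as far as I know is the encoding requests falls back to for text/* responses that declare no charset. A small offline sketch that reproduces the effect with no network access:

```python
# "É" is two bytes in UTF-8 (0xC3 0x89). Read as ISO-8859-1, each byte
# becomes its own character: "Ã" followed by an invisible control character.
author = "Émile Zola"
raw = author.encode("utf-8")

wrong = raw.decode("iso-8859-1")  # what a wrong charset guess produces
right = raw.decode("utf-8")       # what forcing UTF-8 produces

print(repr(wrong))
print(repr(right))
```

The second character of the wrong decoding is a C1 control code, which is why the output shows "Ãmile" with the "É" apparently missing.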
I can confirm that multiple mirrors do not send any charset information when serving the file, while the main website does:
$ curl -L -v http://www.gutenberg.org/files/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain; charset=UTF-8
$ curl -L -v http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain
$ curl -L -v http://aleph.gutenberg.org/5/5/5/1/55517/55517-0.txt 2>&1 | grep Content-Type
< Content-Type: text/plain
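The same check can be sketched offline in Python by parsing the Content-Type values shown above with the standard library. `declared_charset` is a hypothetical helper name, not something the gutenberg library provides.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the lower-cased charset of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(declared_charset("text/plain; charset=UTF-8"))  # main website
print(declared_charset("text/plain"))                 # mirrors
```

A None result is the case where the client has to pick an encoding on its own.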
Using the apparent_encoding still sets things right though?
@MasterOdin Yes, it seems to be okay doing it like that.
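For comparison, here is a dependency-free sketch of a decoding fallback. It is not what apparent_encoding does (requests relies on a character-detection library for that); it only illustrates the idea of trusting a declared charset first and guessing otherwise. `decode_body` is a hypothetical name.

```python
def decode_body(raw, declared=None):
    """Decode response bytes, using the declared charset when there is one.

    Hypothetical helper: without a declared charset it tries strict UTF-8
    and falls back to ISO-8859-1, which accepts any byte sequence.
    """
    if declared:
        return raw.decode(declared)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

The trade-off is that an 8-bit file in another encoding (windows-1252, MacRoman) would be decoded as ISO-8859-1 without any error, so detection is still the more general answer.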
I came across this issue, and I am not sure whether it's a bug in the library or in some process on Gutenberg itself. The eBook with id 10160 can be seen at http://www.gutenberg.org/cache/epub/10160/pg10160.txt, and it does contain accents. Two versions of the text file are hosted on the cache, 10160.txt and the -8 variant: http://aleph.gutenberg.org/1/0/1/6/10160/10160-8.txt
Now, I am not sure if it's expected or not, but the one named 10160.txt does not contain any accents, while the 10160-8.txt file does contain them. Changing the order of the extensions in https://github.com/c-w/gutenberg/blob/e76d96eca4b0ab4b57fc146dbd83fc64e8b4eeeb/gutenberg/acquire/text.py#L66 helps to work around the issue.