c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Added support for downloading texts from more languages #19

andrewyang96 closed 9 years ago

andrewyang96 commented 9 years ago

The current version uses ISO-8859-1 as the encoding when downloading Gutenberg ebooks. However, this restricts the languages that the library can download (see here).
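To illustrate the limitation, here is a minimal sketch in plain Python, independent of the library. The sample string is an assumption chosen for illustration. It shows that bytes holding UTF-8 encoded Chinese text come out garbled when decoded as ISO-8859-1, because that encoding maps every byte to a single character.

```python
# Illustrative only: the sample title is an assumption, not data from the library.
sample = "西遊記"  # "Journey to the West"

# Bytes as they would arrive from a server that sends UTF-8.
raw = sample.encode("utf-8")

# ISO-8859-1 maps each byte to one character, so decoding never fails,
# but multi-byte UTF-8 sequences are split into unrelated characters.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the encoding that was actually used recovers the text.
as_utf8 = raw.decode("utf-8")

print(as_utf8 == sample)    # True
print(as_latin1 == sample)  # False
print(len(sample), len(as_latin1))  # 3 9
```

Each of the three characters takes three bytes in UTF-8, so the ISO-8859-1 decoding yields nine unrelated characters instead of three.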

For example, the current version cannot download Chinese texts. The changes I have made fix that by changing the requests.get encoding to utf-8. Below is a script that downloads Journey to the West, a Chinese ebook, and saves it using the appropriate encoding.

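import io

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

OUTFILE = "<insert outfile destination here>"
# 23962 is the Project Gutenberg ID of Journey to the West.
text = strip_headers(load_etext(23962)).strip()
# Open the file as UTF-8 explicitly so the Chinese text is written correctly
# regardless of the platform's default encoding.
with io.open(OUTFILE, 'w', encoding='utf-8') as f:
    f.write(text)

c-w commented 9 years ago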

Thanks for the pull request. Very useful change. Could you add a unit test for the utf-8 load_etext case?