c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

UnicodeEncodeError: 'ascii' codec can't encode character #122

Closed by ericleasemorgan 5 years ago

ericleasemorgan commented 5 years ago

How do I resolve UnicodeEncodeError?

I have the following Python 3.x script, and from the terminal it works 100% of the time:

#!/usr/local/anaconda/bin/python

# require
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
import sys

# get input
gid = int( sys.argv[ 1 ] )

# do the work and done
etext = strip_headers( load_etext( gid ) ).strip()
print( etext )
exit()

From what I can tell, etext is always of type str.

Unfortunately, sometimes the script returns an error when running under a CGI interface. For example, the following URL returns exactly what I desire:

http://dh.crc.nd.edu/sandbox/gutenberg/cgi-bin/get.cgi?gid=1497

On the other hand, the following doesn't really return anything:

http://dh.crc.nd.edu/sandbox/gutenberg/cgi-bin/get.cgi?gid=205

My log contains the following error:

Traceback (most recent call last):
  File "./bin/get.py", line 24, in <module>
    print( etext )
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 2074: ordinal not in range(128)

In my Python script I can add encoding, like this:

print( etext.encode( 'utf8' ) )

When I do so, the CGI script always returns something, but the result is the printed representation of a bytes object rather than the text itself:

b'WALDEN\n\n\n\n\nand\n\n\n\nON THE DUTY OF CIVIL DISOBEDIENCE\n\n\n\n...

Do y'all have any ideas how I can resolve this issue?

c-w commented 5 years ago

I'm not too familiar with CGI + Python, but given that the script works fine in the terminal, I'd hypothesize that the CGI environment gives Python an ASCII-only output encoding (for example, because no UTF-8 locale is set for the web server process). Perhaps this is something you could look into?
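If that hypothesis holds, one possible workaround is to skip print() and write UTF-8 bytes straight to the binary layer of stdout, so the locale's default codec is never consulted. The following is only a minimal sketch under that assumption: write_cgi_text and its parameters are hypothetical names that are not part of gutenberg, and the Content-Type header is only needed if nothing else in the CGI setup already emits one.

```python
import io
import sys


def write_cgi_text(text, stream=None, include_header=True):
    """Write text as UTF-8 bytes to a binary stream (hypothetical helper).

    Encoding explicitly and writing to the binary layer avoids the text
    layer of sys.stdout, which is what raises UnicodeEncodeError when its
    encoding is ASCII.
    """
    if stream is None:
        # sys.stdout.buffer is the underlying binary stream of stdout.
        stream = sys.stdout.buffer
    if include_header:
        # Assumption: the CGI response should declare its charset so the
        # browser decodes the body as UTF-8.
        stream.write(b"Content-Type: text/plain; charset=utf-8\r\n\r\n")
    stream.write(text.encode("utf-8"))
    stream.flush()


def demo():
    """Show the helper on an in-memory stream instead of real stdout."""
    buf = io.BytesIO()
    write_cgi_text("Thoreau\u2019s Walden", stream=buf)
    return buf.getvalue()
```

In the script from the question, this would mean calling write_cgi_text( etext ) in place of print( etext ). To confirm the diagnosis first, logging sys.stdout.encoding from inside the CGI script should show which codec print() is using. Setting the PYTHONIOENCODING=utf-8 environment variable for the CGI process, or calling sys.stdout.reconfigure(encoding="utf-8") on Python 3.7+, are alternatives that keep print() working.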

More generally, I'd recommend deploying a copy of gutenberg-http, which is a web service built on top of gutenberg and should work pretty well out of the box.

There's also a publicly accessible instance that I host: https://gutenberg.justamouse.com/ (no guarantees on uptime since for now I'm just running this on a single VM).

c-w commented 5 years ago

Resolving, since this seems to be more of an issue with the CGI server setup than with the gutenberg library. Please feel free to reopen if you have any further questions, or open an issue on gutenberg-http if you face any problems with that project. Thanks!