BradRuderman / pyhs2


Unicode support #15

Closed. parautenbach closed this issue 10 years ago.

parautenbach commented 10 years ago

I have some Unicode data (encoded as UTF-8) in HDFS: fŏŏbārbaß (or in Python, u'f\u014f\u014fb\u0101rba\xdf')

When reading (querying) the table that contains this data, the cursor's fetch returns data of type str rather than unicode (which is what I'd expect). The Unicode characters have become question marks (?, ordinal value 0x3f, so it isn't a problem with representation). Querying via Beeswax in Hue returns the expected result.
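For reference, here's roughly how I'm reading it; the connection details and table name below are placeholders, following the usage from the README:

```python
# -*- coding: utf-8 -*-
import pyhs2

# Placeholder connection details and table name.
with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism='PLAIN',
                   user='hive',
                   password='',
                   database='default') as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT name FROM unicode_test")
        for row in cur.fetch():
            value = row[0]
            print type(value)   # <type 'str'>, not <type 'unicode'>
            print repr(value)   # non-ASCII characters come back as '?' (0x3f)
```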

Is there something I need to do to get the data in the desired format, or isn't Unicode supported?

pyhs2 version: 0.4.1
Python: 2.7.6
Platforms: OS X and Ubuntu

parautenbach commented 10 years ago

For a moment it looked like the issue might be in the Thrift library/package, but we're using version 0.9.1. I'm also (so far) unable to determine whether the issue is in fastbinary.c.

parautenbach commented 10 years ago

I have resolved this issue; in the end it had nothing to do with this library (not that I had any evidence it did, but I logged it here because pyhs2 is where the issue surfaced). I went down several avenues, including investigating the Python implementation of the Apache Thrift project. In the end I discovered that the data on the wire differed from what I expected, which meant none of the client libraries were at fault.

The mistake I made was forgetting that Beeswax and the command-line Hive utility do not use hiveserver2. Hence, they could serve neither as a reference point, nor as any indication that the server-side functionality was working fine.

To cut to the chase, the problem was that hiveserver2 wasn't decoding my Unicode data as UTF-8. The fix is to set an additional JVM property: -Dfile.encoding=UTF-8.
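(I set the property via the hiveserver2 Java options in Cloudera Manager; where exactly it goes will depend on how your hiveserver2 is launched.) After restarting, a quick client-side check confirmed the bytes on the wire are UTF-8 again; same placeholder connection details as above:

```python
# -*- coding: utf-8 -*-
import pyhs2

# Sanity check after restarting hiveserver2 with -Dfile.encoding=UTF-8.
with pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                   user='hive', password='', database='default') as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT name FROM unicode_test")
        raw = cur.fetch()[0][0]   # still a str (bytes) under Python 2
        print repr(raw)           # now '\xc5\x8f...'-style UTF-8 bytes, no '?'
        assert raw.decode('utf-8') == u'f\u014f\u014fb\u0101rba\xdf'
```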

It would be great if someone with a deeper understanding could shed some light on the reasons for this behaviour, as it was my understanding that Java defaults to UTF-8 for its string data types, but it seems to be ISO-8859-1, i.e. Latin-1. I can also find no other way of setting this (e.g. system-wide), or of letting it use the system's default (e.g. on Ubuntu or some other Linux variant, the values in /etc/default/locale).
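As a rough illustration of the mechanism (Python standing in for what a JVM charset encoder does when it substitutes a replacement character for anything the target charset can't represent):

```python
# -*- coding: utf-8 -*-
# Characters the target charset can't represent are replaced with '?' (0x3f),
# which matches the symptom described above.
s = u'f\u014f\u014fb\u0101rba\xdf'
print repr(s.encode('ascii', 'replace'))    # 'f??b?rba?'    -- all non-ASCII lost
print repr(s.encode('latin-1', 'replace'))  # 'f??b?rba\xdf' -- only \xdf (ss) survives
```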

One could also argue that (in my case, in Cloudera Manager) hiveserver2 should have an explicit configuration option for this, or at least be started with the correct system property, since HDFS assumes UTF-8 anyway.

Disclaimer: I haven't investigated the hiveserver2 source code.