pyodbc / unixODBC problem reading nvarchar

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. on custom built Linux (kernel 2.6.9), build pyodbc from source code, using 
python 2.6 and unixODBC 2.3.0
2. in Netezza database, setup a table with column with datatype nvarchar, and a 
column with datatype varchar
3. insert one row of values, say ('A', 'B') into the table
4. run sql query using pyodbc, select * from test_table;

The test script looks like this:

import pyodbc

def main():

    conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=server.example.com;DATABASE=mydatabase;UID=myusername;PWD=password", ansi=True)
    sql = 'select * from MY_TABLE'
    cur = conn.cursor()
    cur.execute(sql)
    row = cur.fetchone()
    print row

if __name__ == "__main__":
    main()

What is the expected output? What do you see instead?
I expect to see (u'A', 'B') or ('A', 'B'), but I actually see (u'', 'B').

If I insert more characters, say ('AAAA', 'BBBB'), then I got this when 
querying using pyodbc:
(u'\u4141\u4141', 'BBBB')

While I can get the expected ('AAAA', 'BBBB') using either isql or nzsql, or 
even pyodbc on windows.

What version of the product are you using? On what operating system?

Linux 2.6.9 64 bit, custom built
python 2.6
pyodbc 2.1.8 (custom built)
unixODBC 2.3.0
Netezza odbc driver 64 bit

Please provide any additional information below.
I haven't tried this with other type of database, as I don't have access to 
another database type at the moment.

On the other hand, I can use isql (binary tool that came with unixODBC) and 
nzsql (binary tools that came with Netezza database) to do the same query, and 
see the expected output.

Thanks

Original issue reported on code.google.com by jbao...@gmail.com on 21 Mar 2011 at 9:06

GoogleCodeExporter commented 9 years ago

The string u'\u4141\u4141' indicates that a UTF-8 string is being received when 
UTF-16 is expected:
>>> u'AAAA'.encode('UTF-8').decode('UTF-16')
u'\u4141\u4141'

This would happen in the following scenario:
1) Netezza ODBC driver reports the column's SQL type to be "wvarchar" 
(represented by the ODBC contant SQL_WVARCHAR)
2) pyodbc would then request to read the column data as the C language type 
"wchar_t*" (represented by the ODBC constant SQL_C_WCHAR)
3) Netezza ODBC driver then returns UTF-8 data (wrong! wchar_t* in C is UTF-16 
or UCS4 depending on compiler/platform, never UTF-8)

While step 3 should be regarded as a bug in the Netezza ODBC driver, I wonder 
if this is configurable behaviour: looking at Netezza's sample odbcinst.ini 
shown in this forum post:
http://www.netezzacommunity.com/message/5801
... I see the following option:
UnicodeTranslationOption = utf8

Do you have a setting like this?  I'd suggest changing this to "utf16".

Original comment by lukedell...@gmail.com on 22 Mar 2011 at 1:20

GoogleCodeExporter commented 9 years ago

Hey, you got it exactly right. After I changed the setting 
UnicodeTranslationOption = utf16, nvarchar data does show up correctly.

However, would this cause any other issues with other type of data that might 
actually need to be translated using utf8, instead of utf16?

Thanks

Original comment by jbao...@gmail.com on 22 Mar 2011 at 1:58

GoogleCodeExporter commented 9 years ago

I do not expect you'd have problems with other data types.  My guess is that 
this option attempts to cater for non-unicode-aware clients, but if you're only 
using pyodbc as a client then setting this to utf16 should be fine.

Original comment by lukedell...@gmail.com on 22 Mar 2011 at 3:58

GoogleCodeExporter commented 9 years ago

Original comment by mkleehammer on 20 May 2011 at 7:47

Changed state: NoFix

GoogleCodeExporter commented 9 years ago

I just looked at this issue more closely.

What the Netezza ODBC driver is doing is returning SQL_WVARCHAR regardless of 
whether it is using UTF-8 or UTF-16 encoding, and regardless of whether the 
result is unicode.

In PYODBC, there actually is support for receiving Unicode through UTF-8, 
however the switch statement in GetDataString() in getdata.cpp that does this 
only checks the flag 'unicode_results' when the driver had reported SQL_CHAR, 
SQL_VARCHAR, SQL_LONGVARCHAR, SQL_GUID, and SQL_SS_XML.  If I move SQL_WVARCHAR 
up into that section of the switch statement, then this bug goes away.  
However, the whole logic section only seems to apply for Python < 3.0 (not sure 
why that is; does Python 3.0 somehow figure out Unicode on its own?)

Anyway, attached is my patched getdata.cpp which fixes this problem.  The most 
important question, however, is does this work for all ODBC drivers?

Original comment by cpar...@gmail.com on 19 Jan 2012 at 9:10

Attachments:

getdata.cpp

GoogleCodeExporter commented 9 years ago

The fundamental problem is the ODBC does *not* support UTF8 or UTF16.  The 
specification only allows for 1-byte ASCII/latin1 and 2-byte UCS2.  The UTF 
encodings are actually variable width with a minimum size of 8-bits or 16-bits. 
 They can be up to 4 or 5 characters wide.

I don't know why driver writers can't understand this.  Their job is to convert 
whatever the database sends into ASCII or UCS2.  Client software, such as 
pyodbc, simply cannot guess what encoding is being used -- instead the CHAR vs 
WCHAR is how the driver tells the client what to expect.

The unicode_results flag is used to have pyodbc convert actual ANSI results to 
Unicode before returning them to the user, not to expect Unicode when the 
driver says it is ANSI.

I have considered adding a different setting for an encoding such as UTF8, in 
particular so that MySQL users can also tell pyodbc that their driver broken 
and will lie about the result type.  That may be the fix you need.

In the meantime, if you can configure the driver to use UCS2, it should work.  
Is that possible?

Original comment by mkleehammer on 20 Jan 2012 at 3:11

GoogleCodeExporter commented 9 years ago

"Change UnicodeTranslationOption to "utf16" in odbcinst.ini" is actionable. 
"Configure the driver to use UCS2" is more of a vision statement. I'm a simple 
programmer!

I'll see if we can get IBM to fix the driver, but is pyodbc fixing it? Has a 
fix been released?

Original comment by ionFree...@gmail.com on 27 Jun 2014 at 7:07

djhenderson / pyodbc

pyodbc / unixODBC problem reading nvarchar #165