blue-yonder / turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
http://turbodbc.readthedocs.io/en/latest
MIT License

Issue with decoding UTF8 strings from WCHAR buffers #64

Closed yalwan-scee closed 6 years ago

yalwan-scee commented 7 years ago

If a field in a table contains (some, but not just any) Japanese characters, selecting these fields results in the following stacktrace:

SystemError: /builddir/build/BUILD/Python-3.5.3/Objects/listobject.c:307: bad argument to internal function

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yalwan/.virtualenvs/fastpq/lib/python3.5/site-packages/turbodbc/exceptions.py", line 31, in wrapper
    return f(*args, **kwds)
  File "/home/yalwan/.virtualenvs/fastpq/lib/python3.5/site-packages/turbodbc/cursor.py", line 113, in fetchall
    return [row for row in self]
  File "/home/yalwan/.virtualenvs/fastpq/lib/python3.5/site-packages/turbodbc/cursor.py", line 113, in <listcomp>
    return [row for row in self]
  File "/home/yalwan/.virtualenvs/fastpq/lib/python3.5/site-packages/turbodbc/cursor.py", line 39, in __next__
    element = self.fetchone()
  File "/home/yalwan/.virtualenvs/fastpq/lib/python3.5/site-packages/turbodbc/exceptions.py", line 31, in wrapper
    return f(*args, **kwds)
  File "/home/yalwan/.virtualenvs/fastpq/lib/python3.5/site-packages/turbodbc/cursor.py", line 105, in fetchone
    result = self.result_set.fetch_row()
SystemError: <built-in method fetch_row of PyCapsule object at 0x7fc347b3d360> returned a result with an error set

Examples of Japanese strings known to cause the issue:

ソニー・コンピュータエンタテインメント
ソニー・ピクチャーズ エンタテインメント
ベセスダ・ソフトワークス/ゼニマックス・アジア(株)

MathMagique commented 7 years ago

Hi Yaqub! Could you please state the database, ODBC driver, ODBC driver settings, turbodbc version, and table schema for future reference?

yalwan-scee commented 7 years ago

turbodbc == 1.0.5
Python == 3.5.2
Database == IBM Puredata for Analytics
Table schema: it can be reproduced using an nvarchar field

---------------------- odbc settings ---------------------------------
[NZSQL]
Driver = /usr/local/nz/lib/libnzsqlodbc3.so
Driver64 = /usr/local/nz/lib64/libnzodbc.so
Setup = /usr/local/nz/lib/libnzsqlodbc3.so
Setup64 = /usr/local/nz/lib64/libnzodbc.so
APILevel = 1
ConnectFunctions = YYN
Description = Netezza ODBC driver
DriverODBCVer = 03.51
DebugLogging = false
LogPath = /tmp
UnicodeTranslationOption = utf8
CharacterTranslationOption = all
PreFetch = 0
Socket = 33554432

MathMagique commented 7 years ago

Thanks for the info! When turbodbc 1.1 is released (sometime this weekend), you could try the prefer_unicode option to work around your problem. Still, the error message seems a bit off.

yalwan-scee commented 7 years ago

Is this in master? I'm willing to give it a go now

MathMagique commented 7 years ago

Yes, it is in master. Don't forget to update the pybind11 folder on checkout, otherwise it will not work.

yalwan-scee commented 7 years ago

I still get the same error when trying this option

MathMagique commented 7 years ago

Just double-checking: You use it as in:

>>> from turbodbc import connect, make_options
>>> options = make_options(prefer_unicode=True)
>>> connect(dsn="my_dsn", turbodbc_options=options)

The syntax for options has changed slightly with 1.1, so I want to make sure you did not simply add the prefer_unicode flag to the connect() call.

yalwan-scee commented 7 years ago

I didn't notice this. Would one pass a database override for the DSN as an option to make_options() or still as a keyword argument to connect()?

MathMagique commented 7 years ago

make_options() would only be used for options that turbodbc uses internally. Anything that goes into the connection string is passed to connect().
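
For illustration, a minimal sketch of that split; the database keyword below is a hypothetical connection string attribute, so use whatever attributes your driver accepts:

>>> from turbodbc import connect, make_options
>>> # internal turbodbc behaviour is configured via make_options()
>>> options = make_options(prefer_unicode=True)
>>> # everything else becomes part of the ODBC connection string;
>>> # 'database' is a driver-specific attribute used for illustration
>>> connect(dsn="my_dsn", database="other_db", turbodbc_options=options)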

yalwan-scee commented 7 years ago

Ok, that did not help -- now it does not even recognise the select.

turbodbc.exceptions.DatabaseError: ODBC error
state: 42000
native error code: 27
message: ERROR:  's'
error    ^ found "S" (at char 1) expecting a keyword

MathMagique commented 7 years ago

:-D Okay, perhaps that's a little endian vs. big endian problem. Is there a test database that I could use for debugging when I find the time?
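
One plausible reading of that error (an assumption, not something confirmed here): with prefer_unicode the statement text is sent as two-byte characters, so a driver that reads it as single-byte text hits a NUL right after the first letter, which would match the found "S" (at char 1) message:

>>> # "select" in UTF-16-LE: every ASCII letter is followed by a NUL byte
>>> "select".encode("utf-16-le")
b's\x00e\x00l\x00e\x00c\x00t\x00'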

yalwan-scee commented 7 years ago

http://www-01.ibm.com/support/docview.wss?uid=swg21674147

which eventually takes you here:

https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=swg-im-ibmndn

it is an emulator (the system is a whole appliance as they use FPGA accelerators in the product).

I don't have much hope you will be able to get going with that, to be honest. Do the Japanese characters in my examples not cause similar issues with PostgreSQL or MySQL?

MathMagique commented 7 years ago

I don't know if they do, but it is often related to encoding schemes. Turbodbc works best with utf-8 encodings (you might actually want to check if you can set the UnicodeTranslationOption to UTF-8), and PostgreSQL and MySQL seem to work great with UTF-8 (I usually use \u2665 for testing, which is a black filled heart). But I will check with these databases as well.
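
A minimal sketch of that kind of round-trip probe, assuming a hypothetical DSN and scratch table:

>>> from turbodbc import connect, make_options
>>> options = make_options(prefer_unicode=True)
>>> cursor = connect(dsn="my_dsn", turbodbc_options=options).cursor()  # hypothetical DSN
>>> cursor.execute("CREATE TABLE unicode_probe (s NVARCHAR(10))")      # hypothetical scratch table
>>> cursor.execute("INSERT INTO unicode_probe VALUES (?)", ["\u2665"])
>>> cursor.execute("SELECT s FROM unicode_probe").fetchall()  # expect [['\u2665']] if encodings agree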

yalwan-scee commented 7 years ago

Oh, I forgot I override the translation option in the DSN to utf8, because utf16 just doesn't work at all

yaxxie commented 7 years ago

@MathMagique did you find any time to have a look at this one?

MathMagique commented 7 years ago

@yaxxie Sorry, not yet. These last few weeks my priorities were focused on features, but with the NumPy parameter release out of the door, I will tackle open bugs and compatibility issues like this one.

yaxxie commented 7 years ago

Hi @MathMagique, I have some new information that makes this issue much more interesting -- I think the title of the bug will need to change.

I managed to trigger this issue when querying a table without Japanese characters, so the problem clearly has nothing to do with Japanese. After a lot of digging, I found that the issue is triggered by nvarchar fields whose contents are of uneven (odd) length, even with just straight ASCII characters.

MathMagique commented 7 years ago

Thanks for the update. I'll have a look as soon as I can.

paolof89 commented 7 years ago

I confirm: I have the same SystemError but no Japanese characters in the table. Is there any update on this issue?

MathMagique commented 7 years ago

Yes, there is. Japanese characters do not really matter. The thing is that the driver puts single-byte characters in UTF-8 encoding into buffers that are supposed to hold two-byte characters in UTF-16 encoding. Interpreting these UTF-8 strings as UTF-16 ones then leads to errors. I am thinking about a sane solution to this problem.
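
A minimal pure-Python sketch of that failure mode: an odd number of UTF-8 bytes can never form a valid little-endian UTF-16 sequence, which is why some values raise errors while others merely come back garbled.

>>> payload = "abc".encode("utf-8")  # three bytes, an odd count
>>> payload.decode("utf-16-le")      # UTF-16 needs an even number of bytes
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x63 in position 2: truncated data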

MathMagique commented 7 years ago

@paolof89 With which database are you experiencing the error? I am crossing fingers that it is not IBM Puredata for Analytics ;-)

yalwan-scee commented 7 years ago
----------- DDL and insert
drop table yalwan if exists;
create table yalwan
(
a byteint,
b smallint,
c integer,
d bigint,
e numeric(15,5),
f varchar(20),
g nvarchar(20),
h timestamp,
i bool,
j real,
k double
) distribute on random;

insert into yalwan values (22, 22, 22, 22, 22.22, 'hello', 'hello1', '2017-01-01 00:00:00', false, 22.22, 22.22 );
----------- Python shell
>>> import pyodbc
>>> conn = pyodbc.connect('DSN=mydb')
>>> conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8') # these ones are necessary or pyodbc vomits
>>> conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
>>> conn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf8')
>>> conn.setencoding(encoding='utf-8')
>>> cursor = conn.cursor()
>>> cursor.execute('select * from yalwan')
>>> D = cursor.fetchall()
>>> D
[(22, 22, 22, 22, Decimal('22.22000'), 'hello', 'hello1', datetime.datetime(2017, 1, 1, 0, 0), False, 22.219999313354492, 22.22)]
>>> import turbodbc
>>> tconn = turbodbc.connect(dsn='mydb')
>>> tcursor = tconn.cursor()
>>> tcursor.execute('select * from yalwan')
>>> tcursor.fetchall()
[[22, 22, 22, 22, 22.22, 'hello', '敨汬ㅯ', datetime.datetime(2017, 1, 1, 0, 0), False, 22.219999313354492, 22.22]]

The above demonstrates that decoding also happens incorrectly even if the string is of even length.
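
Indeed, the garbled value above is exactly what reinterpreting UTF-8 bytes as little-endian UTF-16 produces:

>>> # six UTF-8 bytes pair up into three UTF-16 code units:
>>> # (68,65) -> U+6568, (6c,6c) -> U+6C6C, (6f,31) -> U+316F
>>> "hello1".encode("utf-8").decode("utf-16-le")
'敨汬ㅯ'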

paolof89 commented 7 years ago

@MathMagique I'm using a MSSQL db

MathMagique commented 7 years ago

@paolof89 Have you tried the recommended settings for MSSQL?

>>> from turbodbc import connect, make_options
>>> options = make_options(prefer_unicode=True)
>>> connect(dsn="MSSQL", turbodbc_options=options)

paolof89 commented 7 years ago

@MathMagique I'm using SQLAlchemy and specify drivername='mssql+turbodbc', so I do not manage the connection options directly. Is there a way to pass them through SQLAlchemy?

MathMagique commented 7 years ago

Assuming you are using Dirk Jonker's SQLAlchemy module, this should already work as expected... Could you please post some code that reproduces the issue in your case? I can easily work with MSSQL, but cannot with IBM Puredata.
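
For what it's worth, SQLAlchemy normally forwards connect_args to the DBAPI connect() call, so a sketch like the following should pass the options through (an assumption about the dialect, not something confirmed here):

import sqlalchemy as sqa
from turbodbc import make_options

# connect_args is handed to the DBAPI connect() call by SQLAlchemy
options = make_options(prefer_unicode=True)
engine = sqa.create_engine(
    'mssql+turbodbc://dbserver/dbname?driver=ODBC+Driver+11+for+SQL+Server&trusted_connection=yes',
    connect_args={'turbodbc_options': options},
)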

paolof89 commented 7 years ago

@MathMagique here is a simple piece of code that works for one table but gives me the error for another one.

import sqlalchemy as sqa
import pandas as pd
engine = sqa.create_engine('mssql+turbodbc://dbserver/dbname?driver=ODBC+Driver+11+for+SQL+Server&trusted_connection=yes')
df = pd.read_sql('SELECT top 100000  * from mytable', engine)

The table that returns error contains the following column types:

The one that doesn't give error contains:

yaxxie commented 7 years ago

@paolof89

For the table which does not error, does the output in the nvarchar column decode correctly? In my case the system does not error but there is still clearly a decoding error.

Could I suggest you try with a small sample table with simple even and odd length ascii strings in the nvarchar columns to see different outputs?
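
A minimal sketch of that suggestion, with hypothetical DSN and table names:

import turbodbc

cursor = turbodbc.connect(dsn="my_dsn").cursor()       # hypothetical DSN
cursor.execute("CREATE TABLE probe (s NVARCHAR(20))")  # hypothetical scratch table
for value in ["ab", "abc", "hello", "hello1"]:         # even and odd lengths
    cursor.execute("INSERT INTO probe VALUES (?)", [value])
print(cursor.execute("SELECT s FROM probe").fetchall())
# on an affected driver, even-length values come back garbled
# and odd-length values raise the SystemError from above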

MathMagique commented 7 years ago

@paolof89 I have tried to reproduce the MSSQL issue with pure turbodbc on Ubuntu 16 with Python 2.7.12 and the schema data you provided.

Schema information:

>>> import turbodbc
>>> c = turbodbc.connect(dsn='MSSQL', uid='SA', pwd='StrongPassword1')
>>> cur = c.cursor()
>>> cur.execute("SELECT column_name, data_type, character_maximum_length FROM INFORMATION_SCHEMA.Columns WHERE table_name = 'test_a'").fetchall()
[[u'a', u'varchar', -1L], [u'b', u'date', None], [u'c', u'float', None], [u'd', u'bigint', None]]
>>> cur.execute("SELECT column_name, data_type, character_maximum_length FROM INFORMATION_SCHEMA.Columns WHERE table_name = 'test_b'").fetchall()
[[u'a', u'int', None], [u'b', u'nvarchar', 3L], [u'c', u'date', None], [u'd', u'float', None]]

Query results look just fine with the official MS ODBC driver:

>>> import turbodbc
>>> c = turbodbc.connect(dsn='MSSQL', uid='***', pwd='***')
>>> cur = c.cursor()
>>> cur.execute("SELECT * FROM test_a").fetchall()
[[u'', datetime.date(2017, 1, 2), 3.14, 42L], [u'', datetime.date(2018, 3, 4), 2.71, 23L]]
>>> cur.execute("SELECT * FROM test_b").fetchall()
[[42L, u'hi', datetime.date(2017, 1, 2), 3.14], [23L, u'ho', datetime.date(2018, 3, 4), 2.71]]

They also look fine with the recommended prefer_unicode=True setting:

>>> o = turbodbc.make_options(prefer_unicode=True)
>>> c = turbodbc.connect(dsn='MSSQL', uid='***', pwd='***', turbodbc_options=o)
>>> cur = c.cursor()
>>> cur.execute("SELECT * FROM test_a").fetchall()
[[datetime.date(2017, 1, 2), 3.14, 42L], [datetime.date(2018, 3, 4), 2.71, 23L]]
>>> cur.execute("SELECT * FROM test_b").fetchall()
[[42L, u'hi', datetime.date(2017, 1, 2), 3.14], [23L, u'ho', datetime.date(2018, 3, 4), 2.71]]

They still look fine with the FreeTDS driver:

>>> c = turbodbc.connect(dsn='MSSQLFreeTDS', uid='***', pwd='***', turbodbc_options=o)
>>> cur = c.cursor()
>>> cur.execute("SELECT * FROM test_a").fetchall()
[[datetime.date(2017, 1, 2), 3.14, 42L], [datetime.date(2018, 3, 4), 2.71, 23L]]
>>> cur.execute("SELECT * FROM test_b").fetchall()
[[42L, u'hi', datetime.date(2017, 1, 2), 3.14], [23L, u'ho', datetime.date(2018, 3, 4), 2.71]]

I think I need more information. Which Python version do you use? Which ODBC driver? Can you provide data that provokes the error? Can you post the exact error message?

Sorry guys for being slow on this, but it is tough to diagnose something you can't reproduce.

paolof89 commented 7 years ago

I think the problem is in the [varchar] (max) column. I can reproduce the error on this table.

CREATE TABLE [dbo].prova_turbodbc(
    [c] [varchar] (max) NULL)
INSERT INTO [dbo].[prova_turbodbc]
           ([c])
     VALUES
           ('001'), ('002'), ('003')

Python 3.5 code to reproduce the error:

import sqlalchemy as sqa
import pandas as pd
engine = sqa.create_engine('mssql+turbodbc://dbserver/dbname?driver=ODBC+Driver+11+for+SQL+Server&trusted_connection=yes')
df = pd.read_sql('SELECT * from prova_turbodbc', engine)

Error message:

SystemError: ..\Objects\listobject.c:307: bad argument to internal function

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    df = pd.read_sql('SELECT * from prova_turbodbc', engine)
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\pandas\io\sql.py", line 416, in read_sql
    chunksize=chunksize)
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\pandas\io\sql.py", line 1096, in read_query
    data = result.fetchall()
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\sqlalchemy\engine\result.py", line 1126, in fetchall
    self.cursor, self.context)
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\sqlalchemy\engine\base.py", line 1397, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\sqlalchemy\util\compat.py", line 187, in reraise
    raise value
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\sqlalchemy\engine\result.py", line 1120, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\sqlalchemy\engine\result.py", line 1071, in _fetchall_impl
    return self.cursor.fetchall()
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\turbodbc\exceptions.py", line 50, in wrapper
    return f(*args, **kwds)
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\turbodbc\cursor.py", line 228, in fetchall
    return [row for row in self]
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\turbodbc\cursor.py", line 228, in <listcomp>
    return [row for row in self]
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\turbodbc\cursor.py", line 79, in __next__
    element = self.fetchone()
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\turbodbc\exceptions.py", line 50, in wrapper
    return f(*args, **kwds)
  File "C:\Program Files\Anaconda2\envs\mle-env\lib\site-packages\turbodbc\cursor.py", line 214, in fetchone
    result = self.result_set.fetch_row()
SystemError: <built-in method fetch_row of PyCapsule object at 0x0000003DE69D3C00> returned a result with an error set

MathMagique commented 7 years ago

Having a second look, I found that even my original results are not fine for the test_a table. Without prefer_unicode=True, the strings are empty. With prefer_unicode=True, the strings are simply missing.

In SQLAlchemy, this leads to various errors in my Linux VM (not exactly the one you reported, though). @paolof89 I will split off an issue for this, as it seems the problem is (unfortunately) unrelated to what @yalwan-scee reported.

yaxxie commented 6 years ago

Hi @MathMagique

I have come back to revisit this. Through some code modifications, I can say that your remark about UTF-8 encoding being placed in buffers meant for UTF-16 encoding is correct (I changed the unicode_description to string_description and found the nvarchar columns decode correctly).

I am thinking about contributing a patch, gated behind an option (such as stupid_encoding_in_wchar_buffers -- I'm being facetious), which switches SQL_WCHAR to follow the string path and also allocates more memory when set (since we're going to get actual multibyte chars). Does this sound like an acceptable solution to the problem?

yaxxie commented 6 years ago

I'd also like to note that the current unicode path risks underallocating memory for code points that need more than one UTF-16 code unit. Should this be fixed?

MathMagique commented 6 years ago

Hi @yaxxie! In principle, I am fine with any approach that allows turbodbc to be compatible with your use case while not adding any overhead for databases/drivers that behave appropriately. The one big issue for me is that I would require a unit or integration test showing that the functionality actually works as intended. This test needs to be executed using the CI infrastructure, i.e., Travis and/or AppVeyor. Also, things should be fixed for parameters if they require fixing.

I am not sure if the UTF16 thing really risks underallocating memory. Databases that support Unicode in VARCHAR(n) fields usually report 4*n characters when pulling these values as CHAR. I would expect a database to report this as 2*n characters when pulling these values as WCHAR. If the database internally uses UTF16 to store data (also supporting multicharacter sequences), the number of actual characters in a VARCHAR(n) field would depend on the number of multicharacter sequences in the specific value.
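
For reference, a code point outside the Basic Multilingual Plane occupies two UTF-16 code units (a surrogate pair), which is where the underallocation worry comes from:

>>> clef = "\U0001D11E"                 # MUSICAL SYMBOL G CLEF, a single code point
>>> len(clef.encode("utf-16-le")) // 2  # number of UTF-16 code units
2
>>> len(clef.encode("utf-8"))           # and four bytes in UTF-8
4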

yaxxie commented 6 years ago

I made that remark because I checked the IBM PDA, which reports the plain length of the NVARCHAR field (e.g., it reports NVARCHAR(11) as 11, which would underallocate if any of those characters happened to need two UTF-16 code units rather than just one).

I did have a think about tests. Obviously we can't test IBM PDA in Travis/AppVeyor; the best I could come up with was to have an nvarchar field, e.g. in PostgreSQL, and load a few strings in (including funny multibyte chars in UTF-8) for which we know the expected outputs if we encode as UTF-8 and then decode as UTF-16. I know, this is hardly sounding like a good test. Do you have other suggestions?

MathMagique commented 6 years ago

Have you tried storing 11 multi-byte sequences in an IBM PDA NVARCHAR(11) field? If putting such values into the database and retrieving them worked correctly, then one could think about the overallocation thing.

Your test idea actually sounds nifty :joy:. It would require a big comment, though. If certain parts of the logic could be extracted to the turbodbc library, one could probably unit test most of the functionality there as well.

yaxxie commented 6 years ago

So what I did was:

- I changed the SQL_WCHAR switch in make_description.cpp to string_description instead of unicode_description and recompiled. Retrieved ASCII strings from the nvarchar column were no longer garbled.
- I inserted extra Japanese strings, one of length 10 and one of length 11. When retrieving with this change, both were truncated to 3 chars.
- I altered the table to nvarchar(22). They were both truncated to 7 chars.
- I altered the table to nvarchar(33). They were both retrieved correctly.
- Then I altered string_description::do_element_size() to return 4*maximum_length_ and recompiled. The Japanese strings (10/11) were correctly retrieved from the nvarchar(11) field.

The changes I describe above are both in the turbodbc library, not turbodbc_*, so I can probably write a unit test too. I just haven't looked at the unit test code for this yet.

How's all of this sounding? Do you think we can work towards something that could be accepted into the main project?

MathMagique commented 6 years ago

Yeah, I think we can work towards making this work. The configuration option could be force_unicode_as_utf8 or similar. (I'm not great at naming at this very second)

It amazes me how wildly differently ODBC drivers behave :-D. For the truncation thing, it would be best to have a configuration option like force_extra_bytes_for_multibyte_unicode_characters that one can use to increase the buffer space by a factor of 4 (CHAR) or 2 (WCHAR). That should probably be a separate pull request.

I will be happy to accept your pull request. I may not have much time to review stuff next week, but if you have something that would benefit from intermediate review, just open the PR even if you are not done yet. That would also help with getting started with the build system and all.

yaxxie commented 6 years ago

Any suggestions on how to get options visible to description and subclasses? I want to initialise the base class with a protected const copy of options so we can access it from do_element_size() methods -- does this seem sane?

yaxxie commented 6 years ago

Well that seems complicated, the other option is to "lie" about the max length when passing it to the constructor :thinking:

MathMagique commented 6 years ago

Lying seems like an easy idea. Just think of the length in single-byte characters, then it is not even a lie ;-)

yaxxie commented 6 years ago

Opened PR #154 to deal with the option for unicode allocation. Will bring another PR shortly for decoding UTF-8 in WCHAR buffers. Neither has unit tests yet.

yaxxie commented 6 years ago

Also added #155