dolphinsmalltalk / Dolphin

Dolphin Smalltalk Core Image
MIT License
301 stars 58 forks source link

Database Connection - Incorrect buffer size for wide string columns under some databases/drivers #768

Closed rko281 closed 5 years ago

rko281 commented 5 years ago

I'm updating my ReStore database framework following the recent changes to DBConnection and am running into a few issues with wide (UTF16) string columns.

Under MSAccess and SQLite ODBC drivers the size of wide string columns is returned as their character size, not the storage size. This leads to a bounds error when recovering data from those columns. Example:

connection := DBConnection new.
connection dsn: 'AnAccessDSN'.
connection connect.

connection exec: 'create table utf16_test (a_string VARCHAR (8))'.
(connection exec: 'select * from utf16_test') describeCols: #(1).  
"#(a DBColAttr(1, a_string, SQL_WVARCHAR, 8))"

connection exec: 'insert into utf16_test(a_string) values(''12345678'')'.
(connection exec: 'select * from utf16_test') results next  
"Index 16 is out of bounds"

The issue doesn't present with the other databases/drivers I've tested (SQLServer Express, Postrgres, MySQL).

Looking at senders of sqlColAttribute:columnNumber:... the issue can partly be traced to the fieldIdentifier parameter of SQL_COLUMN_LENGTH which has the comment "Note using ODBC 2.x definition of column length". I can't find the relevant documentation for this parameter, however the documentation for the ODBC 3 near-equivalent of SQL_DESC_LENGTH (1003) states:

A numeric value that is either the maximum or actual character length of a character string or binary data type.

i.e. the character length, not the storage length. Using this parameter instead of SQL_COLUMN_LENGTH displays the same issue.

The ODBC 3 parameter SQL_DESC_OCTET_LENGTH (1013; I can't find an ODBC 2.0 version of this) would appear to be the correct one to use; the documentation states:

The length, in bytes, of a character string or binary data type.

Indeed, changing to this parameter fixes the wide string issue for both Access and SQLite. Unfortunately this causes a different issue for Access DATETIME fields:

connection exec: 'create table datetime_test (a_timestamp DATETIME)'.
(connection exec: 'select * from datetime_test') describeCols: #(1). 
"With SQL_COLUMN_LENGTH         #(a DBColAttr(1, a_timestamp, SQL_TIMESTAMP, 16))"
"With SQL_DESC_OCTET_LENGTH     #(a DBColAttr(1, a_timestamp, SQL_TIMESTAMP, 8))"

Recovering DATETIME fields with this value results in an undersized buffer and usually the image crashing out.

The solution I'm using for now is to use SQL_DESC_OCTET_LENGTH when dealing with string or binary columns (as per the documentation) and SQL_COLUMN_LENGTH for all other column types - this works for Access and the other databases I've tried.

I'd like to submit this fix however I'm a little uncomfortable mixing parameters from different ODBC versions. I've tried using SQL_DESC_LENGTH instead of SQL_COLUMN_LENGTH for non-string columns however this causes Access to return 19 (the column precision) as the size of DATETIME columns - which works since it delivers a big enough buffer, but is technically incorrect.

I also note that the documentation for both SQL_DESC_LENGTH and SQL_DESC_OCTET_LENGTH only refer to their use with string/binary columns, though they do also work (Access oddities aside) with other column types. I'd be interested to see the documentation for SQL_COLUMN_LENGTH if anyone can locate it.

blairmcg commented 5 years ago

I think you are right to be reticent John. DBConnection class>>setDefaultEnvAttrs is requesting ODBC 2.x as the version level, so the driver is not required to understand the 3.x identifiers. The SQL Server 2017 documentation for SQLColAttribute has a Backward Compatibility section that refers to this. It implies that SQL_COLUMN_LENGTH is equivalent to SQL_DESC_PRECISION, but your evidence suggests otherwise. I suspect that one of the reasons this changed between 2.x and 3.x is that the meaning of these values and relationship to data transfer size was not clearly specified in 2.x, hence the variable behaviour between drivers. I think the "right" thing to do here is to make the switch to requesting ODBC 3.x level driver support at the environment level, and then use the Transfer Octet Length as you have done (i.e. SQL_DESC_OCTET_LENGTH) for "all character and binary types". It is unfortunate that this doesn't work for all column types, but then for the other types (or at least the ones supported in Database Connection currently, which does not include the time interval types) they seem to have fixed sizes. For fixed sizes there is no need to request/use this information. I don't think we should use the ODBC 3.x constants after telling ODBC (and presumably, that gets fed through to the underlying drivers) that we are requesting 2.x level APIs - if that works, it will be by luck. The problem with switching to ODBC 3.x completely of course, is that there are likely to be other things that need changing, and in the absence of tests it is a bit of a blind change. I'll try and see if I can dig up the old 2.x documentation somewhere first.

blairmcg commented 5 years ago

Microsoft ODBC 2.0 Programmer’s Manual.

rko281 commented 5 years ago

Thanks Blair. So SQL_COLUMN_LENGTHis:

The length in bytes of data transferred on an SQLGetData or SQLFetch operation if SQL_C_DEFAULT is specified.

So it is supposed to be the length in bytes for any column type - though only if SQL_C_DEFAULT is used, which isn't the case with Dolphin. If I change DBBoundBuffer>>bind: to use SQL_C_DEFAULT then with Access I actually get 8 bytes of data back for the wide string column in the example, containing the ANSI representation of the string (which is then interpreted as a Utf16String, rendering nonsense). So Access does follow the standard in a weird way (possibly unicode isn't properly supported under ODBC 2?). SQLite on the other hand still returns the actual Utf16String data but only an 8 byte buffer.

For what it's worth I've tried running the ReStore test suite (>1000 unit tests involving database interaction) with ODBC 3.x requested as the version level and using SQL_DESC_OCTET_LENGTH instead of SQL_COLUMN_LENGTH. I also had to add the new SQL_TYPE_DATE, SQL_TYPE_TIME and SQL_TYPE_TIMESTAMP type constants to the relevant class vars and stop DBErrors being raised when SQL_NO_DATA is returned (indicating no rows affected by a write query). With these changes the tests ran clean in SQLite, so switching to ODBC 3 may be straightforward. Admittedly there were probably still 2.x constants used in various places, and the test suite also needs to be run across more databases/drivers.

This on its own doesn't solve the Access DATETIME issue though - this column type is still returned with an undersized buffer. So a full solution would be to convert to ODBC 3.x using SQL_DESC_OCTET_LENGTH, and have non-String/Binary columns define their own buffer size based on their type. I'm happy to run with this if you think it's an acceptable solution. Alternatively I can accommodate the issues in ReStore itself if it's not a priority.

blairmcg commented 5 years ago

Yes, I came to the same understanding on reading the relevant bits of the document myself.

I don't think Unicode was officially supported at all in ODBC 2.0. Certainly there is no mention of it in the document. That would explain the issues with Access. I'd assumed that the column types were in 2.x because they were already in the Dolphin code, but that must have been added later. Consequently we already have a 2.x/3.x hybrid that doesn't work correctly for some drivers. I'm also curious to understand the meaning of that comment in the Dolphin code you referenced Note using ODBC 2.x definition of column length, since the constant originates in 1.x, this suggests there was some difference in interpretation between 1.x and 2.x, but it probably doesn't matter any more.

I think the solution you suggest is the right one - my reading of the current documentation (unfortunately not really of the quality of that ODBC 2.0 PDF) is that this is pretty much the way it is intended to be done with 3.x.. Along with the modifications you already have, DBColAttr>>lengthC can just return a fixed size for the date/time structure types (maybe it is even as simple as adding these to the CTypesExtraBytes array - if the length is initialized to zero for the SQL_NO_DATA results from SQL_DESC_OCTET_LENGTH, then that will "just work") .