blue-yonder / turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
http://turbodbc.readthedocs.io/en/latest
MIT License

How to deal with Snowflake string length reporting #395

Open jonashaag opened 9 months ago

jonashaag commented 9 months ago

Snowflake's ODBC driver reports all CHAR columns as having length 16 MiB (the maximum length of a string in Snowflake). Not sure if this is a problem with the ODBC driver or with Turbodbc.

Currently you can't really read batches with a reasonable number of rows, because Turbodbc assumes every row requires a buffer of n_string_cols * 16 MiB.
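A minimal sketch of the buffer arithmetic behind this, assuming the driver reserves rows_per_batch * reported_column_size bytes per string column (the function and its parameters are illustrative, not Turbodbc's actual internals):

```python
# 16 MiB: the length Snowflake's ODBC driver reports for every CHAR column.
SNOWFLAKE_MAX_VARCHAR = 16 * 2**20


def fetch_buffer_bytes(n_string_cols: int, rows_per_batch: int,
                       reported_len: int = SNOWFLAKE_MAX_VARCHAR) -> int:
    """Worst-case bytes a driver must reserve for one fetch batch."""
    return n_string_cols * reported_len * rows_per_batch


# Even a modest batch becomes enormous when every column claims 16 MiB:
buf = fetch_buffer_bytes(n_string_cols=5, rows_per_batch=1000)
print(buf / 2**30)  # ~78 GiB for 5 string columns and 1000 rows
```

This is why batch sizes have to shrink to a handful of rows to fit in memory, defeating the point of bulk fetching.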

jonashaag commented 9 months ago

arrow-odbc-py solution: https://github.com/pacman82/arrow-odbc-py/issues/56

pacman82 commented 9 months ago

I fear it is a problem with the ODBC standard 😅

ODBC 4.0 is going to fix this, since it allows defining exception behavior for truncation when fetching data. Yet this has been in the making for a while and is not rolled out. Currently this has to be fixed at the application level. Use domain knowledge to specify more sensible upper bounds. If you do not know them, run queries to find them.

I could imagine supporting running a query twice to solve this in arrow-odbc: once to figure out the maximum length of the values per column, and once more to actually produce the result. Yet I always feel applications which know the domains they are operating in could come up with cleaner solutions, or at least use persistence to remember the previous maximum column lengths.
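A hedged sketch of that "probe first" idea: build one query that reports the true maximum length per string column, then feed those numbers back as upper bounds. The table and column names ("my_table", "name", "comment") are hypothetical placeholders, not anything from this issue:

```python
# Sketch of the first pass of the two-pass approach described above:
# probe the actual maximum string lengths before sizing any buffers.
def max_length_probe(table: str, string_columns: list[str]) -> str:
    """Build a query returning the maximum length of each string column."""
    selects = ", ".join(
        f"MAX(LENGTH({col})) AS max_{col}" for col in string_columns
    )
    return f"SELECT {selects} FROM {table}"


sql = max_length_probe("my_table", ["name", "comment"])
print(sql)
# SELECT MAX(LENGTH(name)) AS max_name, MAX(LENGTH(comment)) AS max_comment FROM my_table
```

The results of this probe could be persisted between runs (the "use persistence" point above), or applied directly by CASTing the columns to VARCHAR(n) in the actual query so the driver reports sane lengths.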

The only alternative in the current ODBC standard is not using bulk fetches, yet bulk fetching is the very thing that makes arrow-odbc and turbodbc fast.