blue-yonder / turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
http://turbodbc.readthedocs.io/en/latest
MIT License
617 stars 86 forks source link

Handling nan values for float or int types turbodbc #221

Open dayxx369 opened 5 years ago

dayxx369 commented 5 years ago

Hello,

I really enjoy using turbodbc. It is very helpful, but there is one concern. It is difficult to handle float or int columns with nan values; even when I turn them to None or try to make the entire column object type so that I can insert null values as a string, it does not work properly. Is there any work around? Thanks!

xhochy commented 5 years ago

@dayxx369 How do you use turbodbc? Do you use fetchall, fetchallnumpy or fetchallarrow? All of these should support Nones in the columns.

dayxx369 commented 5 years ago

@xhochy Hey, I am importing data to our database. I use: cursor.fast_executemany = True cursor.executemanycolumns(sql, values_df) conn.commit()

This typically works for me for most dataframes I work with, but this one (35 columns, 6k rows) is not working and I believe it is because of the nan values in the int or float columns (but I change those to object columns and replace the nan with 'NA' string and still getting errors).

xhochy commented 5 years ago

@dayxx369 Can you post the error message, please? This may provide more insight.

Otherwise please try to post an https://stackoverflow.com/help/minimal-reproducible-example as this would make it easier for me to understand what is going wrong.

dayxx369 commented 5 years ago

Unable to cast Python instance to C++ type (compile in debug mode for details)

or I would get

Error: ODBC error state: 42000 native error code: 8023 message: [Microsoft][ODBC SQL Server Driver][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 34 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision.

the first is when I make all the columns Object types and the last is when I try not to change the values of the int or float columns

xhochy commented 5 years ago

This definitely seems like a more complex problem, can you make a https://stackoverflow.com/help/minimal-reproducible-example for the second case?

yhf8377 commented 4 years ago

I encountered the same issue. Here is the minimal reproducible example as requested.

# docker-compose.yml
version: '3'
services:
  mssql-db:
    image: mcr.microsoft.com/mssql/server:2017-latest
    networks:
      - 'internal'
    ports:
      - '1433:1433'
    environment:
      - 'ACCEPT_EULA=Y'
      - 'SA_PASSWORD=3nQ5bvvKo3vDrBJ6'
      - 'MSSQL_PID=Developer'

  turbodbc-client:
    build:
      context: .
      dockerfile: Dockerfile
    networks:
      - 'internal'
    volumes:
      - './test.py:/test.py:ro'

networks:
  internal:
# Dockerfile
FROM python:3.7

RUN apt-get update \
    && apt-get install -y --no-install-recommends gnupg2 libboost-locale-dev unixodbc-dev python3-dev \
    && curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
    && curl https://packages.microsoft.com/config/debian/$(cat /etc/debian_version | cut -d'.' -f1)/prod.list > /etc/apt/sources.list.d/mssql-release.list \
    && apt-get update \
    && ACCEPT_EULA=Y apt-get install -y --no-install-recommends msodbcsql17 \
    && pip --no-cache-dir install numpy pandas sqlalchemy pybind11 turbodbc sqlalchemy-turbodbc \
    && pip --no-cache-dir install git+https://github.com/blaze/odo.git \
    && echo "[MSSQL]" >/etc/odbc.ini \
    && echo "Driver=$(head -n1 /etc/odbcinst.ini | sed -e 's/\[//' -e 's/\]//')" >>/etc/odbc.ini \
    && echo "Server=mssql-db,1433" >>/etc/odbc.ini \
    && echo "Database=test_db" >>/etc/odbc.ini \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/* /tmp/*

CMD [ "python", "/test.py" ]
# test.py
import numpy as np
import pandas as pd
import turbodbc as odbc
import odo

df_1 = pd.DataFrame({'value': [0.0, 1.0, 2.0, 3.0]})
df_2 = pd.DataFrame({'value': [0.0, 1.0, 2.0, 3.0, np.nan]})

options = odbc.make_options(prefer_unicode=True)
conn = odbc.connect(dsn='MSSQL', uid='sa', pwd='3nQ5bvvKo3vDrBJ6', turbodbc_options=options)

uri = 'mssql+turbodbc://sa:3nQ5bvvKo3vDrBJ6@MSSQL::test_tbl'

odo.odo(df_1, uri)
print('df_1 imported')

odo.odo(df_2, uri)      # this line will throw an error
print('df_2 imported') 

Here are the steps to re-produce the error:

# Run these commands in Bash Shell
docker-compose build
docker-compose up -d mssql-db

docker-compose exec mssql-db /bin/bash -c "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 3nQ5bvvKo3vDrBJ6 -Q 'SELECT @@version'"
docker-compose exec mssql-db /bin/bash -c "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 3nQ5bvvKo3vDrBJ6 -Q 'CREATE DATABASE test_db'"
docker-compose exec mssql-db /bin/bash -c "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 3nQ5bvvKo3vDrBJ6 -Q 'USE test_db; CREATE TABLE test_tbl (value NUMERIC(20,4))'"

docker-compose run --rm turbodbc-client

Here are the outputs:

frank@Franks-MacBook-Pro ~/Git/turbodbc-test % docker-compose build
mssql-db uses an image, skipping
Building turbodbc-client
Step 1/3 : FROM python:3.7
 ---> 023b89039ba4
Step 2/3 : RUN apt-get update     && apt-get install -y --no-install-recommends gnupg2 libboost-locale-dev unixodbc-dev python3-dev     && curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -     && curl https://packages.microsoft.com/config/debian/$(cat /etc/debian_version | cut -d'.' -f1)/prod.list > /etc/apt/sources.list.d/mssql-release.list     && apt-get update     && ACCEPT_EULA=Y apt-get install -y --no-install-recommends msodbcsql17     && pip --no-cache-dir install numpy pandas sqlalchemy pybind11 turbodbc sqlalchemy-turbodbc     && pip --no-cache-dir install git+https://github.com/blaze/odo.git     && echo "[MSSQL]" >/etc/odbc.ini     && echo "Driver=$(head -n1 /etc/odbcinst.ini | sed -e 's/\[//' -e 's/\]//')" >>/etc/odbc.ini     && echo "Server=mssql-db,1433" >>/etc/odbc.ini     && echo "Database=test_db" >>/etc/odbc.ini     && apt-get autoremove -y     && apt-get clean -y     && rm -rf /var/lib/apt/lists/* /tmp/*
 ---> Using cache
 ---> df89e9a85037
Step 3/3 : CMD [ "python", "/test.py" ]
 ---> Using cache
 ---> 35a61c32f63f
Successfully built 35a61c32f63f
Successfully tagged turbodbc-test_turbodbc-client:latest
frank@Franks-MacBook-Pro ~/Git/turbodbc-test % docker-compose up -d mssql-db
Creating network "turbodbc-test_internal" with the default driver
Creating turbodbc-test_mssql-db_1 ... done
frank@Franks-MacBook-Pro ~/Git/turbodbc-test % docker-compose exec mssql-db /bin/bash -c "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 3nQ5bvvKo3vDrBJ6 -Q 'SELECT @@version'"

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Microsoft SQL Server 2017 (RTM-CU17) (KB4515579) - 14.0.3238.1 (X64) 
    Sep 13 2019 15:49:57 
    Copyright (C) 2017 Microsoft Corporation
    Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS)                                                                                                            

(1 rows affected)
frank@Franks-MacBook-Pro ~/Git/turbodbc-test % docker-compose exec mssql-db /bin/bash -c "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 3nQ5bvvKo3vDrBJ6 -Q 'CREATE DATABASE test_db'"
frank@Franks-MacBook-Pro ~/Git/turbodbc-test % docker-compose exec mssql-db /bin/bash -c "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 3nQ5bvvKo3vDrBJ6 -Q 'USE test_db; CREATE TABLE test_tbl (value NUMERIC(20,4))'"
Changed database context to 'test_db'.
frank@Franks-MacBook-Pro ~/Git/turbodbc-test % docker-compose run --rm turbodbc-client
df_1 imported
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/turbodbc/exceptions.py", line 50, in wrapper
    return f(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/turbodbc/cursor.py", line 165, in executemany
    buffer.flush()
turbodbc_intern.Error: ODBC error
state: 42000
native error code: 8023
message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 1 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy_turbodbc/dialect.py", line 72, in do_executemany
    cursor.executemany(statement, list(parameters))
  File "/usr/local/lib/python3.7/site-packages/turbodbc/exceptions.py", line 52, in wrapper
    raise DatabaseError(str(e))
turbodbc.exceptions.DatabaseError: ODBC error
state: 42000
native error code: 8023
message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 1 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/test.py", line 17, in <module>
    odo.odo(df_2, uri)      # this line will throw an error
  File "/usr/local/lib/python3.7/site-packages/odo/odo.py", line 91, in odo
    return into(target, source, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/multipledispatch/dispatcher.py", line 278, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/odo/into.py", line 43, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/odo/into.py", line 143, in into_string
    return into(a, b, dshape=dshape, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/multipledispatch/dispatcher.py", line 278, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/odo/into.py", line 43, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/odo/into.py", line 131, in into_object
    return append(target, source, dshape=dshape, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/multipledispatch/dispatcher.py", line 278, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/odo/backends/sql.py", line 583, in append_anything_to_sql_Table
    return append(t, convert(Iterator, o, **kwargs), **kwargs)
  File "/usr/local/lib/python3.7/site-packages/multipledispatch/dispatcher.py", line 278, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/odo/backends/sql.py", line 569, in append_iterator_to_table
    bind.execute(t.insert(), chunk)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2182, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 982, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1101, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1250, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1476, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy_turbodbc/dialect.py", line 72, in do_executemany
    cursor.executemany(statement, list(parameters))
  File "/usr/local/lib/python3.7/site-packages/turbodbc/exceptions.py", line 52, in wrapper
    raise DatabaseError(str(e))
sqlalchemy.exc.DatabaseError: (turbodbc.exceptions.DatabaseError) ODBC error
state: 42000
native error code: 8023
message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 1 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values. An example of an invalid value is data of numeric type with scale greater than precision.
[SQL: INSERT INTO test_tbl (value) VALUES (?)]
[parameters: ((0.0,), (1.0,), (2.0,), (3.0,), (nan,))]
(Background on this error at: http://sqlalche.me/e/4xp6)

As you can see from the output, the message "df_1 imported" indicates that the database connection and data import is working when there is no np.nan value presents. However, as soon as we have a np.nan we will receive an error.

In data science a NaN value is a valid for many useful scenarios. I would like to be able to keep these values rather than re-code them to zeros or something else.

alexrabe91 commented 4 years ago

is there any update for that issue?

yputter commented 4 years ago

Maybe you could try converting your pandas dataframe to a list of numpy maskedarrays (one for each column), I think this is mentioned in the turbodbc docs. A rough (incomplete) code sketch:

# Construct numpy maskedarrays from df
sql_values = []
    for col in df.columns:
        kwargs = dict(data=df[col].values)
        if np.issubdtype(df[col].dtype, np.datetime64):
            if any(df[col].isna()):
                raise ValueError  # NaT datetimes in numpy maskedarray don't work, so no mask
        else:
            kwargs["mask"] = pd.isnull(df[col].values)
        sql_values.append(np.ma.MaskedArray(**kwargs))

then something like

with connection:
        try:
            cursor = connection.cursor()
            cursor.executemanycolumns(sql_query, sql_values)
            connection.commit()
        except Exception as e:
            connection.rollback()
            logging.critical(e)