ClickHouse / clickhouse-odbc

ODBC driver for ClickHouse
https://clickhouse.tech
Apache License 2.0
247 stars 88 forks source link

Issues with Null Values - Clickhouse ODBC Driver for Windows #192

Open ajennings67 opened 5 years ago

ajennings67 commented 5 years ago

I have been experiencing some issues when trying to query data that contains null values via the Clickhouse ODBC Driver for Windows. I have tested reading and writing from both Ubuntu and Windows. Reading from Windows always fails and reading from Ubuntu always succeeds, regardless of which platform was used to write.

For the sake of space, I won't include the utility I wrote to read and write to CH, but I can provide details as needed. It uses the following format to write:

cmd = f'cat {path} | clickhouse-client -h {server} --query="INSERT INTO {table} FORMAT Parquet"'

and the following format to read:

    with pyodbc.connect(connection_str, autocommit=True) as con:
        con.setencoding("utf-8")
        df = pd.read_sql(qry, con)
    return df

where the connection uses the Clickhouse ODBC Driver for Windows.

I use the datatype Nullable(Float64) for all the columns in my table.

Expected Behavior

Create example, write, and read via Ubuntu: image

Actual Behavior

Using the same example from above, except we are using the Windows ODBC Driver: image

Error:

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-5-483e3b2a980f> in <module>
----> 1 windowsOutput = ch.query_df("SELECT * FROM testODBC")

H:\BOND_TRA\ATJ\Quant\Projects\DMA\clickhouse.py in query_df(qry)
    127     with pyodbc.connect(connection_str, autocommit=True) as con:
    128         con.setencoding("utf-8")
--> 129         df = pd.read_sql(qry, con)
    130     return df
    131 

~\AppData\Local\Continuum\anaconda3\envs\DMA_CH\lib\site-packages\pandas\io\sql.py in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize)
    378             sql, index_col=index_col, params=params,
    379             coerce_float=coerce_float, parse_dates=parse_dates,
--> 380             chunksize=chunksize)
    381 
    382     try:

~\AppData\Local\Continuum\anaconda3\envs\DMA_CH\lib\site-packages\pandas\io\sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
   1475                                         parse_dates=parse_dates)
   1476         else:
-> 1477             data = self._fetchall_as_list(cursor)
   1478             cursor.close()
   1479 

~\AppData\Local\Continuum\anaconda3\envs\DMA_CH\lib\site-packages\pandas\io\sql.py in _fetchall_as_list(self, cur)
   1484 
   1485     def _fetchall_as_list(self, cur):
-> 1486         result = cur.fetchall()
   1487         if not isinstance(result, list):
   1488             result = list(result)

Error: ('HY000', '[HY000] Syntax error: Not a valid floating-point number: á´ºáµ\x81ᴸᴸ (1) (SQLGetData)')

To ensure this isn't an issue related to writing via Window, here we use windows to read the table written by Linux: image

Further, linux can read the data written by windows: image

Now that is issue is specific to Windows reading, we can see that the issue stems from the Nan...

No Nan in query results: image

Nan included in query results image

Summary

It appears that the Windows ODBC Driver is not able to handle Nan values in query result

Thanks in advance and please let me know if I can provide additional information.

filimonov commented 4 years ago

Can you please provide some minimal example of python code using some standard libs, so we can check/reproduce that? It it not very handy to reconstruct the code from screenshots and guess what those function df_to_ch, query_df etc. do and where do they came from.

You can just upload some snippet, and fragments of H:\BOND_TRA\ATJ\Quant\Projects\DMA\clickhouse.py to gist. Thank you.

traceon commented 4 years ago

Meanwhile, I tried with ODBCTest tool (in Windows) and isql tool (in Linux), and both, null's and nan's seem to work with either of those tools as expected in the recent code. I was trying with changes from this PR https://github.com/ClickHouse/clickhouse-odbc/pull/211 applied to the master branch locally, but the nature of the involved code suggests that the latest release (v1.1.1.20191108) should work in this case too.

Table created, populated, and verified as:

$ clickhouse-client --query "CREATE TABLE nullable_nan_col_issue_tbl ( nullable_nan_col Nullable(Float64) ) ENGINE = Memory"
$ clickhouse-client --query "INSERT INTO nullable_nan_col_issue_tbl VALUES ( NULL ), ( NAN ), (1.23456789)"
$ clickhouse-client --query "SELECT * FROM nullable_nan_col_issue_tbl"
\N
nan
1.2345678900000001

ODBCTool data retrieval result:

Full Connect(Default)

    Env. Attr. SQL_ATTR_ODBC_VERSION set to SQL_OV_ODBC3

    Successfully connected to DSN 'ClickHouse_DSNW_Debug'.
SQLExecDirect:
                In:             Statementhandle = 0x0000000002064290, StatementText = "SELECT * FROM nullable_nan_col_issue_tbl", Statementlength = 40
                Return: SQL_SUCCESS=0

Get Data All:
"nullable_nan_col"
<Null>
nan
1.2345678900000001
3 rows fetched from 1 column.

isql data retrieval result:

$ isql ClickHouse_DSNW_Debug
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL> SELECT * FROM nullable_nan_col_issue_tbl
+-------------------+
| nullable_nan_col  |
+-------------------+
|                   |
| nan               |
| 1.2345678900000001|
+-------------------+
SQLRowCount returns 3
3 rows fetched
SQL>