exasol / pyexasol

Exasol Python driver with low overhead, fast HTTP transport and compression
MIT License
71 stars 39 forks source link

bigint and certain decimals returned as string #95

Closed tglunde closed 2 years ago

tglunde commented 2 years ago

Implementing dbt-exasol I am following dbt-labs test suite to verify functionality of the dbt-exasol adapter.

In this test suite and also regarding a previous issue equality test dbt-utils, we have noticed a behaviour of the pyexasol, that I am not sure whether this is by design or accident.

When testing two relations to be equal, the adapter test is checking row counts and bi-directional EXCEPT query to realize the test:

with diff_count as (
    SELECT
        1 as id,
        COUNT(*) as num_missing FROM (
            (SELECT {columns} FROM {relation_a} {except_op}
             SELECT {columns} FROM {relation_b})
             UNION ALL
            (SELECT {columns} FROM {relation_b} {except_op}
             SELECT {columns} FROM {relation_a})
        ) as a
), table_a as (
    SELECT COUNT(*) as num_rows FROM {relation_a}
), table_b as (
    SELECT COUNT(*) as num_rows FROM {relation_b}
), row_count_diff as (
    select
        1 as id,
        table_a.num_rows - table_b.num_rows as difference
    from table_a, table_b
)
select
    row_count_diff.difference as row_count_difference,
    diff_count.num_missing as num_mismatched
from row_count_diff
join diff_count using (id)

Executing the test using test tables outputs the BIGINT column from the row_count_diff.difference CTE as follows:

('0', 0)

In dbt standard adapters (SQLAdapter as base class), isinstance(row[0]) is being used to figure that first column value is a string, because pyexasol delivers BIGINT and decimal with precisions > 18 in quotes / as strings. These standard equality tests then fail, because assertion on '0' failed because expected difference in row count is 0.

To then test the pyexasol with different datatypes, I created the following test python script to confirm behavior of pyexasol:

import pyexasol

TEST_SQL = """
with testsql as (
    select
        cast(1 as INTEGER) as value_int,
        cast(1 as decimal(18,0)) as value_decimal18,
        cast(1.111 as float) as value_float,
        cast(1.111 as decimal(6,3)) as value_decimal,
        cast(1 as decimal(19,0)) as value_decimal19,
        cast(1 as BIGINT) as value_bigint,
        cast('test' as VARCHAR(2000000)) AS value_text,
        cast(9999999999111111111111111.111 as double precision) as value_double,
        cast(true as boolean) as value_bool,
        cast(date'2022-01-01' AS date) AS value_date,
        cast(timestamp'2022-01-01 20:00:00.000' AS timestamp) AS value_ts,
        cast('2022-01-01 10:00:00.199' AS TIMESTAMP WITH LOCAL TIME ZONE) AS value_ts_tz
)
select * from testsql
"""

C = pyexasol.connect(dsn="localhost:8563", user="sys", password="start123")

stmt = C.execute(sql)

for row in stmt:
    print(row)
    print(
        "{}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}".format(
            type(row[0]),
            type(row[1]),
            type(row[2]),
            type(row[3]),
            type(row[4]),
            type(row[5]),
            type(row[6]),
            type(row[7]),
            type(row[8]),
            type(row[9]),
            type(row[10]),
            type(row[11]),
        )
    )
    assert isinstance(row[0], int)
    assert isinstance(row[1], int)
    assert isinstance(row[2], float)
    assert isinstance(row[3], float)
    assert isinstance(row[4], float)
    assert isinstance(row[5], int)
    assert isinstance(row[6], str)
    assert isinstance(row[7], float)
    assert isinstance(row[8], bool)
    assert isinstance(row[9], str)
    assert isinstance(row[10], str)
    assert isinstance(row[11], str)

C.close()

Using a exasol 7.1.11 on my local machine, I am getting the following output for the print(row) statement:

(1, 1, 1.111, '1.111', '1', '1', 'test', 9.99999999911111e+24, True, '2022-01-01', '2022-01-01 20:00:00.000000', '2022-01-01 10:00:00.199000')

This shows that INTEGER, DECIMAL(18,0) and FLOAT are represented as numbers - which leads to correct interpretations isinstance python method. But the DECIMAL(6,3), DECIMAL(19,0) and BIGINT numbers are being returned as strings and asserts will fail using isinstance.

My question is why is pyexasol showing this behaviour - is it by design? The standard SQL adapter implementation from dbt-labs is assuming that isinstance is delivering correct datatypes to detect text columns. This is why with pyexasol BIGINT and some DECIMAL datatypes are being treated as strings by default.

One solution would be to change the default SQLAdapter's behaviour to actually use schema information delivered by pyexasol instead of using the isinstance method. But I wanted to understand why pyexasol seems to be behaving differently than Postgres, Snowflake and Redhisft.

Regards, Torsten

littleK0i commented 2 years ago

This behaviour is dictated by Exasol Websocket protocol. It uses basic JSON for data transport, and it is not possible to represent big integers and decimals with fractions part in basic JSON without using strings.

Pyexasol does not transform any data by default and returns everything "as is".

However, you may change it by passing a custom function to fetch_mapper connection option.

You may use this mapper or create your own.

Example: https://github.com/exasol/pyexasol/blob/master/examples/a04_fetch_mapper.py


Please note, some transformations are expensive and will noticeably reduce fetching performance, which might be a problem for large data sets.

tglunde commented 2 years ago

Thanks for the answer. Since this is something out of the scope of pyexasol and I agree that adding this mapper to every fetch is probably an unnecessary overhead in the dbt-exasol adapter, I would rather try to overwrite the above-mentioned logic in SQLAdapter - where only for these specific cases (big integers and decimals with fraction) I would replace isinstance with using schema information to determine whether a cell is of type string. But one more question. Can you confirm that BIGINT and DECIMALs with fraction are the only datatypes that the Exasol Websocket Protocol is converting into strings? Just to make sure - do you have a link to a resource that describes the problem with Exasol Websocket protocol and "basic JSON" to use strings? Thx, Torsten

tkilias commented 2 years ago

@tglunde Here is the specification of the websockets-api . However, I am not sure, if it states, how the types get encoded, but my guess, everything that can't be represented by JSON without loss, will be a string.

tkilias commented 2 years ago

@tglunde I found actually a description of what values you can expect in the JSON.

<string | number | true | false | null>

And a number only has a limited precision in JSON, so if the Column needs more precision, the value will be a string. And, types like HASHTYPE or GEOMETRY are then also strings.

https://github.com/exasol/websocket-api/blob/master/docs/commands/fetchV1.md

tglunde commented 2 years ago

Hi. I have implemented the workaround in dbt-exasol adapter, overwriting the get_result_from_cursor method, where I convert strings that have decimal or bigint into Decimal within python. Since this is only used for test results I am sure there will not be any use case with HASHTYPE or GEOMETRY coming up soon. (https://github.com/tglunde/dbt-exasol/pull/33/files# - in connections.py) Therefore we can close this issue here.