dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

RPC server sometimes returns the wrong type #2984

Closed · aaronraff closed this issue 3 years ago

aaronraff commented 3 years ago

Describe the bug

When using the RPC server, some values are returned with the wrong type. It looks like strings are somehow being cast to floats.

Steps To Reproduce

  1. Start up the RPC server (dbt rpc)
  2. Make a POST request with the following body:

    {
        "jsonrpc": "2.0",
        "method": "run_sql",
        "id": "test3",
        "params": {
            "timeout": 60,
            "name": "test query",
            "sql": "d2l0aCBiYXNlIGFzICgKCiAgc2VsZWN0CgogICAgJ3Rlc3QnOjp2YXJjaGFyIGFzIHRlc3RfdmFyY2hhciwKCiAgICAyNTo6dmFyY2hhciBhcyB0ZXN0X2ludF92YXJjaGFyLAoKICAgICctMjUnOjp2YXJjaGFyIGFzIHRoaXNfd29ya3MKCikKCnNlbGVjdCAKCiAgKiwKCiAgTFBBRCh0ZXN0X3ZhcmNoYXIsIDEwLCAnMCcpIGFzIHBhZF92YXJjaGFyLAoKICBMUEFEKHRlc3RfaW50X3ZhcmNoYXIsIDEwLCAnMCcpIGFzIHBhZF9pbnRfdmFyY2hhciwKCiAgTFBBRCh0aGlzX3dvcmtzLCAxMCwgJzAnKSBhcyBwYWRfdGhpc193b3JrcwoKZnJvbSBiYXNlCg=="
        }
    }

The SQL here is the base64-encoded version of:

    with base as (

      select

        'test'::varchar as test_varchar,

        25::varchar as test_int_varchar,

        '-25'::varchar as this_works

    )

    select

      *,

      LPAD(test_varchar, 10, '0') as pad_varchar,

      LPAD(test_int_varchar, 10, '0') as pad_int_varchar,

      LPAD(this_works, 10, '0') as pad_this_works

    from base
  3. Copy the request_token from the response

  4. Make another request with the following body:

    {
        "jsonrpc": "2.0",
        "method": "poll",
        "id": "test4",
        "params": {
            "request_token": <the-copied-request-token>,
            "logs": true,
            "logs_start": 0
        }
    }
  5. Inspect the rows that are returned. You should see something like this:

    "rows": [
        [
            "test",
            25.0,
            -25.0,
            "000000test",
            25.0,
            "0000000-25"
        ]
    ]
  6. If you run the same query in psql, you will get the following output:

     test_varchar | test_int_varchar | this_works | pad_varchar | pad_int_varchar | pad_this_works
    --------------+------------------+------------+-------------+-----------------+----------------
     test         | 25               | -25        | 000000test  | 0000000025      | 0000000-25
  7. Notice how pad_int_varchar is padded, since it is represented as a varchar and not an integer
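
To make the repro easier to run, here is a minimal Python sketch of steps 2 through 5. It assumes the requests package and dbt's default RPC address (localhost:8580, path /jsonrpc); the response shapes follow the descriptions above.

    import base64
    import requests

    # Assumes `dbt rpc` is listening on its default host and port.
    RPC_URL = "http://localhost:8580/jsonrpc"

    SQL = """
    with base as (
      select
        'test'::varchar as test_varchar,
        25::varchar as test_int_varchar,
        '-25'::varchar as this_works
    )
    select
      *,
      LPAD(test_varchar, 10, '0') as pad_varchar,
      LPAD(test_int_varchar, 10, '0') as pad_int_varchar,
      LPAD(this_works, 10, '0') as pad_this_works
    from base
    """

    # Step 2: submit the query; run_sql expects base64-encoded SQL.
    submit = requests.post(RPC_URL, json={
        "jsonrpc": "2.0",
        "method": "run_sql",
        "id": "test3",
        "params": {
            "timeout": 60,
            "name": "test query",
            "sql": base64.b64encode(SQL.encode()).decode(),
        },
    }).json()

    # Steps 3 and 4: poll with the request_token from the first response.
    poll = requests.post(RPC_URL, json={
        "jsonrpc": "2.0",
        "method": "poll",
        "id": "test4",
        "params": {
            "request_token": submit["result"]["request_token"],
            "logs": True,
            "logs_start": 0,
        },
    }).json()

    # Step 5: inspect the rows in the poll response.
    print(poll)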

Expected behavior

I would expect all of the values returned to be strings (since they were cast), not floats. As seen in step 5, the second-to-last value is not padded because it is being represented as a float.

Screenshots and log output

I don't have any other relevant logs or output, but I'm happy to add more of the RPC response to this issue if that is helpful!

System information

Which database are you using dbt with?

Postgres

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - postgres: 0.18.1
  - spark: 0.18.1.1

The operating system you're using:

macOS Catalina Version 10.15.7

The output of python --version:

Python 3.7.7

Additional context

N/A

jtcohen6 commented 3 years ago

Thanks for this reproduction case @aaronraff! I have an even simpler one:

select '0000000025'::varchar as my_varchar_column

base64 encoded: c2VsZWN0ICcwMDAwMDAwMDI1Jzo6dmFyY2hhciBhcyBteV92YXJjaGFyX2NvbHVtbg==

"table": {
    "column_names": [
        "my_varchar_column"
    ],
    "rows": [
        [
            25.0
        ]
    ]
}

I think the issue here is with how the RPC server handles data types implicitly, rather than storing them explicitly alongside JSON results.

jtcohen6 commented 3 years ago

This is definitely an issue with agate. I dived a bit deeper into the example above. The method that switches this from a padded string to an integer is table_from_data_flat, called by execute via get_result_from_cursor:

https://github.com/fishtown-analytics/dbt/blob/1e5a7878e5188fa58f91e2f257041f28fdb948fc/core/dbt/adapters/sql/connections.py#L112-L125

Here's the simple reproduction case:

>>> import dbt.clients.agate_helper
>>> data = [{'my_varchar_column': '0000000025'}]
>>> column_names = ['my_varchar_column']
>>> agate_tbl = dbt.clients.agate_helper.table_from_data_flat(
...             data,
...             column_names
...         )
>>> agate_tbl.print_table()
| my_varchar_column |
| ----------------- |
|                25 |
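
For what it's worth, the same inference reproduces with agate alone, no dbt involved. A minimal sketch (assuming only the agate package), which also shows that forcing agate.Text preserves the padding:

    import agate

    rows = [('0000000025',)]
    column_names = ['my_varchar_column']

    # With column_types unset, agate runs a TypeTester, infers Number,
    # and silently drops the leading zeros.
    inferred = agate.Table(rows, column_names)
    inferred.print_table()

    # Forcing agate.Text keeps the value as the padded string.
    forced = agate.Table(rows, column_names, column_types=[agate.Text()])
    forced.print_table()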

Agate does a lot of type inference under the hood. We enable user-supplied column type overrides for seeds, but I don't think that makes a lot of sense for one-off RPC queries. Really, we should be getting the data type from the database, though that may mean handling some low-level differences across cursors. Here's what cursor.description looks like for Postgres + Redshift:

(Column(name='my_varchar_column', type_code=1043),)

Versus Snowflake:

[('MY_VARCHAR_COLUMN', 2, None, 16777216, None, None, False)]

Other databases, e.g. BigQuery, reimplement adapter.execute and use other methods to convert fetched results to an agate table, so the intervention needed may vary.
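
One possible shape for that intervention on Postgres/Redshift, sketched purely for illustration (the table_from_cursor helper and the OID set are assumptions, not dbt's actual code): read type_code from cursor.description and force agate.Text for textual columns, letting agate's TypeTester infer everything else.

    import agate

    # Postgres type OIDs for textual columns (25 = text, 1042 = bpchar,
    # 1043 = varchar, matching the cursor.description shown above).
    TEXT_OIDS = {25, 1042, 1043}

    def table_from_cursor(cursor):
        """Hypothetical helper: build an agate table, trusting the
        database's declared types for text columns instead of guessing."""
        column_names = [col.name for col in cursor.description]
        force = {
            col.name: agate.Text()
            for col in cursor.description
            if col.type_code in TEXT_OIDS
        }
        tester = agate.TypeTester(force=force)
        return agate.Table(cursor.fetchall(), column_names, column_types=tester)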