duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0
895 stars 83 forks source link

Add support for model contracts in v1.5 #148

Closed jtcohen6 closed 1 year ago

jtcohen6 commented 1 year ago

Overview

There are two main components of "model contracts" in v1.5:

  1. A "pre-flight" check that dbt will perform, before creating a view or table, to validate that column names and data types for the query match the ones defined in yaml
  2. Support for platform-specific constraints, included in the templated DDL during table creation

(2) is pretty trivial to implement, since DuckDB supports constraints with a standard syntax. What's very cool is that DuckDB actually enforces primary/unique key constraints, which most columnar databases / analytical platforms really don't. It also enforces row-level check constraints. So this could be a true swap-out for not_null, unique, accepted_values tests.

(1) was a bit trickier. I reimplemented get_column_schema_from_query method, in lieu of implementing data_type_to_code, because the duckdb Python cursor returns the data type as a string, rather than an integer code. The other unfortunate thing is that some of the type precision seems to be lost:

> cursor.description
('my_int', 'NUMBER', None, None, None, None, None)
('my_decimal', 'NUMBER', None, None, None, None, None)
('my_bool', 'bool', None, None, None, None, None)
('my_timestamp', 'DATETIME', None, None, None, None, None)
('my_timestamp_tz', 'DATETIME', None, None, None, None, None)
('my_list_of_strings', 'list', None, None, None, None, None)
('my_list_of_numbers', 'list', None, None, None, None, None)
('my_struct', 'dict', None, None, None, None, None)

That will still catch when users implicitly switch out, e.g., a has_first_order column from being 0/1 to being true/false. But it can't catch subtler distinctions like int/decimal, timestamp with/without tz, and the contents of lists/structs. That degree of precision is maintained by most other database's connection cursors. Very open to ideas - I'm sure others know more about this than I do.

Example

select

    1 as my_int,
    2.0 as my_decimal,
    true as my_bool,
    '2013-11-03 00:00:00-07'::timestamp as my_timestamp,
    '2013-11-03 00:00:00-07'::timestamptz as my_timestamp_tz,
    ['a','b','c'] as my_list_of_strings,
    [1,2,3] as my_list_of_numbers,
    {'bar': 'baz', 'balance': 7.77, 'active': false} as my_struct
models:
  - name: my_model
    config:
      contract:
        enforced: true
    columns:
      - name: my_int
        data_type: int  # decimal also works here - cursor returns as 'NUMBER'
      - name: my_decimal
        data_type: decimal  # int also works here - cursor returns as 'NUMBER'
      - name: my_bool
        data_type: bool
      - name: my_timestamp
        data_type: timestamp  # timestamptz also works here - cursor returns as 'DATETIME'
      - name: my_timestamp_tz
        data_type: timestamptz  # timestamp/datetime also works here - cursor returns as 'DATETIME'
      - name: my_list_of_strings
        data_type: text[]     # int[] also works here - cursor returns as 'list'
      - name: my_list_of_numbers
        data_type: int[]    # text[] also works here - cursor returns as 'list'
      - name: my_struct
        data_type: struct(bar text, balance decimal, active boolean)  # struct(any text) also works here - cursor returns as 'dict'
jwills commented 1 year ago

Hey Jeremy, thanks for taking a pass at this! ngl that when I told Sung that I was capacity constrained atm I didn't think you were going to be the one to attempt this!

So you've outlined the macro problem here very well-- DuckDB did a very literal interpretation of the DB API spec and does not provide the more detailed type information that ~ every other major database does. The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.

There is something of a way around it though, although it's some work to do, and given some other stuff I have going on right now I didn't want to start in on it under a deadline of having it ready in a week. I originally ran into this problem (I needed a more detailed description field from the cursor) when I started working on Buena Vista, my Presto/Postgres Python Proxy Project. After starting out by doing the dumb thing (i.e., just looking at the actual values returned by the fetch calls), I hit upon a better idea: I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.

jtcohen6 commented 1 year ago

I didn't think you were going to be the one to attempt this!

I woke up a bit too early this morning, caught the thread, and had some fun. Always good for me to duck around a bit & feel out what it's like to extend our new stuff.

The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.

Right on - I'm glad we're on the same page. I didn't realize you'd already had a chance to look into this, given the related work over in BV. I won't be offended if we need to scrap a lot of / all of the code that's here.

I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.

This looks cool!! where there's a Wills, there's a way

Ok - I'm happy to poke around your BV implementation, or let you take it from here - let me know

jwills commented 1 year ago

Yeah I will eventually get to it, it's really just a question of when, and it's not realistically going to be in the next couple of weeks.

I feel bad about this, as I know folks are starting to actually rely on me to do things with this project (which remember, started as a fun little toy project for me to learn about how databases worked!) in a semi-timely fashion. But for the moment, it's gonna have to do.

jtcohen6 commented 1 year ago

Heard - and totally agree with your call to ship minimal v1.5.0 compat next week, without model contracts! I still may poke around, for my own fun.

jwills commented 1 year ago

@Mause apologies for pulling you in here, please feel free to respond next week when you're back at work, but I was wondering if y'all had given any thought to revising the level of detailed type information that the description field returns on the PyConnection? I'm not sure if you saw the hack I came up with to get at this in Buena Vista using the Arrow schema returned on the RecordBatchReader, and I'm debating doing that same trick here in dbt-duckdb vs. maybe just going into the DuckDB python code and trying to come up with a fix there. :bow:

jwills commented 1 year ago

@jtcohen6 I tremendously appreciate the head start here, and the constraints stuff as-specified looks 💯 . If you don't mind, I'd be happy to hack on the stuff I spec'd out in the comments on your branch, or if you prefer I can grab the patch of this PR and take things in the direction I want them to go in on a net-new branch.

jtcohen6 commented 1 year ago

Mr Wills, you have the helm

jwills commented 1 year ago

having mixed feelings about the fact that my baby has grown to the point where it has a flaky integration test 😭

jwills commented 1 year ago

Thank you so much for all of the help on this @jtcohen6! Looking forward to getting this in folks hands tomorrow or Friday!