Closed by jtcohen6 1 year ago
Hey Jeremy, thanks for taking a pass at this! ngl, when I told Sung that I was capacity-constrained atm, I didn't think you were going to be the one to attempt this!
So you've outlined the macro problem here very well-- DuckDB did a very literal interpretation of the DB API spec and does not provide the more detailed type information that ~ every other major database does. The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.
There is something of a way around it though, although it's some work to do, and given some other stuff I have going on right now I didn't want to start in on it under a deadline of having it ready in a week. I originally ran into this problem (I needed a more detailed `description` field from the cursor) when I started working on Buena Vista, my Presto/Postgres Python Proxy Project. After starting out by doing the dumb thing (i.e., just looking at the actual values returned by the `fetch` calls), I hit upon a better idea: I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.
> I didn't think you were going to be the one to attempt this!
I woke up a bit too early this morning, caught the thread, and had some fun. Always good for me to duck around a bit & feel out what it's like to extend our new stuff.
> The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.
Right on - I'm glad we're on the same page. I didn't realize you'd already had a chance to look into this, given the related work over in BV. I won't be offended if we need to scrap a lot of / all of the code that's here.
> I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.
This looks cool!! where there's a Wills, there's a way
Ok - I'm happy to poke around your BV implementation, or let you take it from here - let me know
Yeah I will eventually get to it, it's really just a question of when, and it's not realistically going to be in the next couple of weeks.
I feel bad about this, as I know folks are starting to actually rely on me to do things with this project (which remember, started as a fun little toy project for me to learn about how databases worked!) in a semi-timely fashion. But for the moment, it's gonna have to do.
Heard - and totally agree with your call to ship minimal v1.5.0 compat next week, without model contracts! I still may poke around, for my own fun.
@Mause apologies for pulling you in here, please feel free to respond next week when you're back at work, but I was wondering if y'all had given any thought to revising the level of detailed type information that the `description` field returns on the `PyConnection`? I'm not sure if you saw the hack I came up with to get at this in Buena Vista using the Arrow schema returned on the `RecordBatchReader`, and I'm debating doing that same trick here in dbt-duckdb vs. maybe just going into the DuckDB python code and trying to come up with a fix there. :bow:
@jtcohen6 I tremendously appreciate the head start here, and the constraints stuff as-specified looks 💯 . If you don't mind, I'd be happy to hack on the stuff I spec'd out in the comments on your branch, or if you prefer I can grab the patch of this PR and take things in the direction I want them to go in on a net-new branch.
Mr Wills, you have the helm
having mixed feelings about the fact that my baby has grown to the point where it has a flaky integration test 😭
Thank you so much for all of the help on this @jtcohen6! Looking forward to getting this into folks' hands tomorrow or Friday!
Overview
There are two main components of "model contracts" in v1.5:

1. verifying the names and data types of a model's columns against the contracted expectations
2. `constraints`, included in the templated DDL during table creation

(2) is pretty trivial to implement, since DuckDB supports constraints with a standard syntax. What's very cool is that DuckDB actually enforces primary/unique key constraints, which most columnar databases / analytical platforms really don't. It also enforces row-level `check` constraints. So this could be a true swap-out for `not_null`, `unique`, and `accepted_values` tests.

(1) was a bit trickier. I reimplemented the `get_column_schema_from_query` method, in lieu of implementing `data_type_to_code`, because the `duckdb` Python cursor returns the data type as a string, rather than an integer code. The other unfortunate thing is that some of the type precision seems to be lost. That will still catch when users implicitly switch out, e.g., a `has_first_order` column from being `0`/`1` to being `true`/`false`. But it can't catch subtler distinctions like int/decimal, timestamp with/without tz, and the contents of lists/structs. That degree of precision is maintained by most other databases' connection cursors. Very open to ideas - I'm sure others know more about this than I do.

Example