dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-2792] [Feature] Coerce data types in dbt Model Contracts #8035

Closed: sungchun12 closed this issue 1 year ago

sungchun12 commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Problem: When defining model contracts, I want dbt to coerce data types for me so that I don't have to tediously write out cast statements.

As it stands, there is no elegant mechanism to do that other than writing the cast statements myself.

Proposed Solution: Add a config to enable/disable strict mode (current behavior) similar to this example: https://docs.pydantic.dev/latest/blog/pydantic-v2/#strict-mode

version: 2

models:
  - name: demo
    description: demo
    config:
      contract:
        enforced: true
        strict: false # additional config

    columns:
      - name: id
        description: This is a unique identifier for a customer
        data_type: text
        constraints:
          - type: unique
          - type: not_null
--original SQL
select 2 as id
union all
select 1 as id
--compiled SQL
  create table "neondb"."dbt_sung_dev"."demo__dbt_tmp" (
    id text unique not null
  );

  insert into "neondb"."dbt_sung_dev"."demo__dbt_tmp" (id)
  (
    select id
    from (
      select cast(2 as varchar) as id
      union all
      select cast(1 as varchar) as id
    ) as model_subq
  );
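As the compiled SQL shows, a lax mode would effectively wrap each contracted column in a cast. A minimal Python sketch of the proposed strict/lax toggle (the `enforce_contract` helper and its type mapping are hypothetical, not dbt internals):

```python
# Hypothetical sketch: how a contract check might behave with
# strict=True (current behavior: reject mismatches) versus
# strict=False (proposed: coerce, like wrapping the column in a cast).
def enforce_contract(values, data_type, strict=True):
    """Check each value against the contracted type; coerce when lax."""
    caster = {"text": str, "int": int}[data_type]
    out = []
    for value in values:
        if isinstance(value, caster):
            out.append(value)          # already the contracted type
        elif strict:
            raise TypeError(f"{value!r} does not match {data_type}")
        else:
            out.append(caster(value))  # lax mode: coerce the value
    return out

# strict=False coerces the integers from the model query to text,
# mirroring cast(2 as varchar) in the compiled SQL above.
coerced = enforce_contract([2, 1], "text", strict=False)  # ["2", "1"]
```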

Describe alternatives you've considered

Writing cast statements explicitly in my SQL.

Who will this benefit?

Model Contract users who want to be lazy like me when defining data types.

Are you interested in contributing this feature?

Not for now! That may change when I start my new job.

Anything else?

No response

jtcohen6 commented 1 year ago

@sungchun12 Thanks for opening the issue! :)

For the initial rollout of the "model governance" constructs in v1.5, I opted for more strict & opinionated behavior across the board. We can choose to relax the constraints (hah) as folks present legitimate use cases — make it more flexible & configurable — but we can't go in reverse. I want to maintain a reasonably high standard for any proposed "relaxation," knowing that such flexibility & configurability also moves more of the complexity onto end users.

In this case, I do think you should be writing your :: / cast statements within your model SQL!

One of my strongly held beliefs about dbt materializations is that there should be a clean separation between business/transformation logic (model contents), which returns a dataset, and the operational logic to actualize (create/replace/update/merge) that dataset as an object in the database. You should be able to preview (show) a model, see the returned dataset, and trust that it is exactly what you'd see when dbt actually builds that model.

Adding implicit type coercion leads to a delta between the schema of the preview result and the schema of the model as actually built. When that coercion is unsuccessful, it also makes those errors trickier to debug, because they'll only be encountered while materializing the model, not while previewing the compiled query.
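To make that delta concrete, here is a small sketch using Python's built-in sqlite3 module (the queries are illustrative stand-ins for a dbt preview and materialization, not dbt's actual behavior):

```python
import sqlite3

# Sketch of the "preview vs build" delta that implicit coercion creates.
conn = sqlite3.connect(":memory:")

# Preview: running the model SQL as written returns an integer.
preview = conn.execute("select 2 as id").fetchone()[0]

# Build: the materialization would wrap the column in a cast to satisfy
# the contract, so the stored value is text instead.
built = conn.execute("select cast(2 as text) as id").fetchone()[0]

# The previewed dataset and the built dataset no longer agree on type.
assert type(preview) is not type(built)
```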

I'm resistant to this for the same reason I'm resistant to defining a default value within a model contract, to take the place of any null values in the column — this should really be a coalesce in the model SQL! (I'm more sympathetic to cases where an incremental model's schema has evolved.)
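The coalesce alternative keeps the null-handling visible in the model SQL, so the preview and the built table agree. A runnable sketch via sqlite3 (table name and values are made up for illustration):

```python
import sqlite3

# Sketch: handle nulls explicitly with COALESCE in the model SQL,
# rather than a default value declared in the contract.
conn = sqlite3.connect(":memory:")
conn.execute("create table raw_customers (id integer, name text)")
conn.executemany(
    "insert into raw_customers values (?, ?)",
    [(1, "alice"), (2, None)],
)

# Because the fallback lives in the query itself, previewing this
# SQL shows exactly what would be materialized.
rows = conn.execute(
    "select id, coalesce(name, 'unknown') as name from raw_customers"
).fetchall()
```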

sungchun12 commented 1 year ago

@jtcohen6 Super thoughtful! I didn't think of the developer experience in how previewing a coerced model via yaml would lead to more unnecessary friction. I wouldn't be happy weaving in changes across two files to get the job done. Explicit is better than implicit. Thanks! Glad we talked through it!