dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.82k stars 1.62k forks source link

[CT-2609] [Feature] Allow defining a default backfill value for incremental models with constraints #7732

Open b-per opened 1 year ago

b-per commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

When adding a new column to an incremental model (set with append_new_columns) with a not null constraint, the new rows are checked to see if the new column is NULL but the table itself doesn't get its overall metadata/DDL updated (which is normal as there are now NULL values for previous rows).

This is good as data is checked as configured, but the table attributes in the Warehouse then doesn't reflect the fact the new column is verified to be not null.

The feature suggested would be to:

  1. allow setting a default value for historical rows in case a new column is added
  2. alter the table to signify that the column has a not null constraint

Describe alternatives you've considered

Alternative 1

It is currently possible to achieve the same outcome by manually:

  1. running update statements to update the historical rows with NULL values
  2. running some alter statement on the table to update the column as being not null

But this approach requires writing DDL directly when the suggested feature would handle this automatically.

Alternative 2

Another approach is to run a --full-refresh but this might not be possible for huge tables.

Who will this benefit?

People wanting to leverage the new constraints capabilities and handling large amount of data stored in incremental models.

Are you interested in contributing this feature?

If need be

Anything else?

No response

dbeatty10 commented 1 year ago

This is an intriguing idea @b-per ! 💡

Database / schema migrations

The manual steps you described are basically like database / schema migrations in Django, Rails, and other similar frameworks. Incremental models already do similar operations as those migrations (i.e., adding columns like you described), but don't do others (like replacing NULLs with some default value).

Things to solve for

The piece of the design that seems most impactful is how to specify the value to use for the new column(s). There's a couple awkward things to solve for:

  1. There can be more than one column that is applicable
  2. Specifying which column(s) need default values
  3. This is essentially a one-time backfill per column and then never needed again

Specifying default values for not null constraints

Idea 1

One option would be to introduce a default_values dictionary that maps column names to a SQL expression to be used as the default value instead of NULLs.

It could look similar this:

models/dim_customers.sql

{{
    config(
        materialized='incremental',
        unique_key='id',
        on_schema_change='append_new_columns',
        default_values={
          "some_string_column": "'foobar'",
          "some_timestamp_column": "cast('0000-01-01' as timestamp)",
        }
    )
}}

Idea 2

An alternative could be to have the default value added to the configuration of the constraint itself, like this:

models/_models.yml

models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: some_string_column
        data_type: varchar
        constraints:
          - type: not_null
            - default_value: "foobar"
      - name: some_timestamp_column
        data_type: timestamp
        constraints:
          - type: not_null
            - default_value: "cast('0000-01-01' as timestamp)"

Idea 3

Full-blown support for general database migrations. Such a feature would only be relevant to models materialized as incremental because ephemeral, table, and view don't need migrations.

Summary

I prefer idea 2 over idea 1 since there's already a full listing of the column names when a contract is enforced. It solves for the first two things listed in the "things to solve for" section. But neither idea solves for the one-time nature of these migrations. To solve for that one, we'd need idea 3. That typically involves storing state, which we've historically shied away from.

What do you think of those first two options versus each other @b-per? Did you have something different in mind?

b-per commented 1 year ago

Idea 1 and Idea 2 are tackling slightly different use cases

  1. allow it for all incrementals
  2. allow it for incrementals with constraints/contracts (second question, what about people who don't want an auto-backfill?)

Knowing that the original issue came more from a constraints use case I would be edging towards Idea 2, supporting it when contraints are defined.

Couple of additional ideas about it:

dbeatty10 commented 1 year ago

@b-per good insight about Idea 1 and Idea 2 tackling slightly different use cases. I agree with your assessment and edging towards Idea 2!

wouldn't it make more sense to call it backfill_value instead of default_value?

Calling it default_value (or just default) was inspired by databases that support specifying a default within DDL like Postgres and SQL Server.

if default_value/backfill_value is not set, continue like today, don't backfill, test the new rows for the constraint but don't set the constraint on the incremental table

I'm concerned that it doesn't actually add the constraint. It feels like it should be all or nothing. Otherwise, it's really more like a pre-insert/merge data test on a batch than a true constraint. Allowing specification of a not_null constraint without actually apply it is non-intuitive to the users. To reduce confusion requires more documentation, more log output from the system, etc.

What do you think?

Discussing the details here makes me wonder if this type of backfill activity is necessarily difficult. i.e., the Alternative 1 that you mentioned originally is definitely overhead for the end user, but it's clear to them what they are doing and how the system is behaving.

b-per commented 1 year ago

I think that default and backfill_value are different concepts here:

if default_value/backfill_value is not set, continue like today, don't backfill, test the new rows for the constraint but don't set the constraint on the incremental table

I'm concerned that it doesn't actually add the constraint. It feels like it should be all or nothing. Otherwise, it's really more like a pre-insert/merge data test on a batch than a true constraint. Allowing specification of a not_null constraint without actually apply it is non-intuitive to the users. To reduce confusion requires more documentation, more log output from the system, etc.

It might be confusing but it is the current behaviour. Would we be ok changing the current behaviour and making default_value/backfill_value mandatory for incremental models with append_new_columns? Would this be considered a breaking change? (technically nothing will break until new columns are added, but if they are added without the default_value/backfill_value it will then break.

Discussing the details here makes me wonder if this type of backfill activity is necessarily difficult. i.e., the Alternative 1 that you mentioned originally is definitely overhead for the end user, but it's clear to them what they are doing and how the system is behaving.

I think that not having it managed in dbt and letting people do it by hand is difficult. dev is easy, but what about CI/CD, what about Prod? Having it in dbt makes it part of your logic and makes it go through all the steps of moving a change from dev to prod.

github-actions[bot] commented 10 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 10 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

b-per commented 9 months ago

I think that this is still worth considering, hence why I reopen it but feel free to close it for good!