dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-1918] [Discussion] Unify constraints and constraints_check configs #6750

Closed MichelleArk closed 1 year ago

MichelleArk commented 1 year ago

In the foundational dbt constraints work, constraints: List[str] and constraints_check: str are optional column configs that configure column-level constraints and row-level checks, respectively.

Let's unify these two configs into a single optional constraints (List[Union[str, Dict[str, str]]]) that encompasses both sets of functionality. Relevant discussion: https://github.com/dbt-labs/dbt-core/pull/6271#discussion_r1084033137

Before:

columns:
  - name: price
    data_type: numeric
    constraints:
      - not null
    constraints_check: price > 0

After:

columns:
  - name: price
    data_type: numeric
    constraints:
      - not null
      - check: price > 0
        name: positive_price  # support database-specific check options (Postgres supports naming constraints)
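
For illustration only, here is a minimal sketch of how a mixed list like the one above might be normalized during parsing; the normalize_constraint helper and the output dict shape are hypothetical, not part of this proposal:

from typing import Dict, List, Union

def normalize_constraint(raw: Union[str, Dict[str, str]]) -> Dict[str, str]:
    """Normalize one entry of the unified 'constraints' list into a dict.

    Bare strings like "not null" become {"type": "not null"}; dicts such as
    {"check": "price > 0", "name": "positive_price"} get the check expression
    pulled out explicitly.
    """
    if isinstance(raw, str):
        return {"type": raw}
    if "check" in raw:
        normalized = {"type": "check", "expression": raw["check"]}
        if "name" in raw:
            normalized["name"] = raw["name"]
        return normalized
    return dict(raw)

constraints: List[Union[str, Dict[str, str]]] = [
    "not null",
    {"check": "price > 0", "name": "positive_price"},
]
print([normalize_constraint(c) for c in constraints])
# [{'type': 'not null'}, {'type': 'check', 'expression': 'price > 0', 'name': 'positive_price'}]
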
jtcohen6 commented 1 year ago

I definitely think this will be a step in the right direction, by unifying into a single constraints attribute.

I'm not 100% convinced that List[Union[str, Dict[str, str]]] is the perfect data structure either, though it might be close.

MichelleArk commented 1 year ago

Instead of just accepting strings like "not null," would we want this to be not_null: true|false? Would that make it easier or harder for us to compare constraints across project states, in order to detect breaking changes to contracts in CI and require a version bump?

I'm all for standardizing on a separate not_null: true|false (or nullable: true|false) config - that would definitely make it more straightforward to detect breaking changes.

jtcohen6 commented 1 year ago

I'm all for standardizing on a separate not_null: true|false (or nullable: true|false) config - that would definitely make it more straightforward to detect breaking changes.

It would also standardize across data platforms, which call this either not null (most) or nullable (BigQuery). (Update: BigQuery calls this nullable in its API, but within the create table statement, it's just not null.)

A related thought here, prompted by https://github.com/dbt-labs/dbt-snowflake/pull/341#discussion_r1057422851. We should create methods/mechanisms for each adapter to say:

columns:
  - name: id
    data_type: numeric
    constraints:

      # on all data platforms (even ones that call this 'nullable')
      - not_null: true

      # only allowed for data platforms that actually enforce 'check' constraints
      - check: "id > 0"

      # everything else, for metadata purposes (and sometimes query optimization ... oh boy)
      - unenforced: primary key
      - unenforced: foreign key

Note that BigQuery will actually raise an explicit error if you try to set constraints that aren't enforced! Personally, I'm supportive of this, but it's another platform-specific behavior to be aware of:

create table dbt_jcohen.my_cool_table (id int primary key) as (select 1 as id);
Enforcement of primary keys is not supported
emmyoop commented 1 year ago

This will require follow-up on all the adapters to determine what needs to be changed there.

jtcohen6 commented 1 year ago

Summarizing from what I remember of yesterday's discussion:

Things we agree on:

Questions:

  1. Should the constraint "type" be inferred from the presence of a key (check, not_null, unenforced), or should this be an explicit key-value pair (type: check, type: not_null, type: <anything else>)?
    • not_null: true|false, or just not_null as a string (like the generic test)? When would you ever add not_null: false?
  2. Should not_null be its own top-level boolean attribute of the column, rather than nested under constraints?
    • Pro: DB-API2 / PEP 249 has null_ok as an integral column attribute, alongside name and data_type.
    • Pro: not_null is only column-level. Unlike check constraints, or custom unenforced constraints, it doesn't ever make sense to define at the model level. That could be one reason to define & treat it separately from the more generic constraints data structure. (E.g. On Databricks, not null constraints are applied column-level as alter table alter column ... set not null, versus check constraints which are applied table-level as alter table add constraint ...)
    • Con: I worry that this obfuscates how not_null is actually being implemented behind-the-scenes. It's still operating as a constraint, with all the other gotchas & caveats of other constraint types — e.g. not supported on views, only supported as a "post-flight" check on Databricks

@MichelleArk Let's aim to lock down answers to these questions by next week, so that we can make these changes sooner rather than later: ideally by b2 (March 1), otherwise by b3 (March 15).

@sungchun12 @dave-connors-3 @Victoriapm @b-per Happy to hear any opinions / instincts / leanings you might have!

MichelleArk commented 1 year ago
  1. Should not_null be its own top-level boolean attribute of the column, rather than nested under constraints?
    • Con: I worry that this obfuscates how not_null is actually being implemented behind-the-scenes. It's still operating as a constraint, with all the other gotchas & caveats of other constraint types — e.g. not supported on views, only supported as a "post-flight" check on Databricks

Is this already true for data_type as a column-level attribute? Not necessarily a reason to double down, but just pointing out that there is some precedent here, and that not_null at the column level would be consistent with data_type in terms of functional behaviour.

jtcohen6 commented 1 year ago

Yes, data_type is already a column-level attribute! I agree that's a compelling reason to keep the two together.

If we can make #6751 happen, the way we'll verify the "contract" for data types (as a pre-flight check) will differ from the way we verify not_null (passed into the data platform, for enforcement during create / insert / merge / etc.). The data type enforcement won't actually be via a "constraint" in the sense of the database capability named "constraints", whereas the not_null enforcement would very much be a "constraint." We shouldn't make too big a deal of that; data platform vendors not infrequently use the same familiar words in subtly different ways.

sungchun12 commented 1 year ago
  • When detecting breaking changes to model contracts, we ignore all unenforced constraints
columns:
  - name: id
    data_type: numeric
    constraints:

      # on all data platforms (even ones that call this 'nullable')
      - not_null: true

      # only allowed for data platforms that actually enforce 'check' constraints
      - check: "id > 0"

      # everything else, for metadata purposes (and sometimes query optimization ... oh boy)
      - unenforced: primary key
      - unenforced: foreign key

Note that BigQuery will actually raise an explicit error if you try to set constraints that aren't enforced! Personally, I'm supportive of this, but it's another platform-specific behavior to be aware of:

create table dbt_jcohen.my_cool_table (id int primary key) as (select 1 as id);
Enforcement of primary keys is not supported

Postgres enforces primary and foreign keys. Would we then implement formal config constraints like: primary_key: true and foreign_key: true?

Also, I'm a big fan of housing all the constraints config under a single config vs. decoupling constraints and checks.

jtcohen6 commented 1 year ago

@sungchun12 Really good point that Postgres actually does enforce primary key and foreign key constraints! There aren't any other analytical data platforms that enforce those constraints, and I don't think we should design this feature primarily for Postgres. That said, we could imagine a future where those data platforms do support all four of the ANSI SQL standard constraints—unique, primary key, foreign key, not null—plus custom row-level check constraints as well.

So we do need some way for each adapter to tell us which constraints it's actually able to enforce at build time. For Postgres, that's all of them. For Redshift, Snowflake, and BigQuery, it's just not_null. For Databricks, per the current implementation in dbt-spark / dbt-databricks, it's actually none of them—even though not null and check constraints are enforced, they aren't enforceable until after atomically materializing a table—but we're talking to their team about ways they/we could make this better.

Then, we can use that adapter-level check to tell us:

  • If this constraint is actually going to do anything. We should warn the user about constraints that aren't both supported and enforced. (We should also give them a way to silence those warnings.)
  • Whether this constraint type can be considered part of the model's contract (set of build-time guarantees), and whether removing it qualifies as a breaking change to that contract (https://github.com/dbt-labs/dbt-core/issues/6869)

Decision: not_null is special

Update: This decision has been reversed, after further discussion below. I'm leaving the rest of this comment as-is for posterity. That includes my original reasoning here, and the implications for the spec further down. We'll be creating a set of new issues to follow up on this discussion.

Why?

  • It's supported & enforced by all our data platforms.
  • "A not-null constraint is always written as a column constraint." (Postgres docs)
  • It's one of the integral attributes of columns, per the DB-API2 spec.

So I think we should "promote" not_null to be a column-level attribute, outside the constraints config. In the meantime, we will present & document that constraints varies greatly by data platform — caveat usor, passibus vestra variari.

Data structure

Constraint is a new object type. Pseudo code:

from dataclasses import dataclass, field
from enum import StrEnum  # Python 3.11+; earlier versions can use (str, Enum)
from typing import List, Optional


class ConstraintOption(StrEnum):
    check = "check"
    unique = "unique"
    primary_key = "primary_key"
    foreign_key = "foreign_key"
    custom = "custom"  # I have no idea -- but let's leave it open for the wacky & wild data platforms out there


@dataclass
class ColumnLevelConstraint:
    type: ConstraintOption
    name: Optional[str] = None
    # raise a warning if this constraint is not enforced by this data platform (but it will still be included in templated DDL)
    warn_unenforced: bool = True
    # raise a warning if this constraint is not supported by this data platform (and it will be excluded from templated DDL)
    warn_unsupported: bool = True
    # free-for-all; required if the constraint is 'check' or 'foreign_key' -- can we validate via post-init?
    expression: Optional[str] = None
    # additional properties allowed: yes! there are going to be lots of constraint-specific and platform-specific shenanigans


# the same, with an added 'columns' attribute (which isn't needed when the constraint is defined for one column)
@dataclass
class ModelLevelConstraint(ColumnLevelConstraint):
    columns: List[str] = field(default_factory=list)

The corresponding YAML spec:

models:
  constraints: Optional[List[ModelLevelConstraint]]
  columns:
    - name: str
      data_type: Optional[str]
      not_null: Optional[bool]
      constraints: Optional[List[ColumnLevelConstraint]]
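
On the "can we validate via post-init?" question above: one option is a small validation hook like the following sketch, which assumes the ColumnLevelConstraint shape from the pseudo code (the function name and error wording are hypothetical):

def validate_constraint_expression(constraint) -> None:
    """Could be called from ColumnLevelConstraint.__post_init__ (or a later dbt validation step)."""
    # 'check' and 'foreign_key' constraints are meaningless without an expression
    if str(constraint.type) in ("check", "foreign_key") and not constraint.expression:
        raise ValueError(f"A '{constraint.type}' constraint requires an 'expression'")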

Notes:

  • constraints is an optional attribute at both the model level and the column level
  • If constraints is set at the model level, it should be a model-level attribute, not a "config." This means it isn't settable in {{ config() }} or in dbt_project.yml, and that's okay.
  • warn_unenforced & warn_unsupported could alternatively be Optional[bool], and flipped/renamed to ignore_warning_unenforced & ignore_warning_unsupported, so that the data is only included when it's a non-default value
  • not_null is a boolean column-level attribute. It is NOT model-level, and it is NOT included in the constraints data structure!
  • That's about as smart as I feel like being for now. (And even this might still be too smart.) I don't care to do any detection of, e.g., multiple unique constraints being defined for a single column, or primary_key constraints being defined for multiple columns in a single model. All of that will be caught at runtime, and raise an error in the data platform. I want the minimum viable data structure that enables us to detect breaking changes to model contracts.

New adapter method

We need a new adapter method that can tell dbt whether a constraint is:

  1. supported & enforced
  2. supported but not enforced
  3. not supported

Here's what that might look like for dbt-snowflake, in pseudo code:

@classmethod
def constraint_support(cls):
    return {
        "not_null": Constraint.ENFORCED,
        "primary_key": Constraint.NOT_ENFORCED,
        "foreign_key": Constraint.NOT_ENFORCED,
        "unique": Constraint.NOT_ENFORCED,
        "check": Constraint.NOT_SUPPORTED,
    }

As noted at the top, this method would be called in two places at runtime:

  1. During node selection, to determine which constraint diffs qualify as breaking changes to the model's contract (https://github.com/dbt-labs/dbt-core/issues/6869)
  2. During materialization, to raise warnings for unenforced/unsupported constraints (unless the user has disabled those warnings)

The assumption is that not_null is supported & enforced for all data platforms. If there's a data platform that doesn't enforce/support it, that platform's adapter will need to own the additional logic for handling users who set not_null: true on columns.
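
As a rough sketch of how (2) might use that method at materialization time, assuming the ColumnLevelConstraint shape above and treating the constraint type as a plain string (the function name and warning wording are hypothetical; the real logic would live in dbt-core / adapter code):

from enum import Enum
from typing import Any, Dict, List
import warnings

class Constraint(Enum):  # the three support levels, mirroring the pseudo code above
    ENFORCED = "enforced"
    NOT_ENFORCED = "not_enforced"
    NOT_SUPPORTED = "not_supported"

def filter_and_warn(constraints: List[Any], support: Dict[str, Constraint]) -> List[Any]:
    """Decide which constraints make it into templated DDL, warning where appropriate."""
    included = []
    for c in constraints:
        level = support.get(c.type, Constraint.NOT_SUPPORTED)
        if level is Constraint.NOT_SUPPORTED:
            if c.warn_unsupported:
                warnings.warn(f"'{c.type}' constraints are not supported on this platform; excluding from DDL")
            continue  # excluded from templated DDL
        if level is Constraint.NOT_ENFORCED and c.warn_unenforced:
            warnings.warn(f"'{c.type}' constraints are not enforced on this platform; including in DDL anyway")
        included.append(c)
    return included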

Examples

I'm using dbt-postgres. Everything is supported & enforced. The removal or modification* of any existing constraint should be considered a breaking change. This includes check constraints. (Should we be nice and say, wellllll if the check constraint has the same name, even if the expression has changed, it's probably the same?)

models:
  - name: my_model
    columns:
      - name: id
        data_type: int
        not_null: true
        constraints:
          - type: primary_key
          - type: check
            expression: ">0"
            name: id_never_negative
      ...
    constraints:
      - name: my_compound_foreign_key
        type: foreign_key
        columns: [customer_id, customer_region_id]
        # we could be cleverer about the structure here, but I'm hesitant -- want to find a stopping point
        expression: "customers_by_region (id, region_id)" # can also include things like 'ON DELETE SET NULL ...'

dbt will produce SQL like:

create table my_schema.my_model (
    id int not null primary key constraint id_never_negative check (id > 0),
    ...
    constraint my_compound_foreign_key foreign key (customer_id, customer_region_id) references customers_by_region (id, region_id)
);
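
In practice dbt templates this DDL through adapter macros; purely to illustrate the mapping from the spec above to those clauses, here's a rough Python sketch (the helper name and dict shape are hypothetical):

from typing import Dict, Optional

def render_column_constraint(c: Dict[str, str]) -> Optional[str]:
    """Map one column-level constraint entry to an inline, Postgres-flavored DDL clause."""
    prefix = f"constraint {c['name']} " if c.get("name") else ""
    ctype = c["type"]
    if ctype == "primary_key":
        return prefix + "primary key"
    if ctype == "unique":
        return prefix + "unique"
    if ctype == "check":
        return f"{prefix}check ({c['expression']})"
    if ctype == "foreign_key":
        return f"{prefix}references {c['expression']}"
    return None  # custom / platform-specific constraints handled elsewhere

id_constraints = [
    {"type": "primary_key"},
    {"type": "check", "expression": "id > 0", "name": "id_never_negative"},
]
clauses = " ".join(filter(None, (render_column_constraint(c) for c in id_constraints)))
print(f"id int not null {clauses}")
# id int not null primary key constraint id_never_negative check (id > 0)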

I'm using dbt-snowflake. I want to add an unenforced primary_key constraint to my id column. I'm well aware it doesn't do anything—I'm putting it there for integration with a data catalog tool—and I just want to stop seeing the warning. If I remove it later on, it won't be considered a breaking change, because it's not really part of the model contract. However, if I change not_null: true, that would be considered a breaking change—it's an integral part of the contract, enforced at build time.

columns:
  - name: id
    data_type: integer
    not_null: true
    constraints:
      - type: primary_key
        warn_unenforced: false  # make the warning disappear

dbt will produce SQL like:

create table my_schema.my_model (
    id int not null primary key,
    ...
);
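
To make the breaking-change rule in these two examples concrete, here's a rough sketch of the state comparison, using plain dicts and hypothetical helper names; the real check would compare manifests as part of state:modified (https://github.com/dbt-labs/dbt-core/issues/6869), and for this sketch not_null is folded into the constraints list even though the spec above keeps it as a column attribute:

from typing import Dict, List, Set, Tuple

def contracted_constraints(constraints: List[dict], support: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Keep only constraints the adapter enforces -- those are the ones in the model contract."""
    return {
        (c["type"], c.get("expression") or "")
        for c in constraints
        if support.get(c["type"]) == "enforced"
    }

def is_breaking_change(old: List[dict], new: List[dict], support: Dict[str, str]) -> bool:
    # Relaxing the contract: any enforced constraint that existed before but is gone now
    return not contracted_constraints(old, support) <= contracted_constraints(new, support)

snowflake_support = {"not_null": "enforced", "primary_key": "not_enforced"}
old = [{"type": "not_null"}, {"type": "primary_key"}]
print(is_breaking_change(old, [{"type": "not_null"}], snowflake_support))     # False: primary_key wasn't enforced
print(is_breaking_change(old, [{"type": "primary_key"}], snowflake_support))  # True: not_null was part of the contract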

Questions?

Drop a comment here, or let's discuss in Slack! I want to get us unblocked for estimation & implementation here ASAP. Given the amount I ended up writing in this comment, it might make sense to split up this work into multiple tickets.

sungchun12 commented 1 year ago

@jtcohen6 Replying here for traceability, and let's get down and dirty in slack!

So we do need some way for each adapter to tell us which constraints it's actually able to enforce at build time. For Postgres, that's all of them. For Redshift, Snowflake, and BigQuery, it's just not_null. For Databricks, per the current implementation in dbt-spark / dbt-databricks, it's actually none of them—even though not null and check constraints are enforced, they aren't enforceable until after atomically materializing a table—but we're talking to their team about ways they/we could make this better.

Then, we can use that adapter-level check to tell us:

  • If this constraint is actually going to do anything. We should warn the user about constraints that aren't both supported and enforced. (We should also give them a way to silence those warnings.)
  • Whether this constraint type can be considered part of the model's contract (set of build-time guarantees), and whether removing it qualifies as a breaking change to that contract (https://github.com/dbt-labs/dbt-core/issues/6869)

I agree with telling the user which constraints are enforceable at build time. Probably a terminal logging message that lets the user know clearly what's NOT being enforced and points to the specific config shouting this warning vs. a generic, "Hey you're using primary key somewhere in the mix and that's a no no. Good luck finding it". I also like the extra flavor of whether it qualifies as a breaking change to the contract. That opinionated flavor is just the right touch to make this a healthy guardrail during development.

Decision: not_null is special

Why?

  • It's supported & enforced by all our data platforms.
  • "A not-null constraint is always written as a column constraint." (Postgres docs)
  • It's one of the integral attributes of columns, per the DB-API2 spec.

So I think we should "promote" not_null to be a column-level attribute, outside the constraints config. In the meantime, we will present & document that constraints varies greatly by data platform — caveat usor, passibus vestra variari.

I understand the implementation reasoning, but the developer reasoning conflicts with this logic in practice. If I'm a developer building constraints for the first time OR adjusting constraints in an existing contract, I'll be wondering why not_null is outside of my constraints config hierarchy. I'll think to myself, "Is not_null not a constraint or a test, wait can I do that with unique, can I do that with anything custom?" And let's say I want to build functionality and call the manifest to run dbt commands based on constraints configs, how do I ensure not_null is flowing through properly in my contract without second guessing where the config is placed?

Notes:

constraints is an optional attribute at both the model level and the column level

Agreed

If constraints is set at the model level, it should be a model-level attribute, not a "config." This means it isn't settable in {{ config() }} or in dbt_project.yml, and that's okay.

Agreed AND I'd like to see +contracts: true settable in dbt_project.yml that points the developer to define constraints at the model level

warn_unenforced & warn_unsupported could alternatively be Optional[bool], and flipped/renamed to ignore_warning_unenforced & ignore_warning_unsupported, so that the data is only included when it's a non-default value

Agreed on functionality, I recommend warn_unenforced & warn_unsupported as it's easier to reason about the bool logic at a glance

not_null is a boolean column-level attribute. It is NOT model-level, and it is NOT included in the constraints data structure!

Agreed on boolean column-level attribute and NOT model-level. Do NOT agree on constraints config hierarchy.

That's about as smart as I feel like being for now. (And even this might still be too smart.) I don't care to do any detection of, e.g., multiple unique constraints being defined for a single column, or primary_key constraints being defined for multiple columns in a single model. All of that will be caught at runtime, and raise an error in the data platform. I want the minimum viable data structure that enables us to detect breaking changes to model contracts.

Agreed, we don't want to provide so many guardrails that we create a prison and not the fun sandbox it should be. Thankfully, the databases provide helpful tips when constraints are wrong based on ad hoc testing.

We need a new adapter method that can tell dbt whether a constraint is:

  1. supported & enforced
  2. supported but not enforced
  3. not supported

Agreed on these primitives!

I'm using dbt-postgres. Everything is supported & enforced. The removal or modification* of any existing constraint should be considered a breaking change. This includes check constraints. (Should we be nice and say, wellllll if the check constraint has the same name, even if the expression has changed, it's probably the same?)

I'd say BREAKING change because the substance of the contract has changed because the expression changed. I'd be okay with it NOT being a breaking change if ONLY the name of the check is different but the expression remains the same.

jtcohen6 commented 1 year ago

@sungchun12 Love this feedback! The only place where you + I disagree is about whether not_null should be "special." To summarize the arguments for/against:

@MichelleArk @dbeatty10 Very open to hearing your thoughts / votes one way or the other!

Given that this is the only outstanding question, I think we can move this issue out of "refinement" and into "estimation." We may want to update the original issue description with the final proposed spec, or open as a new "clean" issue.

b-per commented 1 year ago

I would be more in favor of "not special".

If more warehouses start to enforce additional constraints, e.g. unique, people might start wondering why not_null and unique are configured differently when they actually are both enforced by the warehouse.

jtcohen6 commented 1 year ago

@b-per Thanks for weighing in!!

Let me flip that around: If that happens — or shall we say "when" :) — what would you think about promoting unique and primary_key as first-class column-level attributes? Alongside not_null, data_type, name. These are all things that are guaranteed to be true about the column.

In the current proposal, constraints is intentionally a more-flexible "grab bag": these are things that might be true, according to you, with support that varies across platforms.

b-per commented 1 year ago

With this approach, are we saying that when a warehouse starts supporting primary_key we will need to change dbt-core (and/or dbt-warehouse) to get it supported in dbt? Or are you saying that we would make those available from the get-go, even if they don't work today? What would not be a great experience is having to wait for the next dbt minor version to get support added once it is done on the warehouse side.

On the first-class attribute list, would we also consider foreign_key?

jtcohen6 commented 1 year ago

@b-per I'm not aware of any columnar / analytical data platforms that can enforce constraints for uniqueness, primary keys, and foreign keys. (I think they can be sorta supported in columnar extensions/forks of transactional databases, with some gotchas — e.g. Citus docs — but just as often not, e.g. MariaDB column store.)

So I think I'd be open to adding these as first-class attributes once there's been a precedent set by at least one (ideally two) of the major data platform players. If/when that happens, I hear you on it not being a delightful experience to have to wait for a new version of dbt-core. (Functionally, your primary_key constraint would be settable within the "grab bag" constraints config—and we'd keep this around for backward compatibility—but you'd need to wait for that new version to "promote" its definition to a top-level attribute.)

MichelleArk commented 1 year ago

@jtcohen6 I worry that this obfuscates how not_null is actually being implemented behind-the-scenes. It's still operating as a constraint, with all the other gotchas & caveats of other constraint types — e.g. not supported on views, only supported as a "post-flight" check on Databricks

@sungchun12 If I'm a developer building constraints for the first time OR adjusting constraints in an existing contract, I'll be wondering why not_null is outside of my constraints config hierarchy. I'll think to myself, "Is not_null not a constraint or a test, wait can I do that with unique, can I do that with anything custom?"

These are super valid usability concerns, and to me they outweigh the benefits of promoting not_null to a column-level attribute. That said, I do still think not null is 'special' for the reasons stated above: it's column-level only; supported & enforced on all data platforms; and part of the DB-API2 standard for integral column metadata.

My motivation for seeing not null as special is so that its presence can be used as input to detecting breaking changes to model contracts as part of the state:modified check. But it doesn't need to be a column-level attribute to implement that!

To me this comes down to what is part of the model contract (enforceable pre-flight, part of the state:modified check) vs model constraints (warehouse specific behaviour + configuration). The new constraint_support adapter method encodes which constraints have Constraint.ENFORCED. Enforced constraints would be part of the model contract, and any modifications that relax constraints that are part of the model contract would be considered breaking.

dbeatty10 commented 1 year ago

As a summary, there are six (6) constraint types that have been discussed:

  • not_null
  • check
  • unique
  • primary_key
  • foreign_key
  • custom

To help me process pros/cons, I made an attempt at re-phrasing points that have been raised so far.

Let's suppose that it should be special (and promoted to a column-level attribute):

Pros

Cons

My opinion: all constraints grouped together visually

TL;DR "it shouldn't be special"

From what I can tell so far, it seems like our code will be able to handle things either way. Please speak up if this is not the case!

If so, for me it would come down to what type of visual grouping we want for human cognition:

  1. all constraints together, regardless of enforced or not
  2. constraints separated between enforced and not enforced
  3. constraints separated between those that must be single-column and those that may be multi-column

Most often, I come down on the side of whatever the standards suggest (which would be Option 3: visually separating single-column and multi-column). In this case, I'd lean towards visually grouping all constraints together, since I think it will be more intuitive for users and it doesn't seem like we'd be giving up much, if anything at all. (Maybe I should have done pros/cons from the "not special" perspective instead!)

Ultimately, this is pretty loosely held for me, because I think our users will be successful either way as long as we have good docs and good error messages.

jtcohen6 commented 1 year ago

Upon further review, the ruling on the field is overturned! Thank you @sungchun12 @b-per @MichelleArk @dbeatty10 for weighing in with such thoughtful & well-articulated comments. Let's keep not_null nested within constraints for the foreseeable future.

We may want to add primary_key and foreign_key as top-level column-level (or dimension-level) attributes, and thereby enable some semantic layer wizardry... but that's a conversation for another thread, another day :)

sungchun12 commented 1 year ago

@jtcohen6 thanks for listening and caring so much. This thread is such a lovely role model for people to debate healthily and thoughtfully!

jtcohen6 commented 1 year ago

Closing in favor of #7066 + #7067 for implementation!