dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-1918] [Discussion] Unify constraints and constraints_check configs #6750

Closed MichelleArk closed 1 year ago

MichelleArk commented 1 year ago

In the foundational dbt constraints work, constraints: List[str] and constraints_check: str are optional column configs that configure column-level constraints and row-level checks, respectively.

Let's unify these two configs into a single optional constraints (List[Union[str, Dict[str, str]]]) that encompasses both sets of functionality. Relevant discussion: https://github.com/dbt-labs/dbt-core/pull/6271#discussion_r1084033137

Before:

columns:
  - name: price
    data_type: numeric
    constraints:
      - not null
    constraints_check: price > 0

After:

columns:
  - name: price
    data_type: numeric
    constraints:
      - not null
      - check: price > 0
        name: positive_price  # support database-specific check options (Postgres supports naming constraints)
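
For illustration only, here is a minimal sketch of how a mixed list like the one above might be normalized during parsing; the normalize_constraint helper and the output dict shape are hypothetical, not part of this proposal:

from typing import Dict, List, Union

def normalize_constraint(raw: Union[str, Dict[str, str]]) -> Dict[str, str]:
    """Normalize one entry of the unified 'constraints' list into a dict.

    Bare strings like "not null" become {"type": "not null"}; dicts such as
    {"check": "price > 0", "name": "positive_price"} get the check expression
    pulled out explicitly.
    """
    if isinstance(raw, str):
        return {"type": raw}
    if "check" in raw:
        normalized = {"type": "check", "expression": raw["check"]}
        if "name" in raw:
            normalized["name"] = raw["name"]
        return normalized
    return dict(raw)

constraints: List[Union[str, Dict[str, str]]] = [
    "not null",
    {"check": "price > 0", "name": "positive_price"},
]
print([normalize_constraint(c) for c in constraints])
# [{'type': 'not null'}, {'type': 'check', 'expression': 'price > 0', 'name': 'positive_price'}]
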
jtcohen6 commented 1 year ago

I definitely think this will be a step in the right direction, by unifying into a single constraints attribute.

I'm not 100% convinced that List[Union[str, Dict[str, str]]] is the perfect data structure either, though it might be close.

MichelleArk commented 1 year ago

Instead of just accepting strings like "not null," would we want this to be not_null: true|false? Would that make it easier or harder for us to compare constraints across project states, in order to detect breaking changes to contracts in CI and require a version bump?

I'm all for standardizing on a separate not_null: true|false (or nullable: true|false) config - that would definitely make it more straightforward to detect breaking changes.

jtcohen6 commented 1 year ago

I'm all for standardizing on a separate not_null: true|false (or nullable: true|false) config - that would definitely make it more straightforward to detect breaking changes.

It would also standardize across data platforms, which call this either not null (most) or nullable (BigQuery). (Update: BigQuery calls this nullable in its API, but within the create table statement, it's just not null.)

A related thought here, prompted by https://github.com/dbt-labs/dbt-snowflake/pull/341#discussion_r1057422851. We should create methods/mechanisms for each adapter to say:

columns:
  - name: id
    data_type: numeric
    constraints:

      # on all data platforms (even ones that call this 'nullable')
      - not_null: true

      # only allowed for data platforms that actually enforce 'check' constraints
      - check: "id > 0"

      # everything else, for metadata purposes (and sometimes query optimization ... oh boy)
      - unenforced: primary key
      - unenforced: foreign key

Note that BigQuery will actually raise an explicit error if you try to set constraints that aren't enforced! Personally, I'm supportive of this, but it's another platform-specific behavior to be aware of:

create table dbt_jcohen.my_cool_table (id int primary key) as (select 1 as id);
Enforcement of primary keys is not supported
emmyoop commented 1 year ago

This will require follow-up on all the adapters to determine what needs to be changed there.

jtcohen6 commented 1 year ago

Summarizing from what I remember of yesterday's discussion:

Things we agree on:

Questions:

  1. Should the constraint "type" be inferred from the presence of a key (check, not_null, unenforced), or should this be an explicit key-value pair (type: check, type: not_null, type: <anything else>)?
    • not_null: true|false, or just not_null as a string (like the generic test)? When would you ever add not_null: false?
  2. Should not_null be its own top-level boolean attribute of the column, rather than nested under constraints?
    • Pro: DB-API2 / PEP 249 has null_ok as an integral column attribute, alongside name and data_type.
    • Pro: not_null is only column-level. Unlike check constraints, or custom unenforced constraints, it doesn't ever make sense to define at the model level. That could be one reason to define & treat it separately from the more generic constraints data structure. (E.g. On Databricks, not null constraints are applied column-level as alter table alter column ... set not null, versus check constraints which are applied table-level as alter table add constraint ...)
    • Con: I worry that this obfuscates how not_null is actually being implemented behind-the-scenes. It's still operating as a constraint, with all the other gotchas & caveats of other constraint types — e.g. not supported on views, only supported as a "post-flight" check on Databricks

@MichelleArk Let's aim to lock down answers to these questions by next week, so that we can make these changes sooner rather than later: ideally by b2 (March 1), otherwise by b3 (March 15).

@sungchun12 @dave-connors-3 @Victoriapm @b-per Happy to hear any opinions / instincts / leanings you might have!

MichelleArk commented 1 year ago
  1. Should not_null be its own top-level boolean attribute of the column, rather than nested under constraints?
    • Con: I worry that this obfuscates how not_null is actually being implemented behind-the-scenes. It's still operating as a constraint, with all the other gotchas & caveats of other constraint types — e.g. not supported on views, only supported as a "post-flight" check on Databricks

Is this already true for data_type as a column-level attribute? Not necessarily a reason to double down, but just pointing out that there is some precedent here, and that not_null at the column level would be consistent with data_type in terms of functional behaviour.

jtcohen6 commented 1 year ago

Yes, data_type is already a column-level attribute! I agree that's a compelling reason to keep the two together.

If we can make #6751 happen, the way we'll verify the "contract" for data types (as a pre-flight check) will differ from the way we verify not_null (passed into the data platform, for enforcement during create / insert / merge / etc.). The data type enforcement won't actually be via a "constraint" in the sense of the database capability named "constraints", whereas the not_null enforcement would very much be a "constraint." We shouldn't make too big a deal of that; data platform vendors not infrequently use the same familiar words in subtly different ways.

sungchun12 commented 1 year ago
  • When detecting breaking changes to model contracts, we ignore all unenforced constraints
columns:
  - name: id
    data_type: numeric
    constraints:

      # on all data platforms (even ones that call this 'nullable')
      - not_null: true

      # only allowed for data platforms that actually enforce 'check' constraints
      - check: "id > 0"

      # everything else, for metadata purposes (and sometimes query optimization ... oh boy)
      - unenforced: primary key
      - unenforced: foreign key

Note that BigQuery will actually raise an explicit error if you try to set constraints that aren't enforced! Personally, I'm supportive of this, but it's another platform-specific behavior to be aware of:

create table dbt_jcohen.my_cool_table (id int primary key) as (select 1 as id);
Enforcement of primary keys is not supported

Postgres enforces primary and foreign keys. Would we then implement formal config constraints like: primary_key: true and foreign_key: true?

Also, I'm a big fan of housing all the constraints config under a single config vs. decoupling constraints and checks.

jtcohen6 commented 1 year ago

@sungchun12 Really good point that Postgres actually does enforce primary key and foreign key constraints! There aren't any other analytical data platforms that enforce those constraints, and I don't think we should design this feature primarily for Postgres. That said, we could imagine a future where those data platforms do support all four of the ANSI SQL standard constraints—unique, primary key, foreign key, not null—plus custom row-level check constraints as well.

So we do need some way for each adapter to tell us which constraints it's actually able to enforce at build time. For Postgres, that's all of them. For Redshift, Snowflake, and BigQuery, it's just not_null. For Databricks, per the current implementation in dbt-spark / dbt-databricks, it's actually none of them—even though not null and check constraints are enforced, they aren't enforceable until after atomically materializing a table—but we're talking to their team about ways they/we could make this better.

Then, we can use that adapter-level check to tell us:

  • If this constraint is actually going to do anything. We should warn the user about constraints that aren't both supported and enforced. (We should also give them a way to silence those warnings.)
  • Whether this constraint type can be considered part of the model's contract (set of build-time guarantees), and whether removing it qualifies as a breaking change to that contract (https://github.com/dbt-labs/dbt-core/issues/6869)

Decision: not_null is special

Update: This decision has been reversed, after further discussion below. I'm leaving the rest of this comment as-is for posterity. That includes my original reasoning here, and the implications for the spec further down. We'll be creating a set of new issues to follow up on this discussion.

Why?

  • It's supported & enforced by all our data platforms.
  • "A not-null constraint is always written as a column constraint." (Postgres docs)
  • It's one of the integral attributes of columns, per the DB-API2 spec.

So I think we should "promote" not_null to be a column-level attribute, outside the constraints config. In the meantime, we will present & document that constraints varies greatly by data platform — caveat usor, passibus vestra variari.

Data structure

Constraint is a new object type. Pseudo code:

from dataclasses import dataclass, field
from enum import StrEnum  # Python 3.11+; earlier versions can use (str, Enum)
from typing import List, Optional


class ConstraintOption(StrEnum):
    check = "check"
    unique = "unique"
    primary_key = "primary_key"
    foreign_key = "foreign_key"
    custom = "custom"  # I have no idea -- but let's leave it open for the wacky & wild data platforms out there


@dataclass
class ColumnLevelConstraint:
    type: ConstraintOption
    name: Optional[str] = None
    # raise a warning if this constraint is not enforced by this data platform (but it will still be included in templated DDL)
    warn_unenforced: bool = True
    # raise a warning if this constraint is not supported by this data platform (and it will be excluded from templated DDL)
    warn_unsupported: bool = True
    # free-for-all; required if the constraint is 'check' or 'foreign_key' -- can we validate via post-init?
    expression: Optional[str] = None
    # additional properties allowed: yes! there are going to be lots of constraint-specific and platform-specific shenanigans


# the same, with an added 'columns' attribute (which isn't needed when the constraint is defined for one column)
@dataclass
class ModelLevelConstraint(ColumnLevelConstraint):
    columns: List[str] = field(default_factory=list)

The corresponding YAML spec:

models:
  constraints: Optional[List[ModelLevelConstraint]]
  columns:
    - name: str
      data_type: Optional[str]
      not_null: Optional[bool]
      constraints: Optional[List[ColumnLevelConstraint]]
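
On the "can we validate via post-init?" question above: one option is a small validation hook like the following sketch, which assumes the ColumnLevelConstraint shape from the pseudo code (the function name and error wording are hypothetical):

def validate_constraint_expression(constraint) -> None:
    """Could be called from ColumnLevelConstraint.__post_init__ (or a later dbt validation step)."""
    # 'check' and 'foreign_key' constraints are meaningless without an expression
    if str(constraint.type) in ("check", "foreign_key") and not constraint.expression:
        raise ValueError(f"A '{constraint.type}' constraint requires an 'expression'")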

Notes:

  • constraints is an optional attribute at both the model level and the column level
  • If constraints is set at the model level, it should be a model-level attribute, not a "config." This means it isn't settable in {{ config() }} or in dbt_project.yml, and that's okay.
  • warn_unenforced & warn_unsupported could alternatively be Optional[bool], and flipped/renamed to ignore_warning_unenforced & ignore_warning_unsupported, so that the data is only included when it's a non-default value
  • not_null is a boolean column-level attribute. It is NOT model-level, and it is NOT included in the constraints data structure!
  • That's about as smart as I feel like being for now. (And even this might still be too smart.) I don't care to do any detection of, e.g., multiple unique constraints being defined for a single column, or primary_key constraints being defined for multiple columns in a single model. All of that will be caught at runtime, and raise an error in the data platform. I want the minimum viable data structure that enables us to detect breaking changes to model contracts.

New adapter method

We need a new adapter method that can tell dbt whether a constraint is:

  1. supported & enforced
  2. supported but not enforced
  3. not supported

Here's what that might look like for dbt-snowflake, in pseudo code:

@classmethod
def constraint_support(cls):
    return {
        "not_null": Constraint.ENFORCED,
        "primary_key": Constraint.NOT_ENFORCED,
        "foreign_key": Constraint.NOT_ENFORCED,
        "unique": Constraint.NOT_ENFORCED,
        "check": Constraint.NOT_SUPPORTED,
    }

As noted at the top, this method would be called in two places at runtime:

  1. During node selection, to determine which constraint diffs qualify as breaking changes to the model's contract (https://github.com/dbt-labs/dbt-core/issues/6869)
  2. During materialization, to raise warnings for unenforced/unsupported constraints (unless the user has disabled those warnings)

The assumption is that not_null is supported & enforced for all data platforms. If there's a data platform that doesn't enforce/support it, that platform's adapter will need to own the additional logic for handling users who set not_null: true on columns.
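
As a rough sketch of how (2) might use that method at materialization time, assuming the ColumnLevelConstraint shape above and treating the constraint type as a plain string (the function name and warning wording are hypothetical; the real logic would live in dbt-core / adapter code):

from enum import Enum
from typing import Any, Dict, List
import warnings

class Constraint(Enum):  # the three support levels, mirroring the pseudo code above
    ENFORCED = "enforced"
    NOT_ENFORCED = "not_enforced"
    NOT_SUPPORTED = "not_supported"

def filter_and_warn(constraints: List[Any], support: Dict[str, Constraint]) -> List[Any]:
    """Decide which constraints make it into templated DDL, warning where appropriate."""
    included = []
    for c in constraints:
        level = support.get(c.type, Constraint.NOT_SUPPORTED)
        if level is Constraint.NOT_SUPPORTED:
            if c.warn_unsupported:
                warnings.warn(f"'{c.type}' constraints are not supported on this platform; excluding from DDL")
            continue  # excluded from templated DDL
        if level is Constraint.NOT_ENFORCED and c.warn_unenforced:
            warnings.warn(f"'{c.type}' constraints are not enforced on this platform; including in DDL anyway")
        included.append(c)
    return included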

Examples

I'm using dbt-postgres. Everything is supported & enforced. The removal or modification* of any existing constraint should be considered a breaking change. This includes check constraints. (Should we be nice and say, wellllll if the check constraint has the same name, even if the expression has changed, it's probably the same?)

models:
  - name: my_model
    columns:
      - name: id
        data_type: int
        not_null: true
        constraints:
          - type: primary_key
          - type: check
            expression: ">0"
            name: id_never_negative
      ...
    constraints:
      - name: my_compound_foreign_key
        type: foreign_key
        columns: [customer_id, customer_region_id]
        # we could be cleverer about the structure here, but I'm hesitant -- want to find a stopping point
        expression: "customers_by_region (id, region_id)" # can also include things like 'ON DELETE SET NULL ...'

dbt will produce SQL like:

create table my_schema.my_model (
    id int not null primary key constraint id_never_negative check (id > 0),
    ...
    constraint my_compound_foreign_key foreign key (customer_id, customer_region_id) references customers_by_region (id, region_id)
);
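
In practice dbt templates this DDL through adapter macros; purely to illustrate the mapping from the spec above to those clauses, here's a rough Python sketch (the helper name and dict shape are hypothetical):

from typing import Dict, Optional

def render_column_constraint(c: Dict[str, str]) -> Optional[str]:
    """Map one column-level constraint entry to an inline, Postgres-flavored DDL clause."""
    prefix = f"constraint {c['name']} " if c.get("name") else ""
    ctype = c["type"]
    if ctype == "primary_key":
        return prefix + "primary key"
    if ctype == "unique":
        return prefix + "unique"
    if ctype == "check":
        return f"{prefix}check ({c['expression']})"
    if ctype == "foreign_key":
        return f"{prefix}references {c['expression']}"
    return None  # custom / platform-specific constraints handled elsewhere

id_constraints = [
    {"type": "primary_key"},
    {"type": "check", "expression": "id > 0", "name": "id_never_negative"},
]
clauses = " ".join(filter(None, (render_column_constraint(c) for c in id_constraints)))
print(f"id int not null {clauses}")
# id int not null primary key constraint id_never_negative check (id > 0)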

I'm using dbt-snowflake. I want to add an unenforced primary_key constraint to my id column. I'm well aware it doesn't do anything—I'm putting it there for integration with a data catalog tool—and I just want to stop seeing the warning. If I remove it later on, it won't be considered a breaking change, because it's not really part of the model contract. However, if I change not_null: true, that would be considered a breaking change—it's an integral part of the contract, enforced at build time.

columns:
  - name: id
    data_type: integer
    not_null: true
    constraints:
      - type: primary_key
        warn_unenforced: false  # make the warning disappear

dbt will produce SQL like:

create table my_schema.my_model (
    id int not null primary key,
    ...
);
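
To make the breaking-change rule in these two examples concrete, here's a rough sketch of the state comparison, using plain dicts and hypothetical helper names; the real check would compare manifests as part of state:modified (https://github.com/dbt-labs/dbt-core/issues/6869), and for this sketch not_null is folded into the constraints list even though the spec above keeps it as a column attribute:

from typing import Dict, List, Set, Tuple

def contracted_constraints(constraints: List[dict], support: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Keep only constraints the adapter enforces -- those are the ones in the model contract."""
    return {
        (c["type"], c.get("expression") or "")
        for c in constraints
        if support.get(c["type"]) == "enforced"
    }

def is_breaking_change(old: List[dict], new: List[dict], support: Dict[str, str]) -> bool:
    # Relaxing the contract: any enforced constraint that existed before but is gone now
    return not contracted_constraints(old, support) <= contracted_constraints(new, support)

snowflake_support = {"not_null": "enforced", "primary_key": "not_enforced"}
old = [{"type": "not_null"}, {"type": "primary_key"}]
print(is_breaking_change(old, [{"type": "not_null"}], snowflake_support))     # False: primary_key wasn't enforced
print(is_breaking_change(old, [{"type": "primary_key"}], snowflake_support))  # True: not_null was part of the contract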

Questions?

Drop a comment here, or let's discuss in Slack! I want to get us unblocked for estimation & implementation here ASAP. Given the amount I ended up writing in this comment, it might make sense to split up this work into multiple tickets.

sungchun12 commented 1 year ago

@jtcohen6 Replying here for traceability, and let's get down and dirty in slack!

So we do need some way for each adapter to tell us which constraints it's actually able to enforce at build time. For Postgres, that's all of them. For Redshift, Snowflake, and BigQuery, it's just not_null. For Databricks, per the current implementation in dbt-spark / dbt-databricks, it's actually none of them—even though not null and check constraints are enforced, they aren't enforceable until after atomically materializing a table—but we're talking to their team about ways they/we could make this better.

Then, we can use that adapter-level check to tell us:

  • If this constraint is actually going to do anything. We should warn the user about constraints that aren't both supported and enforced. (We should also give them a way to silence those warnings.)
  • Whether this constraint type can be considered part of the model's contract (set of build-time guarantees), and whether removing it qualifies as a breaking change to that contract (https://github.com/dbt-labs/dbt-core/issues/6869)

I agree with telling the user which constraints are enforceable at build time. Probably a terminal logging message that lets the user know clearly what's NOT being enforced and points to the specific config shouting this warning vs. a generic, "Hey you're using primary key somewhere in the mix and that's a no no. Good luck finding it". I also like the extra flavor of whether it qualifies as a breaking change to the contract. That opinionated flavor is just the right touch to make this a healthy guardrail during development.

Decision: not_null is special

Why?

  • It's supported & enforced by all our data platforms.
  • "A not-null constraint is always written as a column constraint." (Postgres docs)
  • It's one of the integral attributes of columns, per the DB-API2 spec.

So I think we should "promote" not_null to be a column-level attribute, outside the constraints config. In the meantime, we will present & document that constraints varies greatly by data platform — caveat usor, passibus vestra variari.

I understand the implementation reasoning, but the developer reasoning conflicts with this logic in practice. If I'm a developer building constraints for the first time OR adjusting constraints in an existing contract, I'll be wondering why not_null is outside of my constraints config hierarchy. I'll think to myself, "Is not_null not a constraint or a test, wait can I do that with unique, can I do that with anything custom?" And let's say I want to build functionality and call the manifest to run dbt commands based on constraints configs, how do I ensure not_null is flowing through properly in my contract without second guessing where the config is placed?

Notes:

constraints is an optional attribute at both the model level and the column level

Agreed

If constraints is set at the model level, it should be a model-level attribute, not a "config." This means it isn't settable in {{ config() }} or in dbt_project.yml, and that's okay.

Agreed AND I'd like to see +contracts: true settable in dbt_project.yml that points the developer to define constraints at the model level

warn_unenforced & warn_unsupported could alternatively be Optional[bool], and flipped/renamed to ignore_warning_unenforced & ignore_warning_unsupported, so that the data is only included when it's a non-default value

Agreed on functionality, I recommend warn_unenforced & warn_unsupported as it's easier to reason about the bool logic at a glance

not_null is a boolean column-level attribute. It is NOT model-level, and it is NOT included in the constraints data structure!

Agreed on boolean column-level attribute and NOT model-level. Do NOT agree on constraints config hierarchy.

That's about as smart as I feel like being for now. (And even this might still be too smart.) I don't care to do any detection of, e.g., multiple unique constraints being defined for a single column, or primary_key constraints being defined for multiple columns in a single model. All of that will be caught at runtime, and raise an error in the data platform. I want the minimum viable data structure that enables us to detect breaking changes to model contracts.

Agreed, we don't want to provide so many guardrails that we create a prison and not the fun sandbox it should be. Thankfully, the databases provide helpful tips when constraints are wrong based on ad hoc testing.

We need a new adapter method that can tell dbt whether a constraint is:

  1. supported & enforced
  2. supported but not enforced
  3. not supported

Agreed on these primitives!

I'm using dbt-postgres. Everything is supported & enforced. The removal or modification* of any existing constraint should be considered a breaking change. This includes check constraints. (Should we be nice and say, wellllll if the check constraint has the same name, even if the expression has changed, it's probably the same?)

I'd say BREAKING change because the substance of the contract has changed because the expression changed. I'd be okay with it NOT being a breaking change if ONLY the name of the check is different but the expression remains the same.

jtcohen6 commented 1 year ago

@sungchun12 Love this feedback! The only place where you + I disagree is about whether not_null should be "special." To summarize the arguments for/against:

@MichelleArk @dbeatty10 Very open to hearing your thoughts / votes one way or the other!

Given that this is the only outstanding question, I think we can move this issue out of "refinement" and into "estimation." We may want to update the original issue description with the final proposed spec, or open as a new "clean" issue.

b-per commented 1 year ago

I would be more in favor of "not special".

If more warehouses start to enforce additional constraints, e.g. unique, people might start wondering why not_null and unique are configured differently when they actually are both enforced by the warehouse.

jtcohen6 commented 1 year ago

@b-per Thanks for weighing in!!

Let me flip that around: If that happens — or shall we say "when" :) — what would you think about promoting unique and primary_key as first-class column-level attributes? Alongside not_null, data_type, name. These are all things that are guaranteed to be true about the column.

In the current proposal, constraints is intentionally a more-flexible "grab bag": these are things that might be true, according to you, with support that varies across platforms.

b-per commented 1 year ago

With this approach, are we saying that when a warehouse starts supporting primary_key we will need to change dbt-core (and/or dbt-warehouse) to get it supported in dbt? Or are you saying that we would make those available from the get-go, even if they don't work today? What would not be a great experience is having to wait for the next dbt minor version to get support added once it is done on the warehouse side.

On the first-class attribute list, would we also consider foreign_key?

jtcohen6 commented 1 year ago

@b-per I'm not aware of any columnar / analytical data platforms that can enforce constraints for uniqueness, primary keys, and foreign keys. (I think they can be sorta supported in columnar extensions/forks of transactional databases, with some gotchas — e.g. Citus docs — but just as often not, e.g. MariaDB column store.)

So I think I'd be open to adding these as first-class attributes once there's been a precedent set by at least one (ideally two) of the major data platform players. If/when that happens, I hear you on it not being a delightful experience to have to wait for a new version of dbt-core. (Functionally, your primary_key constraint would be settable within the "grab bag" constraints config—and we'd keep this around for backward compatibility—but you'd need to wait for that new version to "promote" its definition to a top-level attribute.)

MichelleArk commented 1 year ago

@jtcohen6 I worry that this obfuscates how not_null is actually being implemented behind-the-scenes. It's still operating as a constraint, with all the other gotchas & caveats of other constraint types — e.g. not supported on views, only supported as a "post-flight" check on Databricks

@sungchun12 If I'm a developer building constraints for the first time OR adjusting constraints in an existing contract, I'll be wondering why not_null is outside of my constraints config hierarchy. I'll think to myself, "Is not_null not a constraint or a test, wait can I do that with unique, can I do that with anything custom?"

These are super valid usability concerns, and to me they outweigh the benefits of promoting not_null to a column-level attribute. That said, I do still think not null is 'special' for the reasons stated above: it's column-level only; supported & enforced on all data platforms; and part of the DB-API2 standard for integral column metadata.

My motivation for seeing not null as special is so that its presence can be used as input to detecting breaking changes to model contracts as part of the state:modified check. But it doesn't need to be a column-level attribute to implement that!

To me this comes down to what is part of the model contract (enforceable pre-flight, part of the state:modified check) vs model constraints (warehouse specific behaviour + configuration). The new constraint_support adapter method encodes which constraints have Constraint.ENFORCED. Enforced constraints would be part of the model contract, and any modifications that relax constraints that are part of the model contract would be considered breaking.

dbeatty10 commented 1 year ago

As a summary, there are six (6) constraint types that have been discussed:

  • not_null
  • check
  • unique
  • primary_key
  • foreign_key
  • custom

To help me process pros/cons, I made an attempt at re-phrasing points that have been raised so far.

Let's suppose that it should be special (and promoted to a column-level attribute):

Pros

Cons

My opinion: all constraints grouped together visually

TL;DR "it shouldn't be special"

From what I can tell so far, it seems like our code will be able to handle things either way. Please speak up if this is not the case!

If so, for me it would come down to what type of visual grouping we want for human cognition:

  1. all constraints together, regardless of enforced or not
  2. constraints separated between enforced and not enforced
  3. constraints separated between those that must be single-column and those that may be multi-column

Most often, I come down on the side of whatever the standards suggest (which would be Option 3: visually separating single-column and multi-column). In this case, I'd lean towards visually grouping all constraints together, since I think it will be more intuitive for users and it doesn't seem like we'd be giving up much, if anything at all. (Maybe I should have done pros/cons from the "not special" perspective instead!)

Ultimately, this is pretty loosely held for me, because I think our users will be successful either way as long as we have good docs and good error messages.

jtcohen6 commented 1 year ago

Upon further review, the ruling on the field is overturned! Thank you @sungchun12 @b-per @MichelleArk @dbeatty10 for weighing in with such thoughtful & well-articulated comments. Let's keep not_null nested within constraints for the foreseeable future.

We may want to add primary_key and foreign_key as top-level column-level (or dimension-level) attributes, and thereby enable some semantic layer wizardry... but that's a conversation for another thread, another day :)

sungchun12 commented 1 year ago

@jtcohen6 thanks for listening and caring so much. This thread is such a lovely role model for people to debate healthily and thoughtfully!

jtcohen6 commented 1 year ago

Closing in favor of #7066 + #7067 for implementation!