dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] Support constraints independently from enforcing a full model contract #10195

Open kmarq opened 5 months ago

kmarq commented 5 months ago

Is this your first time submitting a feature request?

Describe the feature

I would like the ability to use many of the features of a contract without having to fully define a rigid contract. This would allow constraints such as primary and foreign keys, not null, and others to be applied to the columns that are provided, while still allowing schema evolution to add new columns that are not governed by the contract.

Describe alternatives you've considered

Using post-hooks to alter tables and apply the constraints directly.

Decoupling constraints from contracts entirely, so the contract flag governs enforcement and privacy rather than controlling whether table constraints are applied.

Who will this benefit?

Developers using dbt, especially in earlier stages of data transformation, who want to use constraints to define relationships and key columns while leaving their models open to schema evolution as additional columns are loaded.

Are you interested in contributing this feature?

Possibly, but it may not be a great first one.

Anything else?

Breaking this off from the discussion here: Infer Schema contract discussion

graciegoheen commented 5 months ago

Hey! Looks like this is similar to this other ticket, which we closed as not planned.

This should allow the ability to apply constraints like Primary and Foreign keys, not null, and other constraints on the columns that are provided but then allow for schema evolution to add new columns that are not governed by the contract.

The way contracts work today, we apply the constraints and the schema (data types / column names) during the CREATE TABLE ... statement. So in order to apply a constraint at that time, we would also need to enforce the schema of the table.
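For illustration, the contract-enforced DDL ends up looking roughly like this (a simplified sketch in Snowflake-style SQL; the table, column, and schema names are made up, and the exact statement dbt renders varies by adapter):

-- Simplified sketch: with a contract, the full column list, data types, and
-- constraints are all emitted inside the CREATE TABLE itself, which is why a
-- partial column list isn't possible here.
create or replace table analytics.dim_customers (
    id integer not null primary key,
    customer_name varchar
) as (
    select id, customer_name
    from analytics.stg_customers
);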

It sounds like what you're asking for is to apply the constraint as an ALTER TABLE ... statement. Is that correct?

Most databases don't actually enforce constraints beyond not null - see our docs here. Which warehouse are you using?

For Snowflake, for example, even if you were to run an ALTER TABLE ... statement after the table had been created to add a primary key constraint, you could still insert records that violate that expectation, because the constraint isn't actually enforced.
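For instance (an illustrative sketch; the table name is made up), Snowflake accepts the constraint but won't reject violating rows:

-- The primary key constraint is accepted but not enforced, so duplicate ids
-- still insert without error.
alter table dim_customers add constraint dim_customers_pk primary key (id);

insert into dim_customers (id) values (1);
insert into dim_customers (id) values (1);  -- succeeds despite the "primary key"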

I would recommend using our unique and not_null tests if you want to validate that your primary key is unique / never null.
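For reference, a minimal sketch of what that looks like in the model yml, using the generic unique and not_null tests:

models:
  - name: dim_customers
    columns:
      - name: id
        tests:
          - unique
          - not_null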

kmarq commented 5 months ago

Thanks for the reply. From my understanding they are different: the referenced ticket still wants to fully contract a model (100% of fields defined), but adds a way to do a higher-level contract definition that can then be inherited and extended at the model level. I'm looking to define things only at the model level, but without having to define all fields when they aren't needed for the required constraints.

I could see applying this in the CREATE TABLE up front: you'd create the table with the provided columns and then allow schema evolution to add any additional columns the model defines that are not part of the partial contract. It could be done via ALTER after the fact as well, but I know that some databases won't let you apply PK constraints outside of table creation, so I would lean towards the partial definition within the create statement.

I am using Databricks, which doesn't actually enforce most constraints. However, I do want to be able to define the PK/FK relationships, and ideally the ones that are enforced, such as not null. I am using dbt fairly early in the data lifecycle, where I want to allow my tables to evolve as new source columns are added, but that ideally shouldn't prevent me from defining the known constraints up front.

Beyond testing the data, my end goal is to support data users who would benefit from being able to generate table relationships directly from table metadata instead of maintaining external ERD documents. Databricks can also generate ERD diagrams directly from table relationships where they are defined.
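For illustration, those relationships could then be read back out of metadata with something like this (a sketch assuming Unity Catalog's information_schema views; catalog names and join conditions may need adjusting):

-- Sketch: derive parent/child relationships from declared PK/FK constraints.
select
    fk.table_name   as child_table,
    kcu.column_name as child_column,
    pk.table_name   as parent_table,
    rc.constraint_name
from system.information_schema.referential_constraints rc
join system.information_schema.table_constraints fk
  on rc.constraint_name = fk.constraint_name
join system.information_schema.table_constraints pk
  on rc.unique_constraint_name = pk.constraint_name
join system.information_schema.key_column_usage kcu
  on fk.constraint_name = kcu.constraint_name;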

This is what leads me to suggest that constraints and contracts, while likely related, are really two separate concepts.

dbeatty10 commented 5 months ago

@kmarq Based on reading your feedback, I just updated the issue title to be "[Feature] Support constraints independently from enforcing a full model contract". Just let us know if that title doesn't seem accurate and we can try re-titling it.

kmarq commented 5 months ago

That's good by me. Thank you

graciegoheen commented 5 months ago

Thanks for the follow up.

I would lean towards the partial definition within the create statement.

I don't believe you can have a partial definition of your table within the CREATE statement. Looking at the Databricks docs:

table_specification: This optional clause defines the list of columns, their types, properties, descriptions, and column constraints. If you do not define columns the table schema you must specify either AS query or LOCATION.

If you want to define a table specification, you can't define only column constraints, nor can you define a partial list of columns. So I'm not sure that what you're suggesting would be possible. Since constraints and "contracts" are both required during the CREATE TABLE statement, we have them bundled together.

What I believe you can do is apply non-enforced constraints via an ALTER statement after the table has been created. Sounds like that's the current workaround you're using via post-hooks?

Is this feature request just to have dbt run that ALTER statement for you if you've configured just a constraint (without requiring the contract enforced: true config)?

models:
  - name: dim_customers
    columns:
      - name: id
        constraints:
          - type: primary_key
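For example (an illustrative sketch in Databricks-style SQL; exact syntax varies by warehouse), the implied statements might look roughly like:

-- A primary key column must be NOT NULL first; the informational constraint
-- can then be added after the table exists.
alter table dim_customers alter column id set not null;
alter table dim_customers add constraint dim_customers_pk primary key (id);
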
graciegoheen commented 5 months ago

Here's another related issue on using constraints for ERD diagrams like the use case you mentioned - https://github.com/dbt-labs/dbt-core/issues/3295

kmarq commented 5 months ago

If you want to define a table specification, you can't define only column constraints, nor can you define a partial list of columns. So I'm not sure that what you're suggesting would be possible. Since constraints and "contracts" are both required during the CREATE TABLE statement, we have them bundled together.

When I say partial, I mean that all columns requiring constraints would be provided up front so they can be included in the CREATE TABLE statement. Additional columns could then be added to the model via the normal schema evolution process. So I could define something like:

models:
  - name: dim_customers
    columns:
      - name: id
        constraints:
          - type: primary_key

My model might then have

select id, extra_col from table_a

Where extra_col would be added as a new nullable column during execution.

What I believe you can do is apply non-enforced constraints via an ALTER statement after the table has been created. Sounds like that's the current workaround you're using via post-hooks?

Is this feature request just to have dbt run that ALTER statement for you if you've configured just a constraint (not require the contract enforced: true config)?

models:
  - name: dim_customers
    columns:
      - name: id
        constraints:
          - type: primary_key

Applying the constraints as an after-creation ALTER would be acceptable for my use case. However, I really think that while contracts and constraints do seem highly related, since both are needed at table creation time, they are two separate concepts and ideally should be split. Contracts provide much more functionality beyond constraints, but constraints on their own, outside of contracts, can provide value as well.

kmarq commented 5 months ago

I recreated some previous testing of constraints. If you define a primary_key constraint on a column:

    columns:
      - name: col_1
        data_type: string
        constraints:
          - type: not_null
          - type: primary_key

The constraint shows up in the model.columns

"columns": {
        "col_1": {
            "constraints": [
                {
                    "type": "not_null",
                    "warn_unenforced": true,
                    "warn_unsupported": true
                },

whether the contract enforcement is true or false.

If you define a composite primary_key on the model

models:
  - name: model_a
    config:
      contract:
        enforced: true
    description: description
    columns:
      - name: col_1
        data_type: string
        constraints:
          - type: not_null
          - type: primary_key
      - name: col_2
        data_type: string
        constraints:
          - type: not_null
    constraints:
      - type: primary_key
        columns: [col_1, col_2]

Then in the model.constraints you will see this ONLY if contract enforcement is true:

 "constraints": [
        {
            "columns": [
                "col_1",
                "col_2"
            ],
            "type": "primary_key",
            "warn_unenforced": true,
            "warn_unsupported": true
        }
    ],
 "contract": {
        "alias_types": true,
        "checksum": "65ba38db6e4dd6a88418461c76b805acd2e39a113404a8a9c3d556c020564ea4",
        "enforced": true
    },

compared to false:

"constraints": [],
"contract": {
        "alias_types": true,
        "enforced": false
    },

So currently the ability to apply constraints after the fact, outside of contracts, is highly limited.
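For what it's worth, since column-level constraints do come through either way, a hypothetical post-hook macro could read them and emit non-enforced ALTERs after the table exists (an untested sketch; the macro name and SQL syntax are illustrative, and the model-level composite constraints above would still be missing):

{# Untested sketch: iterate over column-level constraints from the model yml
   and apply them after the table is created. Only column-level constraints
   survive without an enforced contract, so model-level ones are not covered. #}
{% macro apply_column_constraints() %}
  {% for column in model.columns.values() %}
    {% for constraint in column.get('constraints', []) %}
      {% if constraint.type == 'primary_key' %}
        {% do run_query("alter table " ~ this ~ " add constraint pk_" ~ column.name ~ " primary key (" ~ column.name ~ ")") %}
      {% endif %}
    {% endfor %}
  {% endfor %}
{% endmacro %}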

graciegoheen commented 5 months ago

Thanks for all of these examples, I really appreciate it.

So for the example you gave where you have a model with a constraint defined:

models:
  - name: dim_customers
    columns:
      - name: id
        constraints:
          - type: primary_key

and then that model has additional columns in the SQL:

-- models/dim_customers.sql

select
    id, 
    extra_col
from table_a

I'm not sure how "Additional columns could then be added to the model via the normal schema evolution process" would be possible today.

In this case, dbt would not be aware of extra_col when it's creating the CREATE TABLE statement since it's not defined in the model yml. So the CREATE TABLE statement would be incomplete - it would only define the id column with the primary_key constraint. But since it has no knowledge of extra_col when first creating that table, trying to then insert a record into that table with both id and extra_col would fail.

Additionally, you'd need to know the data types of the columns for the CREATE TABLE statement, and those are missing from the yml as well.

kmarq commented 5 months ago

Yeah, I see. I was definitely thinking Databricks-centric: we could do a merge with schema evolution enabled rather than a straight insert, which would handle adding extra_col automatically. We do that in our non-dbt processes today. The table gets created with any required audit columns and partition columns the user defines, then we do a merge to insert records so the schema updates.
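For context, that pattern looks roughly like this on Delta Lake (a sketch; table, column, and config names are illustrative, not dbt-generated SQL):

-- Enable automatic schema evolution for merges, then merge so that columns
-- present in the source but not the target (e.g. extra_col) get added.
set spark.databricks.delta.schema.autoMerge.enabled = true;

merge into dim_customers as target
using (select id, extra_col from table_a) as source
on target.id = source.id
when matched then update set *
when not matched then insert *;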

I could definitely do something on my side, even if that means overriding the table materialization to do something like that.

I think the bigger issue is being able to have that post-creation ALTER applied to put the non-enforced constraints in place. Given the issue above, it is not really possible to implement this in something like a post_hook. So maybe this is really a smaller request: have the constraints come through on the config even if the contract isn't enforced? Then adapters could apply them where it makes sense on the platform, or it could be done via a macro in a hook if necessary.

graciegoheen commented 5 months ago

Ah! That makes sense about the Databricks use case, thanks for explaining that to me. I think we'd need to think more about a solution that could work for all warehouses to add this "partial contract" support to dbt-core.

For other warehouses we may need to do something similar to how we get data types / column names for unit tests - creating a temp object with the "empty" query (where false limit 0) and then getting its column names + types via another metadata query (like describe) to add back into the CREATE TABLE statement. Though this could impact performance.
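Roughly, that approach might look like this (a sketch with illustrative names, in Databricks-style SQL):

-- Build a zero-row temp object from the model's SQL, then read its column
-- names and types from metadata to fold back into the CREATE TABLE statement.
create or replace temporary view dim_customers__dbt_tmp as
select id, extra_col
from table_a
where false
limit 0;

describe table dim_customers__dbt_tmp;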

So maybe this is really a smaller request to have the constraints come through on the config even if the contract isn't enforced?

I created a new issue for this piece -> https://github.com/dbt-labs/dbt-core/issues/10219

kmarq commented 5 months ago

The new one looks good, thank you.

Happy to help with any other required discussion around potential options to implement this one.

For other warehouses we may need to do something similar to how we get datatypes/column names for unit tests - creating a temp object with the "empty" query (where false limit 0) and then get its column names + types via another metadata query (like describe) to add back into the CREATE TABLE statement. Though, this could impact performance.

I just tried a few examples using this approach, and from a query perspective it is very fast. It is definitely more total steps, and that adds up. From my perspective, the users who want to enable something like this partial contract are likely more concerned with getting that additional capability than with the additional processing time. This wouldn't impact every model, only the ones defined this way, so it's at least a conscious tradeoff.

Thanks for the discussion.

b-akhil commented 3 weeks ago

+1