Open kmarq opened 5 months ago
Hey! Looks like this is similar to this other ticket, which we closed as not planned.
This should allow the ability to apply constraints like Primary and Foreign keys, not null, and other constraints on the columns that are provided but then allow for schema evolution to add new columns that are not governed by the contract.
The way contracts work today is that we apply the constraints and schema (data types / column names) during the CREATE TABLE ... statements. So in order to apply the constraint at that time, we would also need to enforce the schema of the table.
It sounds like what you're asking for is to apply the constraint as an ALTER TABLE ... statement. Is that correct?
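To make the distinction concrete, here's a rough sketch (table and column names are purely illustrative, not the DDL dbt generates verbatim):

-- how constraints are applied today: as part of the CREATE TABLE,
-- which is why the full schema has to be enforced at the same time
create table dim_customers (
    id bigint not null,
    customer_name varchar,
    constraint pk_dim_customers primary key (id)
);

-- the alternative being asked about: create the table first,
-- then attach the constraint afterwards
alter table dim_customers
    add constraint pk_dim_customers primary key (id);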
Most databases don't actually enforce constraints beyond not null - see our docs here. Which warehouse are you using?
For Snowflake, for example, even if you were to run an ALTER TABLE ... statement after the table had been created to add a primary key constraint, you could still insert records that violate that expectation because the constraint is not actually enforced.
I would recommend using our unique and not_null tests if you want to validate that your primary key is unique / never null.
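For example, something along these lines in the model yml (model and column names are illustrative) would check that the key is unique and never null without requiring a contract:

models:
  - name: dim_customers
    columns:
      - name: id
        tests:
          - unique
          - not_null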
Thanks for the reply. From my understanding they are different in that the referenced ticket is still looking to fully contract a model (100% of fields defined) but allowing a way to do a higher level contract definition that can then be inherited and extended at the model level. I'm looking to still define just at the model level, but not have to define all fields if they are not important for the constraints required.
I could see applying this in the create table up front, and you'd create the table with the provided columns and then would have to allow for schema evolution to add any additional columns that the model defines but are not part of the partial contract. It could be done via alter after the fact as well, but I know that some databases won't let you apply PK constraints outside of table creation, so I would lean towards the partial definition within the create statement.
I am using Databricks, which does not actually enforce the majority of constraints. However, I do want to be able to define the PK/FK relationships and ideally the ones that are enforced, such as not null. I am using dbt pretty early in the data lifecycle, where I want to allow my tables to evolve as new source columns are added, but that ideally wouldn't prevent me from defining the known constraints up front.
Beyond the testing of the data, my end goal is to support data users that will benefit from being able to generate table relationships directly from the table metadata rather than maintaining external ERD documents, as well as taking advantage of Databricks' ability to generate ERD diagrams directly from the table relationships where they are defined.
This is what leads me to suggest that constraints and contracts while likely related are really two separate concepts.
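For illustration, the relationship metadata I have in mind would just be primary_key / foreign_key constraints in the model yml, something like this (model and column names are only examples):

models:
  - name: fct_orders
    columns:
      - name: customer_id
        constraints:
          - type: foreign_key
            expression: "dim_customers (id)"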
@kmarq Based on reading your feedback, I just updated the issue title to be "[Feature] Support constraints independently from enforcing a full model contract". Just let us know if that title doesn't seem accurate and we can try re-titling it.
That's good by me. Thank you
Thanks for the follow up.
I would lean towards the partial definition within the create statement.
I don't believe you can have a partial definition of your table within the create statement. Looking at the Databricks docs:
table_specification This optional clause defines the list of columns, their types, properties, descriptions, and column constraints. If you do not define columns the table schema you must specify either AS query or LOCATION.
If you want to define a table specification, you can't only define column constraints nor can you define a partial list of columns. So I'm not sure that what you're suggesting would be possible. Since constraints and "contracts" are both required during the CREATE TABLE statement, we have them bundled together.
What I believe you can do is apply non-enforced constraints via an ALTER statement after the table has been created. Sounds like that's the current workaround you're using via post-hooks?
Is this feature request just to have dbt run that ALTER statement for you if you've configured just a constraint (not requiring the contract enforced: true config)?
models:
  - name: dim_customers
    columns:
      - name: id
        constraints:
          - type: primary_key
Here's another related issue on using constraints for ERD diagrams like the use case you mentioned - https://github.com/dbt-labs/dbt-core/issues/3295
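For reference, a post-hook workaround along those lines might look roughly like this (a sketch only - model and relation names are illustrative, and the constraint name and exact ALTER syntax depend on the warehouse):

-- models/dim_customers.sql
{{ config(
    post_hook="alter table {{ this }} add constraint pk_dim_customers primary key (id)"
) }}

select id, customer_name from {{ ref('stg_customers') }}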
If you want to define a table specification, you can't only define column constraints nor can you define a partial list of columns. So I'm not sure that what you're suggesting would be possible. Since constraints and "contracts" are both required during the CREATE TABLE statement, we have them bundled together.
When I say partial, I mean that all columns requiring constraints would be provided up front so they can be done in the Create Table statement. Additional columns could then be added to the model via the normal schema evolution process. So I could define something like:
models:
  - name: dim_customers
    columns:
      - name: id
        constraints:
          - type: primary_key
My model might then have
select id, extra_col from table_a
Where extra_col would be added as a new nullable column during execution.
What I believe you can do is apply non-enforced constraints via an ALTER statement after the table has been created. Sounds like that's the current workaround you're using via post-hooks?
Is this feature request just to have dbt run that ALTER statement for you if you've configured just a constraint (not requiring the contract enforced: true config)?
Applying the constraints as an after-creation alter would be acceptable for my use case. However, I really think that, while contracts and constraints do seem highly related due to being needed at table creation time, they are two separate concepts and ideally should be split. Contracts provide much more functionality beyond constraints, but constraints on their own, outside of contracts, can provide value as well.
Recreated some previous testing of constraints. If you define a primary_key constraint on a column:
columns:
  - name: col_1
    data_type: string
    constraints:
      - type: not_null
      - type: primary_key
The constraint shows up in the model.columns
"columns": {
"col_1": {
"constraints": [
{
"type": "not_null",
"warn_unenforced": true,
"warn_unsupported": true
},
whether the contract enforcement is true or false.
If you define a composite primary_key on the model
models:
  - name: model_a
    config:
      contract:
        enforced: true
    description: description
    columns:
      - name: col_1
        data_type: string
        constraints:
          - type: not_null
          - type: primary_key
      - name: col_2
        data_type: string
        constraints:
          - type: not_null
    constraints:
      - type: primary_key
        columns: [col_1, col_2]
Then in the model.constraints you will see this ONLY if contract enforcement is true:
"constraints": [
{
"columns": [
"col_1",
"col_2"
],
"type": "primary_key",
"warn_unenforced": true,
"warn_unsupported": true
}
],
"contract": {
"alias_types": true,
"checksum": "65ba38db6e4dd6a88418461c76b805acd2e39a113404a8a9c3d556c020564ea4",
"enforced": true
},
compared to false:
"constraints": [],
"contract": {
"alias_types": true,
"enforced": false
},
So currently the ability to apply constraints after the fact, outside of contracts, is highly limited.
Thanks for all of these examples, I really appreciate it.
So for the example you gave where you have a model with a constraint defined:
models:
  - name: dim_customers
    columns:
      - name: id
        constraints:
          - type: primary_key
and then that model has additional columns in the SQL:
-- models/dim_customers.sql
select
    id,
    extra_col
from table_a
I'm not sure how "Additional columns could then be added to the model via the normal schema evolution process" would be possible today.
In this case, dbt would not be aware of extra_col when it's creating the CREATE TABLE statement since it's not defined in the model yml. So the CREATE TABLE statement would be incomplete - it would only define the id column with the primary_key constraint. But since it has no knowledge of extra_col when first creating that table, trying to then insert a record into that table with both id and extra_col would fail.
Additionally, you'd need to know the data types of the columns for the CREATE TABLE statement; and those are missing from the yml as well.
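Put concretely (types and names below are just illustrative), the sequence would be something like this, and the load step is where it breaks:

-- dbt only knows about id, so the table is created without extra_col
create table dim_customers (
    id bigint,
    constraint pk_dim_customers primary key (id)
);

-- the model's SELECT produces both columns, but extra_col doesn't exist
-- on the target table, so loading the results would fail
insert into dim_customers (id, extra_col)
select id, extra_col from table_a;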
Yeah, I see. I was definitely thinking Databricks-centric and that we could do a merge with schema evolution enabled vs a straight insert. That would handle automatically adding extra_col. We do that in our non-dbt processes today: the table gets created with any audit-required columns and partition columns the user defines, then we do a merge to insert records so the schema updates.
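For context, the pattern I'm describing on Databricks is roughly this (table names are illustrative; the session setting is what lets the merge add new columns to the target schema):

-- allow the merge to evolve the target's schema with columns only present in the source
set spark.databricks.delta.schema.autoMerge.enabled = true;

-- the target was created up front with just the constrained / audit / partition columns;
-- extra_col gets added to dim_customers by the merge
merge into dim_customers as t
using staging_customers as s
  on t.id = s.id
when matched then update set *
when not matched then insert *;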
I could definitely do something on my side, even if it's overriding the table materialization to do something like that.
I think the bigger issue is the ability to have that post-table-creation alter applied to put the non-enforced constraints in place. With the above issue, it is not really possible to implement in something like a post_hook. So maybe this is really a smaller request to have the constraints come through on the config even if the contract isn't enforced? Then adapters could implement applying them if it made sense on the platform, or it could be done via a macro in a hook if necessary.
Ah! That makes sense about the Databricks use case, thanks for explaining that to me. I think we'd need to think more about a solution that could work for all warehouses to add this "partial contract" support to dbt-core.
For other warehouses we may need to do something similar to how we get datatypes/column names for unit tests - creating a temp object with the "empty" query (where false limit 0) and then get its column names + types via another metadata query (like describe) to add back into the CREATE TABLE statement. Though, this could impact performance.
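A rough sketch of what that could look like on the warehouse side (object names are illustrative and the exact statements would vary by adapter):

-- build an "empty" relation that still carries the model's full schema
create or replace temporary view dim_customers__schema_probe as
select id, extra_col from table_a where false limit 0;

-- read back the column names and types to fold into the eventual CREATE TABLE
describe dim_customers__schema_probe;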
So maybe this is really a smaller request to have the constraints come through on the config even if the contract isn't enforced?
I created a new issue for this piece -> https://github.com/dbt-labs/dbt-core/issues/10219
The new one looks good, thank you.
Happy to help with any other required discussion around potential options to implement this one.
For other warehouses we may need to do something similar to how we get datatypes/column names for unit tests - creating a temp object with the "empty" query (where false limit 0) and then get its column names + types via another metadata query (like describe) to add back into the CREATE TABLE statement. Though, this could impact performance.
I just tried a few examples using this and from a query perspective it is very fast. It is definitely more total steps, and that adds up. From my perspective, the users that want to enable something like this partial contract are likely more concerned about getting that additional capability than about the additional processing time. This wouldn't impact every model, only the ones defined with it, so it's at least a conscious trade-off.
Thanks for the discussion.
+1
Is this your first time submitting a feature request?
Describe the feature
I would like to have the ability to use many of the features of a contract without being able to fully define a rigid contract. This should allow the ability to apply constraints like Primary and Foreign keys, not null, and other constraints on the columns that are provided but then allow for schema evolution to add new columns that are not governed by the contract.
Describe alternatives you've considered
Using post-hooks to alter tables and apply the constraints directly.
Moving constraints to be unrelated to contracts and making the contract flag related to enforcement and privacy rather than controlling application of table constraints
Who will this benefit?
Developers using dbt especially in earlier stages of data transformation that want to use constraints to define relationships and key columns but leave their model open to schema evolution for additional columns to be loaded
Are you interested in contributing this feature?
Possibly but may not be a great first one
Anything else?
Breaking this off from the discussion here: Infer Schema contract discussion