Closed VDFaller closed 3 months ago
Hi! Can you say more about what "source column name changes" means?
Are you imagining you have a yml file defining your source:
version: 2
sources:
- name: jaffle_shop
tables:
- name: orders
columns:
- name: id
- name: status
Then, you add a new column to this yml definition:
version: 2
sources:
- name: jaffle_shop
tables:
- name: orders
columns:
- name: id
- name: status
- name: price
And you want the source('jaffle_shop', 'orders')
to be selected by state:modified
?
This is a bit odd because the source yml isn't the actual definition of what the schema of your source is. Even if you haven't defined a column in the source yml, but that column exists in the source table, you can still select it in your dbt code.
So knowing the yml has changed, doesn't actually tell you if the underlying warehouse object has changed. dbt does not "manage" your sources (i.e. it's not in charge of the DDL that creates your source tables; it's just defining metadata to create a pointer to your source table).
I view "state:modified
" as a change that affects the underlying SQL that's running against the warehouse. So given that a change to the source yml doesn't actually change any of the SQL being run, I would be surprised for it to be selected by state:modified
.
This would be different for external tables (cc: @dataders).
@graciegoheen Sorry for the lack of clarity.
My problem was that the source column name actually changed, and people didn't edit the SQL code because they didn't know it was used. A developer changed it in source but didn't change any of the SQL code downstream.
version: 2
sources:
- name: jaffle_shop
tables:
- name: orders
columns:
- name: id
- name: status
to
version: 2
sources:
- name: jaffle_shop
tables:
- name: orders
columns:
- name: id
- name: status_different
And our CI runs, -s state:modified+
. I know that the yml isn't the end all be all of what's in the source, but when it changes I want to know. If it were to capture the column changes and then run downstream, I would be able to know that their job isn't done and they need to edit some SQL as well.
Thanks for sharing that example! I still believe this shouldn't be marked as modified
given that this change does not actually affect any of the underlying SQL that dbt is running so am going to close this issue as "not planned".
In your case, I would suggest you use staging models (docs here). If a column that's being used in your code is renamed in your source, you would have to update your staging model as well:
select
id,
status_different, --rename old status column
...
from {{ source('jaffle_shop', 'orders') }}
This model would then be marked as modified, and you could see if anything else needs downstream to be updated as well with a CI run using -s state:modified+
.
I'm 100% using staging models. The problem is that they didn't change them. I agree if they would have, it would have worked. But they didn't and this didn't catch it. If you have any workarounds, I'd love to hear them.
Here's how I'm thinking about it. There's two types of changes that could break you dbt models:
The focus of CI is for validating that the latter doesn't break anything in your project. Updating your source yml file doesn't actually change any of the underlying SQL, it's a "documentation" change only. The types of changes that are selected by state:modified
are those that impact the SQL dbt is executing (change to the actual SQL of a model, change to the materialization strategy, etc.). Similarly, we wouldn't expect a change to a description to cause a model to be selected as "modified".
version: 2
models:
- name: model_name
description: my new description # adding or updating description shouldn't flag model_name as modified
Perhaps you could add a test to your source (like expect_table_columns_to_match_set from dbt-expectations package), so that the test would fail if you have schema drift of your sources. Then you could add a step to your CI job to first test all of your sources:
dbt test --select "source:*"
dbt build -s state:modified+
Though changes to warehouse objects that dbt depends on may happen at any time, so it would technically be unrelated to the PR you've opened.
I wouldn't expect description to matter obviously. And if I add a test to source it still wouldn't fix the problem because they'd just fix that test and still not propagate changes to the downstream models. I could add a comment reminding people but that seems antithetical to the purpose of the pipeline. By your argument NOTHING that affects the source would affect the sql model so nothing should be in. But in the code it states
# config changes are changes (because the only config is "enabled", and
# enabling a source is a change!)
# changing the database/schema/identifier is a change
# messing around with external stuff is a change (uh, right?)
# quoting changes are changes
# freshness changes are changes, I guess
# metadata/tags changes are not "changes"
# patching/description changes are not "changes"
if quoting changes are changes, why would column names not be? Seems like it's quite parallel in thought process. Freshness has nothing to do with if the dbt sql runs. So I'd disagree with your assertion.
Quoting does change the SQL that's being executed, because it impacts how the source macro will be resolved.
So when you reference a source in a model:
{{ source('jaffle_shop', 'orders') }}
That will resolve differently based on how you have configured quoting:
# if quoting is true for database, schema, and identifier
"raw"."jaffle_shop"."orders"
# if quoting is true for just database and schema (false for identifier)
"raw"."jaffle_shop".orders
Similarly, freshness can impact the SQL that's run when you execute the dbt source freshness
command. For example let's say you add a filter to your freshness config:
version: 2
sources:
- name: jaffle_shop
database: raw
loaded_at_field: _etl_loaded_at
tables:
- name: orders
freshness: # make this a little more strict
warn_after: {count: 6, period: hour}
error_after: {count: 12, period: hour}
# Apply a where clause in the freshness query
filter: datediff('day', _etl_loaded_at, current_timestamp) < 2
that filter will be added to the freshness query:
select
max(_etl_loaded_at) as max_loaded_at,
convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders
where datediff('day', _etl_loaded_at, current_timestamp) < 2
fair enough. Thanks for at least looking at it.
Is this your first time submitting a feature request?
Describe the feature
Currently it doesn't appear that changes to source column names is detected.
I'm suggesting we detect at least if the column names change.
We could just add a
same_columns
method in ParsedSourceDefinitionDescribe alternatives you've considered
If more than name is required, might want to add a
self.columns.same_contents
method similar toself.config.same_contents
.Who will this benefit?
Maintainers trusting their CI Pipelines (maybe too much) when source columns change out from under them.
Are you interested in contributing this feature?
Yes
Anything else?
Slack Conversation