[Feature] state:modified consider source column name changes as changes

dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

https://getdbt.com

Apache License 2.0

9.64k stars 1.6k forks source link

[Feature] state:modified consider source column name changes as changes #10020

Closed VDFaller closed 3 months ago

VDFaller commented 4 months ago

Is this your first time submitting a feature request?

[X] I have read the expectations for open source contributors
[X] I have searched the existing issues, and I could not find an existing issue for this feature
[X] I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently it doesn't appear that changes to source column names is detected.
I'm suggesting we detect at least if the column names change.
We could just add a same_columns method in ParsedSourceDefinition

def same_columns(self, old: 'ParsedSourceDefinition') -> bool:
    return sorted([c for c in old.columns]) == sorted([c for c in self.columns])

Describe alternatives you've considered

If more than name is required, might want to add a self.columns.same_contents method similar to self.config.same_contents.

Who will this benefit?

Maintainers trusting their CI Pipelines (maybe too much) when source columns change out from under them.

Are you interested in contributing this feature?

Yes

Anything else?

Slack Conversation

graciegoheen commented 4 months ago

Hi! Can you say more about what "source column name changes" means?

Are you imagining you have a yml file defining your source:

version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: id
          - name: status

Then, you add a new column to this yml definition:

version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: id
          - name: status
          - name: price

And you want the source('jaffle_shop', 'orders') to be selected by state:modified?

This is a bit odd because the source yml isn't the actual definition of what the schema of your source is. Even if you haven't defined a column in the source yml, but that column exists in the source table, you can still select it in your dbt code.

So knowing the yml has changed, doesn't actually tell you if the underlying warehouse object has changed. dbt does not "manage" your sources (i.e. it's not in charge of the DDL that creates your source tables; it's just defining metadata to create a pointer to your source table).

I view "state:modified" as a change that affects the underlying SQL that's running against the warehouse. So given that a change to the source yml doesn't actually change any of the SQL being run, I would be surprised for it to be selected by state:modified.

This would be different for external tables (cc: @dataders).

VDFaller commented 4 months ago

@graciegoheen Sorry for the lack of clarity.

My problem was that the source column name actually changed, and people didn't edit the SQL code because they didn't know it was used. A developer changed it in source but didn't change any of the SQL code downstream.

version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: id
          - name: status

version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: id
          - name: status_different

And our CI runs, -s state:modified+. I know that the yml isn't the end all be all of what's in the source, but when it changes I want to know. If it were to capture the column changes and then run downstream, I would be able to know that their job isn't done and they need to edit some SQL as well.

graciegoheen commented 3 months ago

Thanks for sharing that example! I still believe this shouldn't be marked as modified given that this change does not actually affect any of the underlying SQL that dbt is running so am going to close this issue as "not planned".

In your case, I would suggest you use staging models (docs here). If a column that's being used in your code is renamed in your source, you would have to update your staging model as well:

select
    id,
    status_different, --rename old status column
    ...
from {{ source('jaffle_shop', 'orders') }}

This model would then be marked as modified, and you could see if anything else needs downstream to be updated as well with a CI run using -s state:modified+.

VDFaller commented 3 months ago

I'm 100% using staging models. The problem is that they didn't change them. I agree if they would have, it would have worked. But they didn't and this didn't catch it. If you have any workarounds, I'd love to hear them.

graciegoheen commented 3 months ago

Here's how I'm thinking about it. There's two types of changes that could break you dbt models:

changes to warehouse objects that dbt depends on (example: a source column is renamed)
changes to the SQL dbt is executing

The focus of CI is for validating that the latter doesn't break anything in your project. Updating your source yml file doesn't actually change any of the underlying SQL, it's a "documentation" change only. The types of changes that are selected by state:modified are those that impact the SQL dbt is executing (change to the actual SQL of a model, change to the materialization strategy, etc.). Similarly, we wouldn't expect a change to a description to cause a model to be selected as "modified".

version: 2

models:
  - name: model_name
    description: my new description # adding or updating description shouldn't flag model_name as modified

Perhaps you could add a test to your source (like expect_table_columns_to_match_set from dbt-expectations package), so that the test would fail if you have schema drift of your sources. Then you could add a step to your CI job to first test all of your sources:

dbt test --select "source:*"
dbt build -s state:modified+

Though changes to warehouse objects that dbt depends on may happen at any time, so it would technically be unrelated to the PR you've opened.

VDFaller commented 3 months ago

I wouldn't expect description to matter obviously. And if I add a test to source it still wouldn't fix the problem because they'd just fix that test and still not propagate changes to the downstream models. I could add a comment reminding people but that seems antithetical to the purpose of the pipeline. By your argument NOTHING that affects the source would affect the sql model so nothing should be in. But in the code it states

        # config changes are changes (because the only config is "enabled", and
        # enabling a source is a change!)
        # changing the database/schema/identifier is a change
        # messing around with external stuff is a change (uh, right?)
        # quoting changes are changes
        # freshness changes are changes, I guess
        # metadata/tags changes are not "changes"
        # patching/description changes are not "changes"

if quoting changes are changes, why would column names not be? Seems like it's quite parallel in thought process. Freshness has nothing to do with if the dbt sql runs. So I'd disagree with your assertion.

graciegoheen commented 3 months ago

Quoting does change the SQL that's being executed, because it impacts how the source macro will be resolved.

So when you reference a source in a model:

{{ source('jaffle_shop', 'orders') }}

That will resolve differently based on how you have configured quoting:

# if quoting is true for database, schema, and identifier
"raw"."jaffle_shop"."orders"

# if quoting is true for just database and schema (false for identifier)
"raw"."jaffle_shop".orders

Similarly, freshness can impact the SQL that's run when you execute the dbt source freshness command. For example let's say you add a filter to your freshness config:

version: 2

sources:
  - name: jaffle_shop
    database: raw
    loaded_at_field: _etl_loaded_at
    tables:
      - name: orders
        freshness: # make this a little more strict
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
          # Apply a where clause in the freshness query
          filter: datediff('day', _etl_loaded_at, current_timestamp) < 2

that filter will be added to the freshness query:

select
  max(_etl_loaded_at) as max_loaded_at,
  convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders

where datediff('day', _etl_loaded_at, current_timestamp) < 2

VDFaller commented 3 months ago

fair enough. Thanks for at least looking at it.