brooklyn-data / dbt_artifacts

A dbt package for modelling dbt metadata. https://brooklyn-data.github.io/dbt_artifacts
Apache License 2.0

Add config blocks for the "source" models #391

Open jsnb-devoted opened 12 months ago

jsnb-devoted commented 12 months ago

Overview

Adds config blocks to the upstream "source" models according to the dbt project file.
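For illustration, the kind of in-file config block this PR adds looks roughly like this (a sketch only; the actual diff pins the settings from this package's dbt_project.yml, and the file name here is just an example):

    -- models/sources/invocations.sql (illustrative)
    {{
        config(
            materialized='incremental',
            on_schema_change='append_new_columns'
        )
    }}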

Update type - breaking / non-breaking

What does this solve?

I've been working on standardizing model configs in my project, and I set the default materialization to table in my dbt_project.yml. That default appears to have overridden the materialization of these "source" models, so subsequent runs had been doing a CREATE OR REPLACE and wiping out all of the data without my knowledge. At least, that is what I think happened. I copied the project config settings from this repo, including the sources block, like so:

  dbt_artifacts:
    +schema: dbt_artifacts
    staging:
      +schema: dbt_artifacts
    sources:
      +materialized: incremental
      +on_schema_change: append_new_columns

and I confirmed that my dbt runs now look like:

21:19:30  1 of 34 START sql incremental model dbt_artifacts.exposures .................... [RUN]                                                                                                     
21:19:30  2 of 34 START sql incremental model dbt_artifacts.invocations .................. [RUN]                                                                                                     
...

instead of what they used to look like:

[2023-10-02, 21:02:38 UTC] {dbt_hook.py:191} INFO - 21:02:38  1 of 34 START sql table model dbt_artifacts.exposures .......................... [RUN]
[2023-10-02, 21:02:38 UTC] {dbt_hook.py:191} INFO - 21:02:38  2 of 34 START sql table model dbt_artifacts.invocations ........................ [RUN]

Outstanding questions

Is this actually the expected behavior of dbt_project configs? I would have thought that, because the materialization for the sources block was set in this package's own project config, I wouldn't be able to override it the way I did.

Is there anything lost by setting these as project configs? I don't feel strongly about this change; I'm just throwing it out there because it would have saved me from losing all my data.

What databases have you tested with?

jared-rimmer commented 12 months ago

Hi @jsnb-devoted thanks for opening this PR. I just wanted to clarify my understanding on this.

In your dbt_project.yml file you have configured the materialisation for dbt_artifacts models?

jsnb-devoted commented 12 months ago

> Hi @jsnb-devoted thanks for opening this PR. I just wanted to clarify my understanding on this.
>
> In your dbt_project.yml file you have configured the materialisation for dbt_artifacts models?

Hi @jared-rimmer -- This was my project config:

models:
  +persist_docs:
    relation: true
    columns: true
  devoted:
     <... all my project configs> 
  dbt_artifacts:
    +schema: dbt_artifacts
    staging:
      +schema: dbt_artifacts

Then I made table the default materialization for my project, so it looked like this:

models:
  +materialized: table
  +transient: true
  devoted:
     <... all my project configs> 
  dbt_artifacts:
    +schema: dbt_artifacts
    staging:
      +schema: dbt_artifacts

And my understanding is that this overrode the "source" models' materialization set in the dbt_artifacts dbt_project.yml, and dbt started dropping and recreating those tables. I didn't notice for long enough that the Snowflake Time Travel window expired, and I lost all the data.

I assumed that the project config in dbt_artifacts couldn't be overridden like that. Assuming it's not just user error on my part, I figure you probably don't want this happening to anyone: you don't want those models to ever have their materialization overridden.
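In hindsight, one way to keep the table default without touching the package's models would have been to scope it under my own project key instead of setting it globally (a sketch based on the layout above):

    models:
      devoted:
        +materialized: table
        +transient: true
        <... all my project configs>
      dbt_artifacts:
        +schema: dbt_artifacts
        staging:
          +schema: dbt_artifacts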

jsnb-devoted commented 11 months ago

bump - @jared-rimmer any thoughts on this?

jsnb-devoted commented 11 months ago

Anyone? Maybe @jaypeedevlin, since it looks like you implemented the original feature? Happy to close this if it was just user error or if there isn't any interest.

hash0b commented 11 months ago

The current documentation has this note about configurations; maybe it addresses your query.

> Note that model materializations and on_schema_change configs are defined in this package's dbt_project.yml, so do not set them globally in your dbt_project.yml (see docs on configuring packages):
>
> Configurations made in your dbt_project.yml file will override any configurations in a package (either in the dbt_project.yml file of the package, or in config blocks).

On a side note, with 2.6.1 I get an error on dbt run -s dbt_artifacts, and I had to add incremental_strategy to my dbt_project.yml. I am still checking whether I am missing something:

    Compilation Error in model model_executions (models/sources/model_executions.sql)
      Invalid incremental strategy provided: None
        Expected one of: 'append', 'merge', 'insert_overwrite'

      > in macro dbt_spark_validate_get_incremental_strategy (macros/materializations/incremental/validate.sql)
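
A minimal sketch of that workaround, assuming a Spark/Databricks adapter (the valid strategy names are the ones the error lists; pick whichever your warehouse supports):

    models:
      dbt_artifacts:
        +incremental_strategy: merge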

jsnb-devoted commented 11 months ago

> The current documentation has this note about configurations; maybe it addresses your query.
>
> Note that model materializations and on_schema_change configs are defined in this package's dbt_project.yml, so do not set them globally in your dbt_project.yml (see docs on configuring packages):

This is a good call-out @hash0b. I did not see this note in the README; had I read it, I probably wouldn't have lost all that data.

I don't feel that strongly about it, but there doesn't seem to be any benefit to letting someone override the materialization for these particular "source"-style models, and you can prevent that from happening by hardcoding the materialization like I did in the PR. It's no longer an issue for me, so it's really up to the maintainers @jared-rimmer @jaypeedevlin.

jaypeedevlin commented 11 months ago

Hi all,

Just to add here that I no longer work for Brooklyn Data, so I'm not actively involved in maintaining this package.