brooklyn-data / dbt_artifacts

A dbt package for modelling dbt metadata. https://brooklyn-data.github.io/dbt_artifacts
Apache License 2.0

[Feature]: Snapshots of "source" tables? #410

Open codigo-ergo-sum opened 7 months ago

codigo-ergo-sum commented 7 months ago

Overview

I'm not sure if this is really a feature request or if I'm misunderstanding something about how this (very helpful and much appreciated) package works.

The "source" tables (e.g. models, tests, etc.) where the dbt execution info is loaded via INSERT INTO statements are defined as incremental models. I understand on one level why this is done - if you just used a standard table materialization then every time the model was run the data would get removed, but the issue for us is that we run --full-refresh on our existing dbt project fairly often, i.e. once a week or so. We do this because we have a number of incremental models which cut down on query processing time and costs but occasionally there are some really late-arriving facts or other sneaky data cases that sneak past our incremental windowing filters, so just to be safe it's easier for us to do a full trunc and reload on our incremental tables every so often.

But... this causes a problem with this package, because it means we'll lose all of the historical invocation information every time we do a --full-refresh. Yes, we could add exclusions to our dbt job invocations in our dev and prod environments, like --exclude dbt-artifacts (see the sketch below), but that's fragile and vulnerable to somebody forgetting to keep that exclusion in place when they're editing dbt job parameters in dbt Cloud. We want to keep all of that history over time and not have to worry about losing our entire dbt invocation history just because somebody accidentally configured a job wrong or didn't understand.
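
To make the fragility concrete, the weekly rebuild job would have to remember to carry something like the flag below every time it's edited (the package: selector spelling is my assumption of how the exclusion would be written):

```shell
# Weekly rebuild that skips the dbt_artifacts models so their accumulated
# history survives. Forget the --exclude and the incremental upload tables
# are dropped and rebuilt, losing the invocation history.
dbt run --full-refresh --exclude package:dbt_artifacts
```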

Am I understanding the situation correctly here? Wouldn't others also want to ensure that they keep the full invocation history even when running a full refresh? And in that case, wouldn't dbt snapshots be the proper method to do this? That is: keep the source tables incremental (or make them plain table materializations), but put a snapshot on top of them to capture and persist the data, something like the sketch below.
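
For illustration, a hedged sketch of what such a snapshot might look like, assuming one of the package's upload tables is a model called model_executions with a unique model_execution_id column; both identifiers are placeholders rather than confirmed names from the package:

```sql
{% snapshot dbt_artifacts_model_executions_snapshot %}

-- unique_key below is a placeholder surrogate key; substitute whatever
-- uniquely identifies a row in the real upload table.
{{
    config(
        target_schema='dbt_artifacts_history',
        unique_key='model_execution_id',
        strategy='check',
        check_cols='all'
    )
}}

-- The upload tables are append-only, so this snapshot simply accumulates rows.
-- A later --full-refresh of the underlying incremental table doesn't remove
-- anything here, because dbt snapshots ignore hard deletes by default.
select * from {{ ref('model_executions') }}

{% endsnapshot %}
```

And since dbt doesn't rebuild snapshots on --full-refresh, the history in the snapshot would survive even if someone misconfigures a job.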

Let me know if I've misunderstood the intention and use case here; maybe I'm just not "getting" it.

What problem would this solve?

It would ensure persistence of dbt artifact data over time regardless of how other incremental models are being used and refreshed in the project.

Would you be willing to contribute?

Yes, but I'm not able to at this exact moment.