fathom-parth opened this issue 8 months ago
@mirnawong1 Is this the right place for me to open this issue? Is there a different repo that would get this more traction? Trying to see if my organization can use this feature or if this is undefined behavior.
hey @fathom-parth this is def the right place. I'm going to consult with our developer experience team to see how to best address this for you. thank you for flagging this and apologies for the delay here!
@fathom-parth If you are using dbt-bigquery, what you are describing sounds like a good use case for the `insert_overwrite` strategy.
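To make that concrete, a minimal sketch of such a config might look like the following (model and column names here are just placeholders, not from your project):

```sql
-- models/my_events.sql -- hypothetical sketch of an insert_overwrite config on dbt-bigquery
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'ingest_ts',
      'data_type': 'datetime',
      'granularity': 'day'
    }
  )
}}

select
    ingest_ts,
    col_1,
    col_2
from {{ ref('my_intermediate_model') }}
{% if is_incremental() %}
-- only rebuild the partitions touched by recent data
where ingest_ts >= datetime_sub(current_datetime(), interval 3 day)
{% endif %}
```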
You can read commentary about the `insert_overwrite` strategy from other practitioners here:
We'd need to learn more about their specific details before making updates to our documentation.
I don't know one way or the other how defining a `unique_key` that's not actually unique would behave without knowing which dbt adapter they are using and seeing an example of their source data + model configs + model definitions.
Our docs for dbt-snowflake say that:
> Snowflake's `merge` statement fails with a "nondeterministic merge" error if the `unique_key` specified in your model config is not actually unique. If you encounter this error, you can instruct dbt to use a two-step incremental approach by setting the `incremental_strategy` config for your model to `delete+insert`.
So possibly they are using the `delete+insert` strategy with a platform that supports it. Or possibly the platform they are using has a `merge` statement that allows for non-unique keys.
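For context on why `delete+insert` can tolerate a non-unique key, the two-step approach boils down to SQL roughly like this (table and column names are made up for the sketch, and the details differ from dbt's actual generated statements):

```sql
-- Rough sketch of the two-step delete+insert approach (names are hypothetical)
delete from analytics.my_model
where data_date in (select distinct data_date from analytics.my_model__dbt_tmp);

insert into analytics.my_model (data_date, col_1, col_2)
select data_date, col_1, col_2
from analytics.my_model__dbt_tmp;
-- No row-by-row matching happens, so many rows can share the same "unique_key" value:
-- every existing row with that value is deleted and then re-inserted.
```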
@dbeatty10 Thanks for all the links!
I did look into the `insert_overwrite` strategy, but that wouldn't work for us since it overwrites entire partitions.

We currently partition on the ingest timestamp column (a datetime); however, the partitioning is set to a granularity of a month, in accordance with BQ's best practices, so we don't run into the maximum number of allowed partitions and we stay close to BQ's guidelines around optimal partition size.

Overwriting an entire month of data would be expensive and far from ideal. Ideally we'd only overwrite rows with the same exact timestamp.
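For concreteness, the partitioning described above corresponds to a config roughly like this (the column name is a placeholder):

```sql
-- Sketch of the current partitioning setup (column name is a placeholder)
{{
  config(
    materialized='incremental',
    partition_by={
      'field': 'ingest_ts',          -- ingest timestamp column (datetime)
      'data_type': 'datetime',
      'granularity': 'month'         -- month-level partitions to stay under BQ's partition limit
    }
  )
}}
```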
Lmk if I’ve grossly misunderstood insert overwrite or if I’m missing something obvious!!
Edited at 2023-12-13 11:13am ET for clarity (wrote the above half-asleep)
@dbeatty10 responding to your question from Slack:
Using the Snowflake adapter
-- my_model.sql
{{
config(
...
materialized='incremental',
incremental_strategy='delete+insert',
unique_key='data_date'
...
)
}}
SELECT
data_date,
col_1,
col_2
FROM {{ ref('my_intermediate_model') }}
WHERE
...
{% if is_incremental() %}
AND data_date >= CURRENT_DATE - 10
{% endif %}
Interesting, I guess the BQ-specific documentation doesn't mention the `unique_key` needing to be unique: https://docs.getdbt.com/reference/resource-configs/bigquery-configs#the-merge-strategy
Ah, though looking through the code for how the `merge` statement gets generated by the BQ adapter (it seems to use the default merge statement from dbt-core) and how BQ handles `merge` statements, I'm pretty sure it'll fail if the predicate matches multiple rows.
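For anyone following along, here's a simplified sketch of that kind of merge keyed on a non-unique column (names are hypothetical, and the details differ from dbt's actual generated SQL):

```sql
-- Simplified sketch of a BigQuery merge keyed on a non-unique column
-- (names are hypothetical; dbt's generated SQL differs in the details)
merge into `my-project.analytics.my_model` as dest
using `my-project.analytics.my_model__dbt_tmp` as src
on src.data_date = dest.data_date
when matched then
  update set col_1 = src.col_1, col_2 = src.col_2
when not matched then
  insert (data_date, col_1, col_2)
  values (src.data_date, src.col_1, src.col_2);
-- If more than one row in src matches a single row in dest (i.e. data_date is not unique),
-- BigQuery rejects the statement because a target row would be updated more than once.
```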
There are two different use cases in which `unique_key` is utilized today for incremental materializations:

1. As a truly unique identifier that matches each incoming row to at most one existing row (e.g., with the `merge` strategy).
2. As a non-unique value that identifies a whole set of rows to delete and replace (e.g., with the `delete+insert` strategy).
For (1), the upstream source would benefit from adding `unique` and `not_null` checks, since the key must be unique to behave correctly.
For (2), there should not be any uniqueness check, because we'd expect many rows to have the same value for `unique_key`. This case is a flagrant misnomer, of course 😱
It would be nice to formally distinguish between each of those use-cases, and https://github.com/dbt-labs/dbt-core/issues/9490 might be a good way to accomplish that.
Contributions
Link to the page on docs.getdbt.com requiring updates
https://docs.getdbt.com/docs/build/incremental-models#defining-a-unique-key-optional
What part(s) of the page would you like to see updated?
Context
I'd like to use incremental models to replace all records with a specific timestamp (an ingest timestamp) with the new rows coming in.
When reading over the dbt incremental docs, it seems the `merge` strategy would be the best for this; however, this strategy requires a `unique_key`.

The Problem
In this case, there's no `unique_key` that I want to define per record, and I'd rather have the incremental strategy remove every row related to a specific column value (in my case, an ingest timestamp).

I asked this question on Slack here: https://getdbt.slack.com/archives/CBSQTAPLG/p1698417729305499
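As a rough illustration of the pattern I'm asking about (column and model names are placeholders):

```sql
-- Rough sketch of the intended model (column and model names are placeholders)
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='ingest_ts'   -- intentionally NOT unique per record: many rows share an ingest timestamp
  )
}}

select
    ingest_ts,
    col_1,
    col_2
from {{ ref('my_source_model') }}
{% if is_incremental() %}
-- re-process the most recent ingest timestamp(s), replacing any rows already loaded for them
where ingest_ts >= (select max(ingest_ts) from {{ this }})
{% endif %}
```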
And some people mentioned that they do this by defining a `unique_key` that's not actually unique per record, and that this works fine for them.

One of the users on Slack also mentioned:
The documentation seems to conflict with the above suggestion, however, by stating:
and
This conflict is confusing to me and I'd like to know if the docs need to be updated or if the users' experience is undefined.
Additional information
For reference, I'm using the bigquery adapter.