dbt-labs / docs.getdbt.com

Incremental Models - `unique_key` possibly has different behavior than stated #4355

Open fathom-parth opened 8 months ago

fathom-parth commented 8 months ago

Link to the page on docs.getdbt.com requiring updates

https://docs.getdbt.com/docs/build/incremental-models#defining-a-unique-key-optional

What part(s) of the page would you like to see updated?

Context

I'd like to use incremental models to replace all records with a specific timestamp (an ingest timestamp) with the new rows coming in.

When reading over the dbt incremental docs, it seems the merge strategy would be the best for this; however, this strategy requires a unique_key.

The Problem

In this case, there's no unique_key that I want to define per record; instead, I'd rather have the incremental strategy remove every row that shares a specific column value (in my case, an ingest timestamp).

I asked this question on Slack here: https://getdbt.slack.com/archives/CBSQTAPLG/p1698417729305499

Some people mentioned that they do this by defining a unique_key that isn't actually unique per record, and that this works fine for them.
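
For illustration, here's a rough sketch of what I understand that to look like (the model, column, and config values below are illustrative, not taken from anyone's actual project):

-- sketch of an incremental merge model keyed on a non-unique column
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='ingest_timestamp'
    )
}}

SELECT
    ingest_timestamp,
    col_1,
    col_2
FROM {{ ref('my_upstream_model') }}
{% if is_incremental() %}
-- re-select recent ingest batches so their rows replace what's already in the target
WHERE ingest_timestamp >= (SELECT MAX(ingest_timestamp) FROM {{ this }})
{% endif %}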

One of the users on Slack also mentioned:

The docs are misleading.. and the naming of the config is also confusing. I’m sure if you search slack for unique_key you’ll see many people mention what Daron and I suggested.

However, the documentation seems to conflict with the above suggestion by stating:

The optional unique_key parameter specifies a field (or combination of fields) that define the grain of your model. That is, the field(s) identify a single unique row. You can define unique_key in a configuration block at the top of your model, and it can be a single column name or a list of column names.

and

Please note that if there's a unique_key with more than one row in either the existing target table or the new incremental rows, the incremental model may fail depending on your database and incremental strategy.

This conflict is confusing to me, and I'd like to know whether the docs need to be updated or whether what those users are doing is undefined behavior.

Additional information

For reference, I'm using the bigquery adapter.

fathom-parth commented 7 months ago

@mirnawong1 Is this the right place for me to open this issue? Is there a different repo that would get this more traction? I'm trying to figure out whether my organization can use this feature or whether this is undefined behavior.

mirnawong1 commented 7 months ago

hey @fathom-parth this is def the right place. I'm going to consult with our developer experience team to see how to best address this for you. thank you for flagging this and apologies for the delay here!

dbeatty10 commented 7 months ago

Take a look at the insert_overwrite strategy

@fathom-parth If you are using dbt-bigquery, what you are describing sounds like a good use-case for the insert_overwrite strategy.
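
As a rough sketch (the model, column, and partition settings below are illustrative, not a recommendation for your exact setup), an insert_overwrite model on dbt-bigquery typically looks something like this:

-- sketch of an insert_overwrite incremental model on dbt-bigquery
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': 'ingest_timestamp',
            'data_type': 'timestamp',
            'granularity': 'day'
        }
    )
}}

SELECT
    ingest_timestamp,
    col_1
FROM {{ ref('my_upstream_model') }}
{% if is_incremental() %}
-- dbt replaces every partition that appears in this incremental result set
WHERE ingest_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
{% endif %}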

You can read commentary about the insert_overwrite strategy from other practitioners here:

Reports from other users on Slack

We'd need to learn more about their specific details before making updates to our documentation.

I don't know one way or the other how defining a unique_key that's not actually unique would behave without knowing which dbt adapter they are using and seeing an example of their source data + model configs + model definitions.

Our docs for dbt-snowflake say that:

Snowflake's merge statement fails with a "nondeterministic merge" error if the unique_key specified in your model config is not actually unique. If you encounter this error, you can instruct dbt to use a two-step incremental approach by setting the incremental_strategy config for your model to delete+insert.

So possibly they are using the delete+insert strategy with a platform that supports it. Or possibly the platform they are using has a merge statement that allows for non-unique keys.

fathom-parth commented 7 months ago

@dbeatty10 Thanks for all the links!

I did look into the insert_overwrite strategy but that wouldn’t work for us since that overwrites entire partitions.

We currently partition on the ingest timestamp column (a datetime); however, the partitioning is set to month granularity in accordance with BQ's best practices, so that we don't run into the maximum number of allowed partitions and we stay within BQ's guidelines for the optimal size per partition.

Overwriting an entire month of data would be expensive and far from ideal. Ideally, we'd only overwrite rows with the same exact timestamp.
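
For reference, here's a rough sketch of the kind of partitioning config I'm describing (the model and column names are illustrative):

-- sketch of our month-granularity partitioning setup (names are illustrative)
{{
    config(
        materialized='incremental',
        partition_by={
            'field': 'ingest_timestamp',
            'data_type': 'datetime',
            'granularity': 'month'
        }
    )
}}

-- with insert_overwrite, new rows for a single ingest_timestamp would force a
-- rewrite of the entire month partition that the timestamp falls into
SELECT
    ingest_timestamp,
    col_1
FROM {{ ref('my_upstream_model') }}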

Lmk if I’ve grossly misunderstood insert overwrite or if I’m missing something obvious!!

Edited at 2023-12-13 11:13am ET for clarity (wrote the above half-asleep)

daronjp commented 7 months ago

@dbeatty10 responding to your question from Slack:

Using the Snowflake adapter

-- my_model.sql
-- delete+insert removes every existing row whose data_date appears in the new
-- batch and then inserts the new rows, so data_date does not need to be unique
{{
    config(
        ...
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='data_date'
        ...
    )
}}

SELECT
    data_date,
    col_1,
    col_2
FROM {{ ref('my_intermediate_model') }}
WHERE
    ...
    {% if is_incremental() %}
        -- only reprocess the trailing 10 days on incremental runs
        AND data_date >= CURRENT_DATE - 10
    {% endif %}

fathom-parth commented 7 months ago

Interesting, I guess the BQ-specific documentation doesn't mention the unique_key needing to be unique: https://docs.getdbt.com/reference/resource-configs/bigquery-configs#the-merge-strategy

fathom-parth commented 7 months ago

Ah, though after looking through how the BQ adapter generates the merge statement (it seems to use the default merge statement from dbt-core) and how BQ handles merge statements, I'm pretty sure it'll fail if the merge predicate matches multiple rows.
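
If I'm reading it right, the generated statement looks roughly like this simplified sketch (table and column names are made up):

-- simplified sketch of the default merge statement dbt generates (names made up)
merge into `my_project.my_dataset.my_model` as DBT_INTERNAL_DEST
using (
    select * from `my_project.my_dataset.my_model__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.ingest_timestamp = DBT_INTERNAL_DEST.ingest_timestamp
when matched then update set
    col_1 = DBT_INTERNAL_SOURCE.col_1
when not matched then insert (ingest_timestamp, col_1)
    values (DBT_INTERNAL_SOURCE.ingest_timestamp, DBT_INTERNAL_SOURCE.col_1)

-- if ingest_timestamp isn't unique, more than one source row can match the same
-- target row, and BigQuery rejects the statement with an error along the lines of
-- "UPDATE/MERGE must match at most one source row for each target row"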

dbeatty10 commented 5 months ago

There are two different use cases in which unique_key is utilized today for incremental materializations:

  1. “merge on an actually unique key”
  2. “replace partitions”

For (1), the upstream source would benefit from adding unique and not_null checks, since the key must actually be unique for the merge to behave correctly.
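
For example, a minimal sketch of those checks in a properties file (the model and column names are illustrative):

# models/schema.yml -- illustrative sketch only
version: 2

models:
  - name: my_incremental_model
    columns:
      - name: unique_id
        tests:
          - unique
          - not_null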

For (2), there should not be any uniqueness check, because we'd expect many rows to share the same value for unique_key. For this use case, the name unique_key is a flagrant misnomer, of course 😱

It would be nice to formally distinguish between each of those use-cases, and https://github.com/dbt-labs/dbt-core/issues/9490 might be a good way to accomplish that.