dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-1641] [CT-1622] [Bug] Insert overwrite incremental models using create or replace table #424

Open roycefp opened 1 year ago

roycefp commented 1 year ago

### Is this a new bug in dbt-core?

### Current Behavior

When I run an incremental model using the `insert_overwrite` strategy with `require_partition_filter=true` and static partitions, dbt runs a `create or replace table` statement whose query is equivalent to `select * from <table>`.

This returns an error: `Cannot query over table XX without a filter over column(s) XX that can be used for partition elimination`

I observed this for several models, but others work fine, and there is no difference in how these models are built.

### Expected Behavior

### Steps To Reproduce

I'll exclude the main body of the model for brevity, but the config params are:

{{
  config(
      materialized='incremental',
      incremental_strategy = "insert_overwrite",
      cluster_by="country",
      partition_by = {
          "field": "report_date",
          "data_type": "date",
          "granularity": "day"
      },
      partitions = ["DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)", "DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY)"],
      require_partition_filter=true,
      on_schema_change="sync_all_columns"
  )
}}

After running the model, I looked at the logs:

  1. The first query created the tmp table correctly, although it's not used.
  2. The second query ran a `create or replace table` instead of a merge.

/ {"app": "dbt", "dbt_version": "1.1.2", "profile_name": "bigquery", "target_name": "prod", "node_id": "model_name"} /

create or replace table model_name
partition by report_date
cluster by country
OPTIONS(
  description="""table description""",
  labels=[('label')],
  require_partition_filter=True
)
as (

  select
      (all columns)
  from `model_name`

);

The error is returned because the `create or replace` statement does not use a partition filter, but the statement is also incorrect in the first place: it should be a merge statement (see the sketch below).
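For contrast, here is a rough sketch of the kind of statement the `insert_overwrite` strategy with static partitions is expected to generate instead (table and column names are placeholders inferred from the config above, not actual dbt output):

merge into `model_name` as DBT_INTERNAL_DEST
using (
    select * from `model_name__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on FALSE

-- delete the rows in the statically listed partitions...
when not matched by source
    and DBT_INTERNAL_DEST.report_date in (
        date_sub(current_date, interval 1 day),
        date_sub(current_date, interval 2 day)
    )
    then delete

-- ...then insert every row from the tmp table
when not matched
    then insert (report_date, country /* , ... */)
    values (report_date, country /* , ... */)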

### Relevant log output

_No response_

### Environment

```markdown
- OS: Big Sur 11.6
- Python: 3.8.0
- dbt: 1.1.2
```

### Which database adapter are you using with dbt?

bigquery

### Additional Context

_No response_

dbeatty10 commented 1 year ago

Thanks for reporting this @roycefp !

I took a first shot at reproducing the `Cannot query over table ...` error you saw, but I wasn't able to produce an error.

Can you try the code I have below and see if it generates an error on your side?

If it works without error, are you able to tweak it so that it does trigger the error?

models/incremental_424.sql

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        cluster_by="country",
        partition_by={
            "field": "report_date",
            "data_type": "date",
            "granularity": "day"
        },
        partitions=["DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)", "DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY)"],
        require_partition_filter=true,
        on_schema_change="sync_all_columns"
    )
}}

with data as (

    {% if not is_incremental() %}

        select 1 as country, cast('2022-12-13' as date) as report_date union all
        select 2 as country, cast('2022-12-13' as date) as report_date union all
        select 3 as country, cast('2022-12-13' as date) as report_date union all
        select 4 as country, cast('2022-12-13' as date) as report_date

    {% else %}

        -- we want to overwrite the 4 records in the 2022-12-13 partition
        -- with the 2 records below, but add two more in the 2022-12-14 partition
        select 10 as country, cast('2022-12-13' as date) as report_date union all
        select 20 as country, cast('2022-12-13' as date) as report_date union all
        select 30 as country, cast('2022-12-14' as date) as report_date union all
        select 40 as country, cast('2022-12-14' as date) as report_date

    {% endif %}

)

select * from data

Run the following in order to do two iterations of the incremental model:

dbt run -s incremental_424 --full-refresh
dbt run -s incremental_424

I used dbt-core=1.1.2 and dbt-bigquery=1.1.1.

Fleid commented 1 year ago

Hi @roycefp, since we can't reproduce this, I will close it for now. Please comment below if needed and we will re-open.

brebertus21 commented 1 year ago

Hi @dbeatty10, it looks like we're running into the same issue. With some further debugging, it appears that `is_incremental()` is returning false when we'd expect it to return true. Specifically, `relation` is none in the logic here. Similar to the example above, this only affects one model out of several incremental models.

Wondering if you have any tips on how relation is being determined here so we can dig deeper?

Currently running: dbt-core==1.2.0 and dbt-bigquery==1.2.0

Fleid commented 1 year ago

Re-opening! @dbeatty10 do you still have context, or do you want me to take over?

dbeatty10 commented 1 year ago

@Fleid you can take over!

Here's the only other context that might be helpful: it looks like `is_incremental()` uses `adapter.get_relation()`. https://github.com/dbt-labs/dbt-core/issues/7024 is another recent issue involving an `is_{something}` macro and `adapter.get_relation`. @mikealfare was recently troubleshooting that one, so it might be worth asking him whether there are similarities here.
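For reference, the macro in question looks roughly like this (paraphrased from dbt-core 1.x; a sketch, not the exact source):

{% macro is_incremental() %}
    {#-- never true at parse time --#}
    {% if not execute %}
        {{ return(False) }}
    {% else %}
        {#-- the lookup that appears to be returning none --#}
        {% set relation = adapter.get_relation(this.database, this.schema, this.table) %}
        {{ return(relation is not none
                  and relation.type == 'table'
                  and model.config.materialized == 'incremental'
                  and not should_full_refresh()) }}
    {% endif %}
{% endmacro %}

So if `adapter.get_relation()` comes back as none, dbt treats the target table as not existing yet and rebuilds it from scratch, which matches the `create or replace table` behavior reported above.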

Fleid commented 1 year ago

So I will ask Mike to take a look at it, but he's super busy right now. @brebertus21, you can follow up on the conversation in the issue Doug linked to. It's a complex one that we're having a tough time getting to the bottom of, which means it will take time for us to resolve this. In the meantime, do you have a workaround?

brebertus21 commented 1 year ago

Hey @Fleid, thanks for following up. I did some digging on this over the past few days. I was able to reduce the failing table down to something like

SELECT
  DATE("2023-03-10") as dt,
  "id" as user_id

and was still seeing issues. Eventually I changed the name of the ref itself from user_*.sql to bug_user.sql and, amazingly, that fixed the issue for this ref only. I then tried materializing this in a different schema and everything ran successfully (including some downstream refs that were still failing with the same issue even with the rename workaround in place).

This is in a development schema that my team has been using for almost a year and that is, at this point, cluttered with old products. My hypothesis is that when dbt queries the schema metadata at query-render time, this operation times out (or similar) on the large schema, but that error is being caught and none is returned.
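If it helps anyone test that hypothesis, the same lookup can be exercised in isolation with a tiny macro around `adapter.get_relation()` (a hedged sketch; the macro name and file path are made up for illustration):

{# macros/debug_get_relation.sql #}
{% macro debug_get_relation(database, schema, identifier) %}
    {% set rel = adapter.get_relation(database=database, schema=schema, identifier=identifier) %}
    {# rel renders as none when the table is missing or the lookup fails #}
    {% do log("get_relation returned: " ~ rel, info=true) %}
{% endmacro %}

Invoked with, e.g.:

dbt run-operation debug_get_relation --args '{database: my-project, schema: my_schema, identifier: bug_user}'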

brebertus21 commented 1 year ago

It was suspicious that the ref name made any difference, until I got to thinking that maybe there is some inherent ordering in the metadata dbt is scanning.
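If dataset size is a factor, one cheap way to gauge it is to count the objects the metadata scan has to cover (standard BigQuery `INFORMATION_SCHEMA`; the project and dataset names below are placeholders):

select table_type, count(*) as n
from `my-project.my_dataset.INFORMATION_SCHEMA.TABLES`
group by table_type
order by n desc;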

mikealfare commented 1 year ago

I'm assigning this to myself for now to keep it on my radar, though there are things ahead of it, as @Fleid points out. From a very cursory scan, it does appear related to the other `adapter.get_relation()` issue that we saw.

roycefp commented 1 year ago

hello @mikealfare / @Fleid! Following up on this to check if you have any updates, as we recently ran into the same issue again :)

Our interim solution has been to set `require_partition_filter=false` just so that we're not forced into an error (config sketch below), but that's obviously a workaround rather than a fix, since we're not familiar with the root cause.
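For anyone else hitting this, the interim workaround is a one-flag change in the model config (sketch only; the rest of the config matches the original report):

{#
  interim workaround: disabling require_partition_filter avoids the
  "Cannot query over table ..." error, at the cost of removing BigQuery's
  guard against unfiltered full scans
#}
{{
  config(
      materialized='incremental',
      incremental_strategy='insert_overwrite',
      partition_by={
          "field": "report_date",
          "data_type": "date",
          "granularity": "day"
      },
      require_partition_filter=false
  )
}}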

should we be watching for this feature as a fix? https://github.com/dbt-labs/dbt-utils/issues/833

cc: @shaznishoffie @chinwee-foodpanda

Fleid commented 1 year ago

Yes @roycefp - the dbt-utils issue is the one to follow. Sorry about the delay ><

dbeatty10 commented 12 months ago

@roycefp We don't actually know for certain if https://github.com/dbt-labs/dbt-utils/issues/833 is related to https://github.com/dbt-labs/dbt-bigquery/issues/424 or not!

Our current barrier here is that we don't yet have a reproducible test case ("reprex"). Do you have any insights into how we might be able to reproduce the situation you are seeing?

We're in a holding pattern until we can reproduce this one 😢

Labeling this as "awaiting_response" as an indicator that we're seeking help to reproduce this.

github-actions[bot] commented 9 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 8 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

Haebuk commented 6 months ago

hey, I can reproduce the error with @dbeatty10's example:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        cluster_by="country",
        partition_by={
            "field": "report_date",
            "data_type": "datetime",
            "granularity": "day"
        },
        partitions=["DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)", "DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY)"],
        require_partition_filter=true,
        on_schema_change="sync_all_columns"
    )
}}

with data as (

    {% if not is_incremental() %}

        select 1 as country, cast('2022-12-13 11:00:00' as datetime) as report_date union all
        select 2 as country, cast('2022-12-13 11:00:00' as datetime) as report_date union all
        select 3 as country, cast('2022-12-13 11:00:00' as datetime) as report_date union all
        select 4 as country, cast('2022-12-13 11:00:00' as datetime) as report_date

    {% else %}

        -- we want to overwrite the 4 records in the 2022-12-13 partition
        -- with the 2 records below, but add two more in the 2022-12-14 partition
        select 10 as country, cast('2022-12-13 11:00:00' as datetime) as report_date union all
        select 20 as country, cast('2022-12-13 11:00:00' as datetime) as report_date union all
        select 30 as country, cast('2022-12-14 11:00:00' as datetime) as report_date union all
        select 40 as country, cast('2022-12-14 11:00:00' as datetime) as report_date

    {% endif %}

)

select * from data

As you can see, the original example works, but after converting `date` to `datetime`, it fails:

Query error: Cannot query over table '<table>' without a filter over column(s) 'report_date' that can be used for partition elimination at [10:5]
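One detail worth noting in this reprex (an observation, not a confirmed root cause): after the change, the partition column is `datetime`, but the static `partitions` entries still evaluate to DATE values. A type-aligned variant of that config line would look something like this (untested sketch):

partitions=[
    "datetime_sub(datetime_trunc(current_datetime(), day), interval 1 day)",
    "datetime_sub(datetime_trunc(current_datetime(), day), interval 2 day)"
],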
dbeatty10 commented 6 months ago

Thanks for sharing that reprex in https://github.com/dbt-labs/dbt-bigquery/issues/424#issuecomment-2060591142, @Haebuk!

When I tried running this command, I saw the same error message that both you and the original poster mentioned:

dbt run -s incremental_424

So I'm going to re-open this issue.

I'm still not sure if this is unexpected behavior or not, so I'm going to leave the triage label on for us to take a closer look.

Were you able to find any way to work around this?

Haebuk commented 5 months ago

@dbeatty10 sorry for the late reply, we couldn't find a solution :(