dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[Bug] --empty flag not working on Pseudo-columns #1243

Open christopherekfeldt opened 5 months ago

christopherekfeldt commented 5 months ago

Is this a new bug in dbt-bigquery?

Current Behavior

When trying out the --empty flag on my models, I get failures on all models that use the pseudo-column "_PARTITIONTIME" in their logic. Here is my query; it worked perfectly fine prior.

{% set src_cpc_raw = source('customer_preference_center', 'customer_preference') -%}

select
    customerId,
    customerId_token,
    preference,
    preferenceInd,
    createTS,
    updateTS,
    operator,
    ingstn_ts,
    ingestion_dt
from (
    select
        customerId,
        customerId_token,
        centralPreferences.preference,
        centralPreferences.preferenceInd,
        parse_timestamp("%Y-%m-%d %k:%M:%E6S", createTS) as createTS,
        parse_timestamp("%Y-%m-%d %k:%M:%E6S", updateTS) as updateTS,
        centralPreferences.operator,
        ingstn_ts,
        _PARTITIONTIME as ingestion_dt
    from
        {{ src_cpc_raw }},
        unnest(centralPreferences) as centralPreferences
)
{% if is_incremental() %}
    where date(ingestion_dt) >= date_sub("{{ latest_partition_filter(src_cpc_raw) }}", interval 1 day)
{% endif %}
qualify row_number() over (partition by customerId_token, preference order by updateTS desc, ingstn_ts desc) = 1

But now dbt has swapped out the source reference for a subquery that doesn't take the pseudo-column into consideration:

/* {"app": "dbt", "dbt_version": "1.8.0", "profile_name": "etlapp", "target_name": "lab", "node_id": "model.batch_framework_module.harm_customer_preference_center__centralpreference"} */   

    create or replace table `ad25-p-datalab-fg2h`.`dbt_christopher`.`harm_customer_preference_center__centralpreference`

    cluster by ingstn_ts

    OPTIONS(
      description="""Incremental model for central preferences""",

      labels=[('batchfw_status', 'managed')]
    )
    as (
      select
    customerId,
    customerId_token,
    preference,
    preferenceInd,
    createTS,
    updateTS,
    operator,
    ingstn_ts,
    ingestion_dt
from (
    select
        customerId,
        customerId_token,
        centralPreferences.preference,
        centralPreferences.preferenceInd,
        parse_timestamp("%Y-%m-%d %k:%M:%E6S", createTS) as createTS,
        parse_timestamp("%Y-%m-%d %k:%M:%E6S", updateTS) as updateTS,
        centralPreferences.operator,
        ingstn_ts,
        _PARTITIONTIME as ingestion_dt
    from
        (select * from `ab73-np-rawlay-dev-3324`.`customer_preference_center`.`customer_preference` where false limit 0),
        unnest(centralPreferences) as centralPreferences
)

qualify row_number() over (partition by customerId_token, preference order by updateTS desc, ingstn_ts desc) = 1
    );

This produces the error in BigQuery: "Unrecognized name: _PARTITIONTIME at [37:9]"

Expected Behavior

I expect the injected --empty subquery to preserve pseudo-columns as well.

Steps To Reproduce

  1. Use similar SQL logic (a model that selects _PARTITIONTIME from a source).
  2. Run dbt build -s model_name --empty
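
For reference, a minimal model that reproduces the failure might look like the following (the source and column names are hypothetical; any ingestion-time-partitioned table should do):

```sql
-- Hypothetical minimal model: selects the _PARTITIONTIME pseudo-column
-- from an ingestion-time-partitioned source table.
select
    id,
    _PARTITIONTIME as ingestion_dt
from {{ source('my_source', 'my_ingestion_table') }}
-- Running `dbt build -s model_name --empty` wraps the source in
-- `(select * from ... where false limit 0)`, which drops _PARTITIONTIME
-- and fails with "Unrecognized name: _PARTITIONTIME".
```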

Relevant log output

No response

Environment

- OS: ubuntu:rolling
- Python: 3.9.11
- dbt-core: 1.8.0
- dbt-bigquery: 1.8.0

Additional Context

No response

jtcohen6 commented 5 months ago

@christopherekfeldt Thanks for the report!

The mechanism we're using for --empty is to wrap the source() and ref() calls in a subquery with select * ... where false limit 0. This select * doesn't pass along pseudo-columns such as _PARTITIONTIME.

The first idea that came to mind:

select *, _PARTITIONTIME as partition_time
from dbt_jcohen.myingestiontable
where false limit 0
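
As a sketch of how that idea would interact with the model above: because the wrapper aliases the pseudo-column (here as partition_time), the model SQL would need to reference the alias rather than _PARTITIONTIME directly. This is illustrative only, not the actual implementation:

```sql
-- Sketch only: how the --empty wrapper from the idea above might render.
-- The pseudo-column survives because it is projected explicitly, but it
-- becomes a regular column named `partition_time`, so the model would
-- select `partition_time as ingestion_dt` instead of `_PARTITIONTIME`.
select
    customerId,
    partition_time as ingestion_dt
from (
    select *, _PARTITIONTIME as partition_time
    from `ab73-np-rawlay-dev-3324`.`customer_preference_center`.`customer_preference`
    where false limit 0
)
```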

Other ideas:

jtcohen6 commented 5 months ago

In the meantime, you can at least avoid the error by specifying .render() on any refs/sources that you don't want dbt to turn into where false limit 0 subqueries.

If we added support for flags.EMPTY, then you could write something like:

{% set src_cpc_raw = source('customer_preference_center', 'customer_preference') -%}

    select
        ...,
        _PARTITIONTIME as ingestion_dt
    from
        {{ src_cpc_raw.render() }},   -- this will be rendered simply into `project.dataset.identifier` (no subquery)
        unnest(centralPreferences) as centralPreferences
     where 1=1
{% if flags.EMPTY %}
    and false limit 0                 -- instead, I manually add the "empty limit" here
{% endif %}
{% if is_incremental() %}
    and date(ingestion_dt) >= date_sub("{{ latest_partition_filter(src_cpc_raw) }}", interval 1 day)
{% endif %}
qualify row_number() over (partition by customerId_token, preference order by updateTS desc, ingstn_ts desc) = 1

github-christophe-oudar commented 4 months ago

My suggestion to solve this issue is related to https://github.com/dbt-labs/dbt-core/discussions/8560: we need to be able to override the rendering of sources/refs.

For sources, we could add a parameter to the macro to include those metadata fields; for refs, since the pseudo-column is tied to "time_ingestion_partitioning": True, we should be able to detect them ourselves.
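
For the sources side, one purely hypothetical shape for such a parameter could be an extra argument on the source() call that tells dbt to project the pseudo-columns inside the --empty wrapper (include_pseudo_columns does not exist in dbt-bigquery today):

```sql
-- Hypothetical API sketch. With --empty, dbt could render this source as
-- (select *, _PARTITIONTIME from `project`.`dataset`.`table` where false limit 0)
-- so the pseudo-column remains addressable downstream.
select
    _PARTITIONTIME as ingestion_dt
from {{ source('customer_preference_center', 'customer_preference',
               include_pseudo_columns=true) }}
```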