alepuccetti closed this issue 2 years ago
Hey @alepuccetti, this issue is most welcome! Here's the deal: normally, in full-refresh mode, dbt can atomically replace a BigQuery table by running `create or replace table`. This should work the vast majority of the time. That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table:
What I'm hearing from you is: in the case where the table is not replaceable and must be hard-refreshed, dbt should try building the new table first in a temp location, then (if and only if the build succeeds) drop and swap. I completely agree! It's preferable for both of the reasons you mention, and this is how dbt replaces table models on databases without atomic `create or replace` DDL (e.g. Postgres, Redshift).
In fact, dbt-bigquery already has an `adapter.rename_relation` method that will take care of steps 3-4 (copy + delete the temp version).
So I think this would look like changing the code above:
```jinja
{% elif full_refresh_mode %}
  {#-- If the partition/cluster config has changed, then we must hard refresh: --#}
  {#-- create new table in a temp location, drop, and swap --#}
  {% set must_hard_refresh = (not adapter.is_replaceable(existing_relation, partition_by, cluster_by)) %}
  {% if must_hard_refresh %}
    {% do log("Creating " ~ tmp_relation ~ " because " ~ existing_relation ~ " is not replaceable") %}
    {% set build_sql = create_table_as(False, tmp_relation, sql) %}
  {% else %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% endif %}
{% else %}
```
Then further down, after calling the 'main' statement that includes `build_sql`, but before running hooks, you would add:
```jinja
{% if full_refresh_mode and must_hard_refresh %}
  {% do log("Hard refreshing " ~ existing_relation ~ " because it is not replaceable") %}
  {{ adapter.drop_relation(existing_relation) }}
  {{ adapter.rename_relation(tmp_relation, existing_relation) }}
{% endif %}
```
And I think that would just work!
It sounds like this is a code change you're interested in contributing. Let me know if you'd like any help getting started :)
> That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table:
Exactly, otherwise a simple `create or replace` would do. Also, this could play well with another issue I saw about applying the model changes only after the tests pass, rather than running the tests after the model has been re-created (sorry, I cannot find the issue now).
Thanks for the pointers on the implementation 😉.
However, regarding this:

> In the case where the table is not replaceable and must be hard-refreshed

My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case. My test model has `materialized='incremental'` and `incremental_strategy='insert_overwrite'`, and my debug log is:
```
2021-03-10 10:53:48.234858 (Thread-1): Began running node model.dbt_project.full_refresh_test
2021-03-10 10:53:48.237000 (Thread-1): 11:53:48 | 1 of 1 START incremental model alessandro__staging.full_refresh_test. [RUN]
2021-03-10 10:53:48.237549 (Thread-1): Acquiring new bigquery connection "model.dbt_project.full_refresh_test".
2021-03-10 10:53:48.237818 (Thread-1): Compiling model.project.full_refresh_test
2021-03-10 10:53:48.268434 (Thread-1): Writing injected SQL for node "model.dbt_project.full_refresh_test"
2021-03-10 10:53:48.361345 (Thread-1): Opening a new connection, currently in state closed
2021-03-10 10:53:48.637873 (Thread-1): Hard refreshing `project_id`.`alessandro__staging`.`full_refresh_test` because it is not replaceable
2021-03-10 10:53:48.865436 (Thread-1): Writing runtime SQL for node "model.dbt_project.full_refresh_test"
```
For some reason, this always happens. I have also tested with `materialized='table'`. @jtcohen6 Am I missing something, or is this not the desired behaviour?
> My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case.
Indeed, that's not the desired behavior; it sounds like a bug with `is_replaceable`. Which version of dbt are you using? Any chance you could stick `ipdb`/`breakpoint()` into that method, try running `full_refresh_test`, and see what's up?
The proposal in this issue still has merit regardless (making hard refreshes more seamless with less downtime is a net-good thing), but they really should be a rare occurrence.
> Which version of dbt are you using?
I am using 0.19.0.
@alepuccetti Hm, I'm unable to reproduce this. Could you open a separate issue with all the information about your model config (both in `config()` and in `dbt_project.yml`), so we can get to the bottom of what's happening?
Ideally, we'd be able to resolve both issues: the `is_replaceable` bug (in the new issue) and the hard-refresh downtime (this one). Let's keep this issue open for discussion of the latter.
> if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table
@jtcohen6 this was some time ago, but if I remember correctly, in our case our cluster/partition keys had not been changing and we were still having this downtime on a regular basis.
They were simply regular materialized models (non-incremental) that had both partition and cluster keys. I'll try to dig out the actual config.
cc: @bcolbert978
@bcolbert978 I can't seem to find it in Slack, but if by any chance you have the model config handy, please paste it here.
@gareginordyan: we had our fair share of problems with strange hard refreshes on non-incremental but partitioned and clustered tables. It all came down to the fact that we were using the `cluster_by: "column_a,column_b"` syntax instead of the documented `cluster_by = ["column_a", "column_b"]` syntax. When dbt checked whether the table is replaceable, it always came back with False, since `["column_a,column_b"]` does not equal `["column_a", "column_b"]`. https://github.com/dbt-labs/dbt-bigquery/blob/07bc1442c2e0711822983999b18d68ccd029cd41/dbt/adapters/bigquery/impl.py#L532
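To make the failure mode concrete, here is a minimal Python sketch of the comparison a replaceability check like this effectively performs. The helper below is illustrative only (it is not the actual dbt-bigquery implementation); it shows why a comma-separated string can never match the table's real clustering fields:

```python
def clustering_matches(existing_cluster_fields, configured_cluster_by):
    """Illustrative stand-in for the clustering comparison inside a
    replaceability check: the table is only replaceable in place if the
    configured clustering fields exactly equal the existing ones."""
    # A cluster_by given as a plain string is normalized into a
    # one-element list, mirroring how dbt treats string configs.
    if isinstance(configured_cluster_by, str):
        configured_cluster_by = [configured_cluster_by]
    return existing_cluster_fields == configured_cluster_by

# Documented list syntax: fields match, so no hard refresh is needed.
assert clustering_matches(["column_a", "column_b"], ["column_a", "column_b"])

# Comma-separated string syntax: it normalizes to the single field
# "column_a,column_b", which never equals the table's actual clustering
# fields, so every run is treated as a hard refresh.
assert not clustering_matches(["column_a", "column_b"], "column_a,column_b")
```

This is exactly the `["column_a,column_b"] != ["column_a", "column_b"]` mismatch described above.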
@gareginordyan yes, tracked it down, and @balazs-mate it was just as you describe: models materialized as `table` (so non-incremental and not full-refresh specific) and tables that were both partitioned and clustered (with a complex, four-column clustering scheme). We did not have the same syntax issues you describe, though.
Example of the config causing issues:
```jinja
{{
  config(
    schema='transforms_bi',
    partition_by={'field': 'date', 'granularity': 'day'},
    cluster_by=['sf_account_id', 'fivetran_account_id', '`group`', 'integration'],
    materialized='table'
  )
}}
```
We were able to resolve the downtime issue by removing partitioning and dropping our clustering scheme to two columns:
```jinja
{{
  config(
    schema='transforms_bi',
    cluster_by=['sf_account_id', 'date'],
    materialized='table'
  )
}}
```
which led us to believe that the downtime issue was related to the way dbt and/or BigQuery was handling creation of the partitioning. The raw downtime (as measured by the slot time consumed by the new version) was lower, but not significantly lower. We now favor clustering over partitioning in nearly all cases and haven't seen these issues since.
This was all back in October 2021, running dbt version 0.19.1.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
I'm seeing a similar issue. I am materializing a BigQuery table that is partitioned and clustered. I run `dbt run --select my_table` every 15 minutes, and the downstream consumers report periodic downtime with errors like `Not found: Table my_table was not found in location US`.
Describe the feature
Change the full refresh behaviour for BigQuery (or at least make it configurable).
Current behaviour:
1. Run a single `create or replace` statement.
Proposed behaviour:
1. Run the full refresh into a temp table (`model_table_name__tmp`).
2. Delete `model_table_name` if the full refresh is successful.
3. Copy `model_table_name__tmp` to `model_table_name`.
4. Delete `model_table_name__tmp`.
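The steps above can be sketched in Python, using a plain dict as a stand-in for the dataset (the function and names here are illustrative, not dbt or BigQuery APIs). The key property is that a failed build leaves the existing table untouched:

```python
def full_refresh_with_swap(dataset, table, build_fn):
    """Build into <table>__tmp first; only drop and swap if the build
    succeeds. `dataset` maps table names to rows, standing in for a
    warehouse; `build_fn` produces the new table contents."""
    tmp = table + "__tmp"
    try:
        dataset[tmp] = build_fn()    # 1. full refresh into a temp table
    except Exception:
        dataset.pop(tmp, None)       # failed build: existing table untouched
        raise
    del dataset[table]               # 2. delete the old table
    dataset[table] = dataset[tmp]    # 3. copy tmp over the model name
    del dataset[tmp]                 # 4. delete the temp table

ds = {"orders": ["old rows"]}

# A successful rebuild swaps in the new rows and cleans up the temp table.
full_refresh_with_swap(ds, "orders", lambda: ["new rows"])
assert ds == {"orders": ["new rows"]}

# A failing build leaves the existing table in place.
try:
    full_refresh_with_swap(ds, "orders", lambda: 1 / 0)
except ZeroDivisionError:
    pass
assert ds == {"orders": ["new rows"]}
```

In the real implementation, steps 3-4 correspond to what `adapter.rename_relation` does, as discussed earlier in the thread.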
This solution has two advantages: the existing table remains available to consumers while the new one is being built, and if the full refresh fails, the existing table is left intact.
Describe alternatives you've considered
No alternative solution found.
Additional context
I am using BigQuery; not sure if this could be extended to other adapters.
Who will this benefit?
Anyone that needs to run very big full refreshes. This issue affects them linearly with the query execution time; the solution reduces the downtime to the copy time, which is usually much faster than query execution for complex queries.
Are you interested in contributing this feature?
With some guidance.