dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

Remove / reduce downtime when running full refresh #26

Closed alepuccetti closed 2 years ago

alepuccetti commented 3 years ago

Describe the feature

Change the full refresh behaviour for BigQuery (or at least make it configurable).

Current behaviour:

  1. Delete model destination table.
  2. Run create or replace statement.

Proposed behaviour:

  1. Create the fully refreshed model in a temporary table (e.g. model_table_name__tmp).
  2. Delete model_table_name if the full refresh is successful.
  3. Copy model_table_name__tmp to model_table_name.
  4. Delete model_table_name__tmp.

This solution has two advantages:

  1. If the create or replace query fails, the original table is still present.
  2. Downtime shrinks to the window between the deletion and the copy, instead of the full query execution time, which could be several minutes or even hours.
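
To make the proposal concrete, here is a rough BigQuery SQL sketch of the four steps (project, dataset, and table names are placeholders, and the SELECT stands in for the model's actual query):

-- 1. Build the refreshed model in a temp table; the live table stays queryable.
CREATE TABLE `my_project.my_dataset.model_table_name__tmp` AS
SELECT * FROM `my_project.my_dataset.source_table`;

-- 2. Drop the old table only once the build above has succeeded.
DROP TABLE `my_project.my_dataset.model_table_name`;

-- 3. Recreate the destination from the temp table (a fast, metadata-level copy).
CREATE TABLE `my_project.my_dataset.model_table_name`
COPY `my_project.my_dataset.model_table_name__tmp`;

-- 4. Clean up the temp table.
DROP TABLE `my_project.my_dataset.model_table_name__tmp`;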

Describe alternatives you've considered

No alternative solution found.

Additional context

I am using BigQuery; I'm not sure whether this could be extended to other adapters.

Who will this benefit?

Anyone who needs to run very big full refreshes. The downtime they experience scales linearly with query execution time; this solution reduces it to the copy time, which is usually much faster than query execution for complex queries.

Are you interested in contributing this feature?

With some guidance.

jtcohen6 commented 3 years ago

Hey @alepuccetti, this issue is most welcome! Here's the deal: Normally, in full-refresh mode, dbt can atomically replace a BigQuery table by running create or replace table. This should work the vast majority of the time. That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table:

https://github.com/fishtown-analytics/dbt/blob/24e4b75c354e0f6bc5af7b51c4bb220428c84d5b/plugins/bigquery/dbt/include/bigquery/macros/materializations/incremental.sql#L105-L112
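
For context, the atomic path is a single statement along these lines (table, dataset, and column names here are placeholders, not from the linked code):

CREATE OR REPLACE TABLE `my_project.my_dataset.my_model`
PARTITION BY DATE(created_at)
CLUSTER BY customer_id
AS
SELECT * FROM `my_project.my_dataset.source_table`;
-- BigQuery rejects this statement if `my_model` already exists with a
-- different partitioning spec, which is the case that triggers the
-- hard refresh described above.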

What I'm hearing from you is: In the case where the table is not replaceable and must be hard-refreshed, dbt should try building the new table first in a temp location, then (if and only if the build succeeds) drop and swap. I completely agree! It's preferable for both of the reasons you mention, and this is how dbt replaces table models on databases without atomic create or replace DDL (e.g. Postgres, Redshift).

In fact, dbt-bigquery already has an adapter.rename_relation method that will take care of steps 3-4 (copy + delete temp version).

So I think this would look like changing the code above:

{% elif full_refresh_mode %}
    {#-- If the partition/cluster config has changed, then we must hard refresh: --#}
    {#-- create new table in a temp location, drop, and swap --#}
    {% set must_hard_refresh = (not adapter.is_replaceable(existing_relation, partition_by, cluster_by)) %}
    {% if must_hard_refresh %}
        {% do log("Creating " ~ tmp_relation ~ " because " ~ existing_relation ~ " is not replaceable") %}
        {% set build_sql = create_table_as(False, tmp_relation, sql) %}
    {% else %}
        {% set build_sql = create_table_as(False, target_relation, sql) %}
    {% endif %}
{% else %}

Then further down, after calling the 'main' statement that includes build_sql but before running hooks, you would add:

{% if full_refresh_mode and must_hard_refresh %}
    {% do log("Hard refreshing " ~ existing_relation ~ " because it is not replaceable") %} 
    {{ adapter.drop_relation(existing_relation) }}
    {{ adapter.rename_relation(tmp_relation, existing_relation) }}
{% endif %}

And I think that would just work!

It sounds like this is a code change you're interested in contributing. Let me know if you'd like any help getting started :)

alepuccetti commented 3 years ago

That statement will fail, however, if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table:

Exactly, otherwise a simple create or replace would do. Also, this could play well with another issue I saw about applying model changes only after the tests pass, rather than running the tests after the model has been re-created (sorry, I cannot find the issue now).

Thanks for the pointers on the implementation 😉.

However, regarding this:

In the case where the table is not replaceable and must be hard-refreshed

My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case. My test model has `materialized='incremental'` and `incremental_strategy='insert_overwrite'`. My debug log is:

2021-03-10 10:53:48.234858 (Thread-1): Began running node model.dbt_project.full_refresh_test
2021-03-10 10:53:48.237000 (Thread-1): 11:53:48 | 1 of 1 START incremental model alessandro__staging.full_refresh_test. [RUN]
2021-03-10 10:53:48.237549 (Thread-1): Acquiring new bigquery connection "model.dbt_project.full_refresh_test".
2021-03-10 10:53:48.237818 (Thread-1): Compiling model.project.full_refresh_test
2021-03-10 10:53:48.268434 (Thread-1): Writing injected SQL for node "model.dbt_project.full_refresh_test"
2021-03-10 10:53:48.361345 (Thread-1): Opening a new connection, currently in state closed
2021-03-10 10:53:48.637873 (Thread-1): Hard refreshing `project_id`.`alessandro__staging`.`full_refresh_test` because it is not replaceable
2021-03-10 10:53:48.865436 (Thread-1): Writing runtime SQL for node "model.dbt_project.full_refresh_test"

For some reason, this always happens. I have also tested with `materialized='table'`. @jtcohen6, am I missing something, or is this not the desired behaviour?

jtcohen6 commented 3 years ago

My tests show that dbt always does the delete / re-create behaviour even when the partition/clustering does not change, but from your explanation it seems that should not be the case.

Indeed, that's not the desired behavior; it sounds like a bug with is_replaceable. Which version of dbt are you using? Any chance you could stick an `ipdb`/`breakpoint()` into that method, try running `full_refresh_test`, and see what's up?

The proposal in this issue still has merit regardless: making hard refreshes more seamless, with less downtime, is a net-good thing. But hard refreshes really should be a rare occurrence.

alepuccetti commented 3 years ago

Which version of dbt are you using?

I am using 0.19.0

jtcohen6 commented 3 years ago

@alepuccetti Hm, I'm unable to reproduce this. Could you open a separate issue with all information about your model config (both in config() and in dbt_project.yml), so we can get to the bottom of what's happening?

Ideally, we'd be able to resolve both issues: the is_replaceable bug (in that separate issue) and the hard-refresh downtime proposal described here.

Let's keep this issue open for discussion of the latter.

gareginordyan commented 2 years ago

if the partition/cluster config has changed (between the existing table and the current model / new table to be built), so instead dbt "hard refreshes" the existing table

@jtcohen6 this was some time ago, but if I remember correctly, in our case our cluster/partition keys had not been changing and we were still having this downtime on a regular basis.

They were simply regular materialized models (non-incremental) that had both partition and cluster keys. I'll try to dig out the actual config.

cc: @bcolbert978

gareginordyan commented 2 years ago

@bcolbert978 I can't seem to find it in Slack, but if by any chance you have the model config handy, please paste it here.

balazs-mate commented 2 years ago

@gareginordyan: we had our fair share of problems with strange hard refreshes on non-incremental but partitioned and clustered tables. It all came down to the fact that we were using the cluster_by: "column_a,column_b" syntax instead of the documented cluster_by = ["column_a", "column_b"] syntax. When dbt checked whether the table was replaceable, the check always came back False, because ["column_a,column_b"] does not equal ["column_a", "column_b"]: https://github.com/dbt-labs/dbt-bigquery/blob/07bc1442c2e0711822983999b18d68ccd029cd41/dbt/adapters/bigquery/impl.py#L532
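
To illustrate with placeholder column names, the two configs below look almost the same but produce different cluster_by values:

{{ config(materialized='table', cluster_by="column_a,column_b") }}
{# compared as the one-element list ["column_a,column_b"], which never matches
   the table's actual clustering fields, so every run hard-refreshes #}

{{ config(materialized='table', cluster_by=["column_a", "column_b"]) }}
{# a real two-element list, which matches the table metadata #}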

bcolbert978 commented 2 years ago

@gareginordyan yes, tracked it down. @balazs-mate, it was just as you describe: models materialized as table (so non-incremental and not full-refresh specific) that were both partitioned and clustered (with a complex, four-column clustering scheme). We did not have the same syntax issue you describe, though.

Example of the config causing issues:

{{
  config(
    schema='transforms_bi',
    partition_by={'field': 'date', 'granularity': 'day'},
    cluster_by = ['sf_account_id', 'fivetran_account_id', '`group`', 'integration',],
    materialized='table'
  )
}}

We were able to resolve the downtime issue by removing partitioning and dropping our clustering scheme to two columns:

{{
  config(
    schema='transforms_bi',
    cluster_by = ['sf_account_id', 'date'],
    materialized='table'
  )
}}

which led us to believe that the downtime issue was related to how dbt and/or BigQuery handled creating the partitioning. The raw downtime (as measured by the slot time consumed by the new version) was lower, but not significantly so. We now favor clustering over partitioning in nearly all cases and haven't seen these issues since.

This was all back in October 2021, running dbt version 0.19.1.

github-actions[bot] commented 2 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

OmarArain commented 1 year ago

I'm seeing a similar issue. I am materializing a BigQuery table that is partitioned and clustered. I run `dbt run --select my_table` every 15 minutes, and downstream consumers report periodic downtime with errors like `Not found: Table my_table was not found in location US`.