jtcohen6 opened this issue 3 years ago
@jtcohen6 I can take a stab at this!
Thanks for answering the call, @prratek!
The change in dbt-labs/dbt#3386 looks great at first glance. I'll take a more detailed look tomorrow.
Hi @prratek and @jtcohen6, thanks so much for working on this. I'm wondering if we could solve this for an even more flexible use case though.
My team is running across a scenario where we want to use an incremental strategy in a hybrid kind of mode. Our dataset is very large, on the order of many TBs per daily partition. We also have late-arriving data for about 3 days, so we need an incremental solution.
The current `merge` strategy, without a partition filter on `DBT_INTERNAL_DEST`, causes the query to scan PBs across the full table, making it untenable as an option. If we tried `insert_overwrite`, the sheer size of replacing 3 full partitions also takes too long because it is processing so many rows. The thing is, the older 2 dates only need a small fraction of their rows processed, so almost 66% of the 3-day range is unnecessary work. The `merge` strategy is so close to being feasible for us if there were just a way to pass dbt a partition filter for use with merging on `DBT_INTERNAL_DEST`. The current proposal here to do a static `DBT_INTERNAL_DEST.[partition_col] is not null` is a good start, but if we could just tell it to only load 3 days, we would have a great solution. We don't need dbt to auto-guess which partitions. We would be very happy telling it.

As far as how this helps when the query will still pull 3 full days' worth of bytes: we are on slot billing, so even with the extra bytes pulled, any extra row pruning that can be slapped on will save significant slot time. For people on bytes billing, it would still benefit them on query time despite costing the same.
Hey @joshpeng, thanks for all the context there!
I think https://github.com/fishtown-analytics/dbt/issues/3293 + https://github.com/fishtown-analytics/dbt/pull/3294 may actually be the change you're looking for: the ability to pass custom filters to a `predicates` config, and have them populate in the `merge` DML statement alongside the "built-in" predicates such as the `unique_key` match. Per the proposed syntax, you'd be able to do something like:
```jinja
{{ config(
    partition_by = {"field": "my_partition_col"},
    incremental_strategy = "merge",
    incremental_predicates = [
        "DBT_INTERNAL_DEST.my_partition_col > date_sub(current_date, interval 3 day)"
    ]
) }}
```
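For reference, a config like that should render into merge DML along these lines. This is a sketch with invented project, dataset, table, and column names, not dbt's verbatim output:

```sql
merge into `my-project`.`my_dataset`.`my_table` as DBT_INTERNAL_DEST
using (
    select * from `my-project`.`my_dataset`.`my_table__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
    -- the custom predicate from incremental_predicates prunes the scan of the target
    and DBT_INTERNAL_DEST.my_partition_col > date_sub(current_date, interval 3 day)
when matched then update set sales = DBT_INTERNAL_SOURCE.sales
when not matched then insert (id, sales, my_partition_col)
    values (DBT_INTERNAL_SOURCE.id, DBT_INTERNAL_SOURCE.sales, DBT_INTERNAL_SOURCE.my_partition_col)
```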
The goal of this issue, in the meantime, is to make sure that all incremental strategies + `require_partition_filter` "just work" for users who don't want to have to think about supplying custom predicates/filters.
@jtcohen6 Amazing. Love how you're always 20 steps ahead of my asks 😄
I have a use case that doesn't fit any of the available incremental strategies: I have a table with 3 columns as its key, all of them strings. I can't use `merge` with 3 columns, and I can't use `insert_overwrite` because all the columns are string type. Hope the new improvement will help.
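(One common workaround, not suggested in this thread itself, is to collapse the three string columns into a single surrogate key so `merge` has one `unique_key` to match on. A sketch with hypothetical column and model names:)

```sql
-- hypothetical model: key_col_1/2/3 and my_source_model are invented names
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'merge',
        unique_key = 'surrogate_key'
    )
}}

select
    -- concatenate the 3-column key into one string for merge matching
    concat(key_col_1, '|', key_col_2, '|', key_col_3) as surrogate_key,
    *
from {{ ref('my_source_model') }}
```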
I'm trying this new feature with the `merge` strategy and there are a few things I'd like to mention:

I may be mistaken, but I can't find anything like the following in v0.20. I've looked at `incremental.sql` and `merge.sql`. Its absence is causing my merge to fail; I'm assuming that it should be in v0.20:
```jinja
{% is_partition_filter_required = config.get('require_partition_filter', false) %}
{% set predicates = [] %}
{% if is_partition_filter_required %}
    {% set partition_filter %}
        ({{ partition_by.field }} is not null or {{ partition_by.field }} is null)
    {% endset %}
    {% do predicates.append(partition_filter) %}
{% endif %}
```
This feature doesn't seem to work for existing tables, and it made me think: BigQuery allows `require_partition_filter` to be switched on and off. It would be good if dbt could make use of this option.

`({{ partition_by.field }} is not null or {{ partition_by.field }} is null)` will work well if you are doing an initial load, but the ability to specify values would be great. We've managed to do it like so...

Config requirements

To easily identify the partition field and the source of applicable partition field values, we have added these options to the model's `{{ config() }}` block:
```jinja
{{ config(
    ...
    partition_filter_field = 'PARTITION_DATE',
    partition_filter_source = 'dbt_model_name',
    ...
) }}
```
`partition_filter_field` - optional; use if different from `partition_by`.
`partition_filter_source` - the source of the partition values.

MERGE code changes

We need to change the standard dbt MERGE code to make use of the new options available in the model's `{{ config() }}` block. A copy of the standard macro was placed in the macros folder.
The logic flow is basically: read `partition_filter_field` and `partition_filter_source` from the model config, look up the applicable partition values from the source, and append them as an extra condition on `DBT_INTERNAL_DEST` in the merge's `on` clause.
Before:

```sql
merge into `project`.`schema`.`table` as DBT_INTERNAL_DEST
using (
    SELECT
    ...
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.UNIQUE_KEY = DBT_INTERNAL_DEST.UNIQUE_KEY
```
After:

```sql
merge into `project`.`schema`.`table` as DBT_INTERNAL_DEST
using (
    SELECT
    ...
) as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.UNIQUE_KEY = DBT_INTERNAL_DEST.UNIQUE_KEY
    and ( DBT_INTERNAL_DEST.PARTITION_DATE = "2020-11-07" or
          DBT_INTERNAL_DEST.PARTITION_DATE = "2020-10-28")
```
Our change caters for integer partitions as well.
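(A minimal sketch of how such a predicate could be assembled, assuming a `run_query` lookup against the `partition_filter_source` model. The macro name and structure here are hypothetical, not the poster's actual code:)

```jinja
{% macro get_partition_filter_predicate() %}
    {# hypothetical helper: collect distinct partition values from the configured source model #}
    {%- set filter_field = config.get('partition_filter_field') -%}
    {%- set filter_source = config.get('partition_filter_source') -%}
    {%- if execute -%}
        {%- set results = run_query(
            'select distinct ' ~ filter_field ~ ' as v from ' ~ ref(filter_source)
        ) -%}
        {%- set clauses = [] -%}
        {%- for row in results.rows -%}
            {%- do clauses.append('DBT_INTERNAL_DEST.' ~ filter_field ~ ' = "' ~ row['v'] ~ '"') -%}
        {%- endfor -%}
        {# returns e.g. ( DBT_INTERNAL_DEST.PARTITION_DATE = "2020-11-07" or ... ) #}
        {{ return('( ' ~ clauses | join(' or ') ~ ' )') }}
    {%- endif -%}
{% endmacro %}
```

The returned string can then be appended with `and` to the `on` clause in the copied merge macro, producing the "After" SQL above.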
Had a bit of a struggle when trying to make the following work: `require_partition_filter = true` with `incremental_strategy = 'merge'`, overriding the `bq_generate_incremental_build_sql()` macro using the suggestion above. I was getting "ambiguous column names" with `dbt run`, and when placing the compiled SQL into the BQ console, I got the "Cannot query over table ... without a filter over column(s) ..." error.
The following worked for me:

(1) Override the `bq_generate_incremental_build_sql()` macro by adding the following macro to the project. The only thing I changed from the above suggestions was adding the `DBT_INTERNAL_DEST` / `DBT_INTERNAL_SOURCE` qualifiers to the `partition_filter` jinja var, and it looked like there was a missing `set` in the `{% is_partition_filter_required = config.get('require_partition_filter', false) %}` line, which I added back in.
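(The macro body itself wasn't included in the comment. A sketch of what it plausibly looked like, based on the snippet above with those two fixes applied; the argument list and surrounding logic are assumed to mirror the stock dbt-bigquery macro of that era:)

```jinja
{% macro bq_generate_incremental_build_sql(
    strategy, tmp_relation, target_relation, sql, unique_key, partition_by, partitions, dest_columns
) %}
    {# sketch: argument list assumed to match the stock v0.20 macro;       #}
    {# only the merge branch is shown, insert_overwrite is left unchanged  #}
    {% set is_partition_filter_required = config.get('require_partition_filter', false) %}
    {% set predicates = [] %}
    {% if is_partition_filter_required %}
        {# qualify both merge aliases so the always-true filter is unambiguous #}
        {% set partition_filter %}
            (DBT_INTERNAL_DEST.{{ partition_by.field }} is not null
                or DBT_INTERNAL_DEST.{{ partition_by.field }} is null)
            and (DBT_INTERNAL_SOURCE.{{ partition_by.field }} is not null
                or DBT_INTERNAL_SOURCE.{{ partition_by.field }} is null)
        {% endset %}
        {% do predicates.append(partition_filter) %}
    {% endif %}
    {% set build_sql = get_merge_sql(target_relation, tmp_relation, unique_key, dest_columns, predicates) %}
    {{ return(build_sql) }}
{% endmacro %}
```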
(2) And then to test the above override out, add a simple incremental model like so:
```sql
-- models/my_inc_model.sql
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'merge',
        unique_key = 'user_id',
        pre_hook = "alter table if exists {{ this }} set options (require_partition_filter = false)",
        post_hook = "alter table if exists {{ this }} set options (require_partition_filter = true)",
        partition_by = {
            "field": "updated_at",
            "data_type": "date",
            "granularity": "month"
        },
        require_partition_filter = true
    )
}}

select 1 as user_id, 100 as sales, parse_date('%Y-%m-%d', '2021-01-01') as updated_at
union all
select 2 as user_id, 200 as sales, parse_date('%Y-%m-%d', '2021-02-01') as updated_at
union all
select 3 as user_id, 300 as sales, parse_date('%Y-%m-%d', '2021-03-01') as updated_at
```
(3) Do a `dbt run --full-refresh` and inspect the table in BigQuery by selecting from it.
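(Note that once the post-hook re-enables `require_partition_filter`, a bare `select *` against the table will error with the "Cannot query over table ... without a filter" message; include a partition filter when inspecting. Project/dataset names below are assumed:)

```sql
select *
from `my-project`.`my_dataset`.`my_inc_model`
where updated_at >= '2021-01-01'
```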
(4) Modify the incremental model above to simulate new incoming data like so:
```sql
-- models/my_inc_model.sql
{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'merge',
        unique_key = 'user_id',
        pre_hook = "alter table if exists {{ this }} set options (require_partition_filter = false)",
        post_hook = "alter table if exists {{ this }} set options (require_partition_filter = true)",
        partition_by = {
            "field": "updated_at",
            "data_type": "date",
            "granularity": "month"
        },
        require_partition_filter = true
    )
}}

select 1 as user_id, 100 as sales, parse_date('%Y-%m-%d', '2021-01-01') as updated_at
union all
select 2 as user_id, 200 as sales, parse_date('%Y-%m-%d', '2021-02-01') as updated_at
union all
select 3 as user_id, 400 as sales, parse_date('%Y-%m-%d', '2021-03-01') as updated_at
union all
select 4 as user_id, 400 as sales, parse_date('%Y-%m-%d', '2021-04-01') as updated_at
```
(5) Do a `dbt run` and inspect the table in BigQuery by selecting from it.
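(If the merge behaves as expected, user 3's sales should now read 400 and user 4 should be newly inserted. A quick check, again with assumed project/dataset names:)

```sql
select user_id, sales, updated_at
from `my-project`.`my_dataset`.`my_inc_model`
where updated_at >= '2021-01-01'
order by user_id
-- expected: (1, 100), (2, 200), (3, 400), (4, 400)
```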
Certainly not ideal with the pre/post hooks that seem to duplicate what the `require_partition_filter` config would have done, but it could work for someone experiencing the same errors.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Reopening issue on behalf of an enterprise customer.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Reopening issue on behalf of an enterprise customer.
Hey @github-christophe-oudar, was that ever on your radar? Just wondering if you're smelling something fishy here - I trust your instincts ;)
@Fleid That's a lot of history to read!
I know there are still issues with `require_partition_filter`, and my workaround so far is to use `copy_partitions: true` to avoid running into the MERGE-related issues for `insert_overwrite` mode. For `merge` mode, however, it might still be a problem...

To be honest, I think it would be interesting to create test cases in the integration tests for dbt-bigquery to see what's failing, and to move forward with a TDD approach here, as there are a lot of possibilities (and some are thorny, e.g. I had an issue where it failed when the input dataset had no data to merge).
I hope we'll be able to carve some time to make incremental processing better in the back half of the year. In the meantime we can keep this one open, and thanks for the workaround!
Describe the feature
Picking up from dbt-labs/dbt#2928, which added support for two new configs in dbt-bigquery: `require_partition_filter` and `partition_expiration_days`.

Let's ensure that `require_partition_filter` works with all the permutations of incremental models on BigQuery. Anyone is welcome to pick this up as a contribution for `v0.20.0`!

The `merge` strategy

We need the merge condition to include a filter on the destination table's partition column, so that BigQuery's partition-filter requirement is satisfied.
This could be accomplished by passing an additional `predicate` to `get_merge_sql` here, something like the snippet below.
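(Reconstructed from the snippet quoted in the comments above, which one commenter noted was missing a `set`; shown here with it restored:)

```jinja
{% set is_partition_filter_required = config.get('require_partition_filter', false) %}
{% set predicates = [] %}
{% if is_partition_filter_required %}
    {% set partition_filter %}
        ({{ partition_by.field }} is not null or {{ partition_by.field }} is null)
    {% endset %}
    {% do predicates.append(partition_filter) %}
{% endif %}
```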
This is a bit of a hack—filtering only in this sense—but to be honest there isn't any straightforward way dbt can know in advance the specific partitions it's merging into. For that, you should use...
The `insert_overwrite` strategy

The `require_partition_filter` config works just fine with the "static" `insert_overwrite` strategy—when the user supplies the values in advance via the `partitions` config—which is also the most performant option for updating very large datasets. (It would still be a good idea to add a test for this.)

For the "dynamic" `insert_overwrite` strategy, the current error comes in step 2:
https://github.com/fishtown-analytics/dbt/blob/1f927a374c8bd52a12a20d892fed9d59cffd04f4/plugins/bigquery/dbt/include/bigquery/macros/materializations/incremental.sql#L63-L68
We need to either:
- add `where {{ partition_by.field }} is not null` to that step, thereby satisfying the filter requirement (see the sketch after this list), or
- avoid setting `require_partition_filter` when creating the temp table. Should this be a more general rule, that if `temporary = True`, the `create table` statement shouldn't set `require_partition_filter = True`? I think it would make good sense.
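(For concreteness, a sketch of the first option applied to that step, reconstructed from the linked macro; the surrounding `declare` and the exact text vary by version:)

```sql
-- step 2 of the dynamic insert_overwrite script, with the added filter
-- that satisfies require_partition_filter on the temp relation
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct {{ partition_by.render() }})
    from {{ tmp_relation }}
    where {{ partition_by.field }} is not null
);
```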
Describe alternatives you've considered

Not supporting `require_partition_filter` for all types of incremental models. I think we definitely should support it!
Related unsolved questions

Beyond `require_partition_filter`, should dbt also set `partition_expiration_days` for incremental models? Personally, I'm not so sure.

Who will this benefit?