dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0
2.65k stars 176 forks source link

Optimize BigQuery Partitioned Table Incremental Queries #2039

Open sabrikaragonen opened 1 week ago

sabrikaragonen commented 1 week ago

Feature description

I'm using delete+insert for a BigQuery task, and it's pretty expensive since it scans the whole table for deletes. The problem here is, BigQuery can't utilize partitions when the partition is calculated dynamically as in the use case

Are you a dlt user?

Yes, I run dlt in production.

Use case

delete+insert on a column called partition_col which is a timestamp/date/datetime column.

-- DLT DELETE INSERT
-- 33 GB
DELETE FROM `project.dataset.table` AS d 
WHERE EXISTS (SELECT 1 FROM `project.dataset_staging.table` AS s WHERE d.`partition_col` = s.`partition_col`);

Proposed solution

A solution to that is, assigning the partition values to a variable, and then deleting the values in that array.

-- ALTERNATIVE SOLUTION
-- 180 MB
declare partition_values array<timestamp> default (SELECT array_agg(distinct partition_col) FROM `project.dataset_staging.table` AS s);

DELETE FROM `project.dataset.table` AS d 
WHERE partition_col IN unnest(partition_values);

The potential problems of this solution which are not that big are:

Related issues

No response