Feature description
I'm using delete+insert for a BigQuery task, and it's pretty expensive since it scans the whole table for deletes. The problem is that BigQuery can't prune partitions when the partition filter is computed dynamically, as in the use case below.
Are you a dlt user?
Yes, I run dlt in production.
Use case
delete+insert on a column called partition_col, which is a timestamp/date/datetime column.
-- DLT DELETE INSERT
-- 33 GB
DELETE FROM `project.dataset.table` AS d
WHERE EXISTS (SELECT 1 FROM `project.dataset_staging.table` AS s WHERE d.`partition_col` = s.`partition_col`);
Proposed solution
A solution is to assign the distinct partition values from the staging table to a variable, and then delete only the rows whose partition values are in that array.
-- ALTERNATIVE SOLUTION
-- 180 MB
DECLARE partition_values ARRAY<TIMESTAMP> DEFAULT (SELECT ARRAY_AGG(DISTINCT partition_col) FROM `project.dataset_staging.table` AS s);
DELETE FROM `project.dataset.table` AS d
WHERE partition_col IN UNNEST(partition_values);
The potential problems with this solution, which are not that significant, are:
We need to know the data type of the incremental key (timestamp in this example); see the sketch after this list.
BigQuery has a 100 MB limit for a single cell, and since the partition values are stored in an array, that array can't exceed 100 MB, which is very unlikely to happen for incremental keys.
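To illustrate the first concern, here is a minimal Python sketch of how the delete statement could be generated once the incremental key's data type is known. The function name build_partition_pruning_delete and the type mapping are hypothetical, not part of dlt's actual API; they only show the shape of the change under the assumptions above.

# Hypothetical sketch: build the DECLARE-based delete script from the
# incremental key's data type. Names here are illustrative only.

# Map a few common incremental-key column types to BigQuery array element types.
BQ_ARRAY_TYPES = {
    "timestamp": "TIMESTAMP",
    "date": "DATE",
    "datetime": "DATETIME",
}


def build_partition_pruning_delete(
    table: str, staging_table: str, partition_col: str, col_type: str
) -> str:
    """Return a BigQuery script that collects the distinct partition values
    from the staging table into a variable and deletes only those values
    from the destination table."""
    bq_type = BQ_ARRAY_TYPES[col_type.lower()]
    return (
        f"DECLARE partition_values ARRAY<{bq_type}> DEFAULT ("
        f"SELECT ARRAY_AGG(DISTINCT `{partition_col}`) FROM `{staging_table}`);\n"
        f"DELETE FROM `{table}` AS d\n"
        f"WHERE d.`{partition_col}` IN UNNEST(partition_values);"
    )


if __name__ == "__main__":
    # Example: reproduce the alternative statement from the proposed solution.
    print(
        build_partition_pruning_delete(
            "project.dataset.table",
            "project.dataset_staging.table",
            "partition_col",
            "timestamp",
        )
    )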
Related issues
No response