Looking at the BigQuery query logs, the connector appears to execute the following block:
DECLARE partitions_to_delete DEFAULT (
  SELECT ARRAY_AGG(DISTINCT(TIMESTAMP_TRUNC(`txndate`, DAY)) IGNORE NULLS)
  FROM `spark_test.mytable2316280012714`);
MERGE `spark_test.mytable` AS TARGET
USING `spark_test.mytable2316280012714` AS SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND TIMESTAMP_TRUNC(`target`.`txndate`, DAY) IN UNNEST(partitions_to_delete)
  THEN DELETE
WHEN NOT MATCHED BY TARGET THEN
  INSERT (`id`, `name`, `txndate`)
  VALUES (`id`, `name`, `txndate`);
Here the partitioning column txndate is only referenced through a call to the TIMESTAMP_TRUNC function, and I believe this is what makes the BQ optimizer reject the query: with require_partition_filter enabled, a filter on a function of the partitioning column apparently does not count as a partition filter.
A potential workaround for date-partitioned tables would be to change the above block into something like this:
DECLARE partitions_to_delete DEFAULT (
  SELECT ARRAY_AGG(DISTINCT(DATE(TIMESTAMP_TRUNC(`txndate`, DAY))) IGNORE NULLS)
  FROM `spark_test.mytable2316280012714`);
MERGE `spark_test.mytable` AS TARGET
USING `spark_test.mytable2316280012714` AS SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND `target`.`txndate` IN UNNEST(partitions_to_delete)
  THEN DELETE
WHEN NOT MATCHED BY TARGET THEN
  INSERT (`id`, `name`, `txndate`)
  VALUES (`id`, `name`, `txndate`);
Any chance this could be fixed in the next connector version?
To reproduce: a BigQuery table, partitioned by day on txndate, is created first.
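A minimal sketch of the DDL, with column types assumed from the column names in the MERGE statement shown in the query logs above:

-- Sketch only; the actual DDL may differ, column types are assumptions
CREATE TABLE `spark_test.mytable` (
  `id` INT64,
  `name` STRING,
  `txndate` TIMESTAMP
)
-- Day partitioning on txndate, matching the TIMESTAMP_TRUNC(txndate, DAY) in the connector's MERGE
PARTITION BY TIMESTAMP_TRUNC(`txndate`, DAY);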
Scala Spark code is then used to populate the table with dummy data via a df.write call in overwrite mode.
This works as expected (i.e., the table partitions included in the DataFrame are overwritten). However, the behavior changes once we alter the table to enable the require_partition_filter option.
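Enabling that option amounts to a DDL statement along these lines (a sketch; the exact statement used may have differed):

-- Force every query against the table to filter on the partitioning column
ALTER TABLE `spark_test.mytable`
SET OPTIONS (require_partition_filter = TRUE);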
The same df.write call then fails, complaining about the lack of a partition filter in the query.

Tested with the following:
Spark 3.1.3 on Dataproc
Scala 2.12
BigQuery Connector: spark-bigquery-with-dependencies_2.12-0.40.0.jar