dataform-co / dataform

Dataform is a framework for managing SQL based data operations in BigQuery
https://cloud.google.com/dataform/docs
Apache License 2.0
854 stars 166 forks source link

Assertions fail on partitioned BigQuery tables with `requirePartitionFilter` enabled #1622

Open RyuSA opened 11 months ago

RyuSA commented 11 months ago

Problem Summary:

I found that assertions for SQLX scripts fail with the following error when config.bigquery.requirePartitionFilter: true:

Cannot query over table 'dataform.repro' without a filter over column(s) 'date' that can be used for partition elimination

Minimal Reproduction:

config {
  type: "incremental", // or "table"
  bigquery: {
    partitionBy: "date",
    requirePartitionFilter: true,
  },
  assertions: {
    uniqueKey: "id"
  }
}

SELECT
  CURRENT_DATE('Japan') AS date,
  1 AS id,

The error message originates from BigQuery. I found BigQueryAdaptor uses Adaptor.indexAssertion for the assertion, ignoring BigQueryOptions.

This means the generated assertion SQL does not contain filters for partition columns.

Problem Details:

This is the assertion SQL for the minimal reproduction.

SELECT
  *
FROM (
  SELECT
    id,
    COUNT(1) AS index_row_count
  FROM `project.dataform.repro`
  GROUP BY id
  // missing a filter for column "id"
  ) AS data
WHERE index_row_count > 1

The SQL is generated by Adaptor.indexAssertion and it is missing the filter. This is because that BigQueryAdaptor{.indexAssertion || .rowConditionsAssertion} disregards BigQueryOptions during assertions, causing partition settings to be neglected(=assertion query does not contain partition clause.).

BigQueryAdaptor should have implementations for the methods, like below.

export class BigQueryAdapter extends Adapter implements IAdapter {

    // for uniqueKey
    public indexAssertion(dataset: string, indexCols: string[]) {
        // do something to filter partition columns
        const partitionColumnStatement = ...

        return `
SELECT
  *
FROM (
  SELECT
    ${commaSeparatedColumns},
    COUNT(1) AS index_row_count
  FROM ${dataset}
  GROUP BY ${commaSeparatedColumns}
  AND ${partitionColumnStatement}
  ) AS data
WHERE index_row_count > 1 
`
    }
}
cwdjankoski commented 1 month ago

I just hit the same problem with equivalent context - a bigquery partitioned table and assertions from within the file config block. the assertions are failing due to the wrong sql that is auto-generated.

my solution is to avoid using the built-in assertions from within the config block as you cannot rely on the sql it generates.

it's been quite some time since this was reported - i was wondering if there were any developments in the meantime ? @Ekrekr thank you in advance for your thoughts.