dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Bigquery error for tables with metadata caching + hive partitions #303


kendalldyke14 commented 2 months ago

Describe the bug

Error when creating an external table in BigQuery with metadata caching, hive partitions, and defined columns; specifically, the error is raised during execution of the bigquery__update_external_table_columns macro.

Steps to reproduce

BigLake connection IDs are project-specific, so anyone testing will need to generate a connection ID tied to their own GCP project. Follow the directions here: https://cloud.google.com/bigquery/docs/create-cloud-resource-connection#create-cloud-resource-connection
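
For reference, the linked docs boil down to a single bq CLI call along these lines (the location and IDs here are placeholders; substitute your own):

% bq mk --connection --location=us --project_id=your-project-id \
    --connection_type=CLOUD_RESOURCE your-connection-id

The docs also cover granting the connection's service account read access to the GCS bucket, which is required before the table can be created.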

Replace the connection_name value with your own; no other changes to this YAML are required.

sources:
  - name: hive_metadata_bug
    tables:
      - name: austin_bikeshare
        external:
          location: >
            gs://public-austin-bikeshare-sample-data/*
          options:
            max_staleness: INTERVAL 1 HOUR
            metadata_cache_mode: AUTOMATIC
            compression: GZIP
            connection_name: your-project-id.us.your-connection-id
            format: NEWLINE_DELIMITED_JSON
            hive_partition_uri_prefix: >
              gs://public-austin-bikeshare-sample-data/
            ignore_unknown_values: true
          partitions:
            - name: district
              data_type: INTEGER
        columns:
          - name: station_id
            data_type: INTEGER
          - name: name
            data_type: STRING
          - name: status
            data_type: STRING
          - name: address
            data_type: STRING
          - name: alternate_name
            data_type: STRING
          - name: city_asset_number
            data_type: INTEGER
          - name: property_type
            data_type: STRING
          - name: number_of_docks
            data_type: INTEGER

Run:

dbt run-operation stage_external_sources --vars "ext_full_refresh: true" --args "select: hive_metadata_bug.austin_bikeshare"

Expected results

I expect the operation to complete with an exit code of 0 and the table to be created successfully in BigQuery.

Actual results

The table is created in BigQuery without issue (verified in the BigQuery UI job history), but dbt logs the following error:

17:02:47  Encountered an error while running operation: Database Error
  Updating hive partition options for table gcp-project-id:hive_metadata_bug.austin_bikeshare with query acceleration is not supported. Please use DDL CREATE OR REPLACE EXTERNAL TABLE instead.

This tells us that the error comes from a step that runs after the tables are staged. I've narrowed it down to the macros/plugins/bigquery/update_external_table_columns.sql macro, which is what throws the error. If I override this macro, the operation completes successfully.
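
For reference, the create step that succeeds corresponds to DDL along these lines, based on the YAML above (a sketch of the statement shape only; the package's exact rendering may differ):

create or replace external table `gcp-project-id`.`hive_metadata_bug`.`austin_bikeshare` (
    station_id INTEGER,
    name STRING,
    status STRING,
    address STRING,
    alternate_name STRING,
    city_asset_number INTEGER,
    property_type STRING,
    number_of_docks INTEGER
)
with partition columns (
    district INTEGER
)
with connection `your-project-id.us.your-connection-id`
options (
    uris = ['gs://public-austin-bikeshare-sample-data/*'],
    format = 'NEWLINE_DELIMITED_JSON',
    hive_partition_uri_prefix = 'gs://public-austin-bikeshare-sample-data/',
    compression = 'GZIP',
    ignore_unknown_values = true,
    max_staleness = INTERVAL 1 HOUR,
    metadata_cache_mode = 'AUTOMATIC'
)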

Screenshots and log output

This is the result I'm seeing without any changes to the macro configuration:

% dbt run-operation stage_external_sources --vars "ext_full_refresh: true" --args "select: hive_metadata_bug.austin_bikeshare"

16:42:00  Running with dbt=1.7.0
16:42:00  Registered adapter: bigquery=1.7.0
16:42:00  Unable to do partial parsing because config vars, config profile, or config target have changed
16:42:02  Found 49 models, 32 tests, 187 sources, 0 exposures, 0 metrics, 865 macros, 0 groups, 0 semantic models
16:42:02  1 of 1 START external source hive_metadata_bug.austin_bikeshare
16:42:05  1 of 1 (1) create or replace external table `gcp-project-id`.`hive_metadata_bug`.`austin_bikeshare...  
16:42:06  1 of 1 (1) None (0 processed)
16:42:07  Encountered an error while running operation: Database Error
  Updating hive partition options for table gcp-project-id:hive_metadata_bug.austin_bikeshare with query acceleration is not supported. Please use DDL CREATE OR REPLACE EXTERNAL TABLE instead.

% echo $?
1

When I override the bigquery__update_external_table_columns macro to do nothing, I get a successful result:

macros/update_external_table_columns.sql:

{% macro bigquery__update_external_table_columns(source_node) %}
    {# Intentionally empty: skips the post-create column update that raises the error above #}
{% endmacro %}

Results:

% dbt run-operation stage_external_sources --vars "ext_full_refresh: true" --args "select: hive_metadata_bug.austin_bikeshare"

16:49:27  Running with dbt=1.7.0
16:49:27  Registered adapter: bigquery=1.7.0
16:49:27  Unable to do partial parsing because config vars, config profile, or config target have changed
16:49:29  Found 49 models, 32 tests, 187 sources, 0 exposures, 0 metrics, 866 macros, 0 groups, 0 semantic models
16:49:29  1 of 1 START external source hive_metadata_bug.austin_bikeshare
16:49:33  1 of 1 (1) create or replace external table `gcp-project-id`.`hive_metadata_bug`.`austin_bikeshare...  
16:49:34  1 of 1 (1) None (0 processed)

% echo $?
0
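
A narrower workaround may also be possible: skip the column update only when metadata caching is enabled, and otherwise fall through to the package's implementation. A sketch of that idea (assuming source_node.external.options carries the options dict from the YAML above):

{% macro bigquery__update_external_table_columns(source_node) %}
    {# Sketch: skip only the metadata-caching case that triggers the error;
       otherwise call the package's original macro directly. #}
    {%- set options = source_node.external.options or {} -%}
    {%- if options.get('metadata_cache_mode') is none -%}
        {{ dbt_external_tables.bigquery__update_external_table_columns(source_node) }}
    {%- endif -%}
{% endmacro %}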

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0

  - package: dbt-labs/codegen
    version: 0.8.1

  - package: dbt-labs/dbt_external_tables
    version: 0.9.0

  - package: calogica/dbt_expectations
    version: 0.8.5

  - package: pvcy/dbt_privacy
    version: 0.3.1

Which database are you using dbt with?

BigQuery

The output of dbt --version:

Core:
  - installed: 1.7.0 
  - latest:    1.7.14 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 1.7.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: macOS 14.3.1

The output of python --version: Python 3.8.19

Additional context