Add support for bigquery schema inference

jbergeskans commented 4 months ago

Description & motivation

Fixes #249 to support schema inference on BigQuery.

Currently, if the column names are defined in the sources file but the data types are not, the following SQL will be generated:

create or replace external table `my_project`.`my_dataset`.`some_table_name`(

        pk_col None,
        data_col None,
        loaded_date None)

with partition columns (
        loaded_date date) 

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

In order for BigQuery to infer the schema, the column names and data types needs to be omitted:

create or replace external table `my_project`.`my_dataset`.`some_table_name`

with partition columns (
        loaded_date date) 

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

This has been achieved by introducing the variable infer_schema. When set to true, it will not iterate over the list of columns. Example source file

sources:
  - name: my_source
    project: my_project
    dataset: my_dataset
    loader: dbt_external_tables
    loaded_at_field: loaded_date
    tables:
      - name: infered_table
        external:
          location: "gs://bucket/folder/*"
          infer_schema: true
          options:
            format: parquet
            hive_partition_uri_prefix: "gs://bucket/folder/"
          partitions:
            - name: loaded_date
              data_type: date
        columns:
          - name: id
            description: my id
          - name: value
            description: my value

Checklist

[x] I have verified that these changes work locally
[ ] I have updated the README.md (if applicable)
[ ] I have added an integration test for my fix/feature (if applicable)

thomas-vl commented 4 months ago

@jbergeskans why would you list the columns if you do not want to explicitly set them? You can leave the columns array blank and achieve the same result?

jbergeskans commented 3 months ago

@jbergeskans why would you list the columns if you do not want to explicitly set them? You can leave the columns array blank and achieve the same result?

We want to use other documentation features such as description, tests, and constraints. Basically, this allows us to omit the data type field which, when you're using parquet files, isn't needed anyway.

thomas-vl commented 3 months ago

We want to use other documentation features such as description, tests, and constraints. Basically, this allows us to omit the data type field which, when you're using parquet files, isn't needed anyway.

For me this feels very conflicting, you want to infer the schema automatically but do want to manually add the column names for documentation.

I see two problems with this setup:

The inferred schema in BigQuery might be different than the columns you put in manually so the documentation no longer reflects reality.
What will happen when you apply a data test on a column that is not inferred in BigQuery because its removed from the parquet file?

dbt-labs / dbt-external-tables

Add support for bigquery schema inference #259

Description & motivation

Checklist