dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Add support to infer schemas on BigQuery #249

Closed · jbergeskans closed this 3 months ago

jbergeskans commented 6 months ago

Describe the feature

Allow the BigQuery plugin to infer schemas when the user has chosen to document the external table for dbt docs (descriptions, constraints, etc.) but does not wish to also add data types for every column.

Describe alternatives you've considered

Additional context

Currently, if a user adds column descriptions to a sources file, the package generates the table creation query from the column information available in that file. If no data_type is set, the table creation fails with the error message:

15:10:27  Encountered an error while running operation: Database Error
  Type not found: None at [x:y]

This is because the generated query will look like this:

create or replace external table `my_project`.`my_dataset`.`some_table_name`(

        pk_col None,
        data_col None,
        loaded_date None)

with partition columns (
        loaded_date date) 

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

This happens because of this code block.

This is already implemented in the Snowflake plugin, and I don't see why the BigQuery fix needs to be more complicated than allowing infer_schema to be passed in and checking its truthiness.
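To make that concrete, here's a minimal sketch of what such a check might look like in the BigQuery create statement, assuming the fix simply drops the column list when infer_schema is truthy and lets BigQuery detect the schema from the Parquet files itself. This is hypothetical, not the package's actual macro code; external, columns, and relation are stand-in names, and partition handling is omitted:

{# hypothetical sketch: not the actual dbt-external-tables macro #}
create or replace external table {{ relation }}
{%- if columns and not external.infer_schema %} (
    {%- for column in columns %}
    {{ column.name }} {{ column.data_type }}{{ ',' if not loop.last }}
    {%- endfor %}
)
{%- endif %}
options (
        uris = ['gs://bucket/folder/*'], format = 'parquet')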

Who will this benefit?

This will allow users to infer the schema while still surfacing descriptions and constraints in the dbt docs.

jbergeskans commented 6 months ago

Willing to create a PR for this, assuming the fix is as straightforward as it seems.

lassebenni commented 5 months ago

I second this feature!

thomas-vl commented 3 months ago

This feels conflicting to me: you want to infer the schema automatically, but you also want to manually add the column names for documentation.

I see two problems with this setup:

  1. The inferred schema in BigQuery might differ from the columns you entered manually, so the documentation no longer reflects reality.
  2. What happens when you apply a data test to a column that is not inferred in BigQuery because it has been removed from the parquet file?

dataders commented 3 months ago

@jbergeskans I agree that manually specifying column names and datatypes can be a pain!

I looked at #259, but the implementation is different from Snowflake's. For dbt-snowflake's version, if infer_schema: true, then the following happens:

  1. a one-off select * from table( infer_schema( location=>'{{external.location}}', file_format=>'{{external.file_format}}') ) is executed with the results stored as columns_infer (source: macros/plugins/snowflake/create_external_table.sql#L8-L15)
  2. the results of columns_infer are iterated over to generate the correct datatypes for the CREATE EXTERNAL TABLE statement (source)
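Paraphrasing those two steps as a sketch (this is not the actual macro source; it assumes dbt's run_query and the COLUMN_NAME / TYPE fields of INFER_SCHEMA's output, and it leaves out the column expressions and table options the real macro emits):

{# paraphrased sketch of the two steps above: not the real macro code #}
{%- set infer_schema_sql -%}
    select * from table(
        infer_schema(
            location => '{{ external.location }}',
            file_format => '{{ external.file_format }}'
        )
    )
{%- endset -%}
{%- set columns_infer = run_query(infer_schema_sql) -%}

create or replace external table {{ relation }} (
    {%- for row in columns_infer.rows %}
    "{{ row['COLUMN_NAME'] }}" {{ row['TYPE'] }}{{ ',' if not loop.last }}
    {%- endfor %}
)
{# ...plus the column expressions and options the real macro generates #}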

I agree with @thomas-vl and am conflicted about your implementation. However, I'm not in love with the precedent set by Snowflake's implementation either.

@jbergeskans, am I right in guessing that if the sources.yml column names and types were generated for you, then you wouldn't need schema inference? Especially in your case, where you're happy to enumerate the column names because it provides value?

What you are not looking for is something that would allow for schema evolution of an external table, as in: if the schema of the actual external files changes, dbt shouldn't flinch and should just make the table accordingly.

My gut tells me that we DON'T want auto schema evolution -- data changing silently should not be a feature, right?

I've also opened https://github.com/dbt-labs/dbt-codegen/issues/173, which feels like the "right solution"™️ to me.

jbergeskans commented 3 months ago

@dataders I fully agree that silently changing the schema is not something that should ever be allowed. And yes, I would be happy enough if this were generated for me, so that I could easily go in and add tests, apply policy tags, and add constraints where needed.

In this case, what we had was a spreadsheet with column names and their descriptions, so I wrote a script to generate the YAML file from the spreadsheet, and that's when we ran into this issue. But if I could have the field names and data types generated for me, I'd 100% take that.