dbt-labs / dbt-codegen

Macros that generate dbt code
https://hub.getdbt.com/dbt-labs/codegen/latest/
Apache License 2.0
459 stars 99 forks source link

Generate Source, but for external files #173

Open dataders opened 4 months ago

dataders commented 4 months ago

Describe the feature

From https://github.com/dbt-labs/dbt-external-tables/issues/249#issuecomment-2048321896

If I have to add an external table as a source in dbt, I use the dbt-external-tables package.

Most platforms offer an infer_schema choice where can query a file in a given location without specifying any column names or types (see below example). This is convenient, but not necessarily dbt-onic, where we normally document all of our source columns.

With https://github.com/dbt-labs/dbt-external-tables/pull/211, dbt-external-tables now offers users an infer_schema config, that will, for a given external table defined in sources.yml, first fetch the column names and types, then use the metadata to populate the CREATE EXTERNAL TABLE statement. The problem with this approach is that while the user can then refer within a model to external table with {{ source()}}, the sources.yml is still missing important column information.

I propose instead of the introspective schema-inferring query that dbt-external-tables does to template the DDL, a more robust approach (imho) would be to use codegen.

  1. Invoke codegen to generate YAML for the external table columns (names and types)
  2. Add information to sources.yml
  3. invoke dbt-external-tables to create the defined external table (without need to specify infer_schema: True
select * from table(
    infer_schema(
        location=>'{{external.location}}',
        file_format=>'{{external.file_format}}'
    )
)