dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
288 stars 116 forks source link

Persist source table and column descriptions #248

Closed katieclaiborne closed 2 months ago

katieclaiborne commented 6 months ago

Describe the feature

If source table and column descriptions are configured in dbt, I'd like for them to be persisted in the database when the stage_external_sources operation is run.

Describe alternatives you've considered

Configuring source table and column descriptions does make them visible in dbt Docs and dbt Explorer, which is helpful.

Additional context

We're working in BigQuery, but I could imagine similar use cases in other databases.

I'm not sure if this is a known limitation, given that the persist_docs config is not implemented for sources. It could also be that we've just missed a step in configuring and staging our external sources!

Who will this benefit?

This feature will be helpful for users discovering sources through the database, rather than dbt Docs or dbt Explorer.

LoHertel commented 5 months ago

I was looking for a way in dbt to add table and column descriptions to external tables in BigQuery as well. The current implementation of this package does not consider descriptions at all (not only for BigQuery).

The BigQuery DDL supports descriptions for external tables:

CREATE EXTERNAL TABLE `project`.`dataset`.`table_name` (
  order_id INTEGER OPTIONS(description = "Unique identifier of the order (primary key).")
)
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://gcs-name/orders.parquet'],
  description = "Contains all order line items."
);

In a naive approach, the stage_external_sources operation could always add descriptions to external tables, when they are set in the source properties.

The following mock code changes might illustrate this approach:

Add column descriptions

https://github.com/dbt-labs/dbt-external-tables/blob/21428bce9ea3aed7ac6900d11b158bbe80fb56b0/macros/plugins/bigquery/create_external_table.sql#L19-L25

         create or replace external table {{source(source_node.source_name, source_node.name)}} 
                 {%- if columns -%}(
                 {% for column in columns %}
                     {%- set column_quoted = adapter.quote(column.name) if column.quote else column.name %}
+                    {%- set column_options = 'options(description = "' + column.description + '")' if 'description' in column else '' %}
-                    {{column_quoted}} {{column.data_type}} {{- ',' if not loop.last -}}
+                    {{column_quoted}} {{column.data_type}} {{column_options}} {{- ',' if not loop.last -}}
                 {%- endfor -%}
                 )

Add table description

https://github.com/dbt-labs/dbt-external-tables/blob/21428bce9ea3aed7ac6900d11b158bbe80fb56b0/macros/plugins/bigquery/create_external_table.sql#L37-L48

         options ( 
             uris = [{%- for uri in uris -%} '{{uri}}' {{- "," if not loop.last}} {%- endfor -%}] 
+            {%- if 'description' in external %} 
+            , description = "{{external.description}}"
+            {%- endif -%} 
             {%- if options is mapping -%} 

Another approach would be to add the persist_docs option for sources in dbt-core first and then build the feature on top of it in this package. I'm just not quite sure, whether it makes sense to add this option to dbt-core, because dbt-core won't make use of it at all. The sole purpose would be to enable persistence for descriptions in the dbt-external-tables dbt package.

Do you have any thoughts on it? Just add descriptions always, or consider (currently not existing) config options?

thomas-vl commented 5 months ago

I've started on a fix setting descriptions on columns in this PR: https://github.com/dbt-labs/dbt-external-tables/pull/252

katieclaiborne commented 5 months ago

Another approach would be to add the persist_docs option for sources in dbt-core first and then build the feature on top of it in this package. I'm just not quite sure, whether it makes sense to add this option to dbt-core, because dbt-core won't make use of it at all. The sole purpose would be to enable persistence for descriptions in the dbt-external-tables dbt package.

Do you have any thoughts on it? Just add descriptions always, or consider (currently not existing) config options?

@LoHertel, I agree with your instinct that persisting source table descriptions is specific to the dbt-external-tables package, and would lean toward building the capability here rather than in dbt-core. External source tables seem like an exception to the rule, in that they can be configured within and produced by dbt. All other source tables are simply taken as inputs.

thomas-vl commented 5 months ago

My PR fixes it in the same way as it works on the Provider without additional configuration.

dataders commented 2 months ago

resolved by: #252

@katieclaiborne do you want to get the new beta release a try?

katieclaiborne7 commented 2 months ago

@dataders, new handle, same me!

Apologies for the delay, but I've confirmed that external table column descriptions are persisted in BigQuery on version 0.9.0. Thanks again for this work!

LoHertel commented 1 month ago

@dataders, awesome, that column descriptions for BigQuery are available now 🚀

This issue addresses table descriptions as well, which have not been introduced by #252. Therefore, I would like to ask to reopen this issue, because it is not fully resolved.

By table description I mean the string, which describes a source table in a yaml property file:

version: 2

sources:
  - name: olist
    tables:
      - name: orders
        description: |
          Contains the order head information.
          Each order could contain multiple line items, which are stored in the order_items table.

        external:
          location: 'gs://gcs-name/orders.parquet/*'
          options:
            format: 'PARQUET'
            hive_partition_uri_prefix: 'gs://gcs-name/orders.parquet'
            require_hive_partition_filter: false
          partitions:
            - name: year
              data_type: INTEGER

        columns:
          - name: order_id
            description: Unique identifier of the order (primary key).
            data_type: STRING

It would be great, if the table description is added to the OPTIONS block at the end of the CREATE EXTERNAL TABLE statement for the BigQuery dialect:

CREATE EXTERNAL TABLE `project`.`dataset`.`orders` (
  order_id STRING OPTIONS(description = "Unique identifier of the order (primary key).")
)
WITH PARTITION COLUMNS (
  year INTEGER
)
OPTIONS (
  uris = ['gs://gcs-name/orders.parquet/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://gcs-name/orders.parquet',
  require_hive_partition_filter = false,
  description = 'Contains the order head information.\nEach order could contain multiple line items, which are stored in a seperate table.'
);

The stage_external_sources operation would need to check for the existence of a table description in the dbt source properties and append it to the OPTIONS SQL block.

Would this be a sensible way of implementing it? I could contribute here.