Closed katieclaiborne closed 7 months ago
I was looking for a way in dbt to add table and column descriptions to external tables in BigQuery as well. The current implementation of this package does not consider descriptions at all (not only for BigQuery).
The BigQuery DDL supports descriptions for external tables:
CREATE EXTERNAL TABLE `project`.`dataset`.`table_name` (
order_id INTEGER OPTIONS(description = "Unique identifier of the order (primary key).")
)
OPTIONS (
format = 'PARQUET',
uris = ['gs://gcs-name/orders.parquet'],
description = "Contains all order line items."
);
In a naive approach, the stage_external_sources
operation could always add descriptions to external tables, when they are set in the source properties.
The following mock code changes might illustrate this approach:
create or replace external table {{source(source_node.source_name, source_node.name)}}
{%- if columns -%}(
{% for column in columns %}
{%- set column_quoted = adapter.quote(column.name) if column.quote else column.name %}
+ {%- set column_options = 'options(description = "' + column.description + '")' if 'description' in column else '' %}
- {{column_quoted}} {{column.data_type}} {{- ',' if not loop.last -}}
+ {{column_quoted}} {{column.data_type}} {{column_options}} {{- ',' if not loop.last -}}
{%- endfor -%}
)
options (
uris = [{%- for uri in uris -%} '{{uri}}' {{- "," if not loop.last}} {%- endfor -%}]
+ {%- if 'description' in external %}
+ , description = "{{external.description}}"
+ {%- endif -%}
{%- if options is mapping -%}
Another approach would be to add the persist_docs
option for sources in dbt-core first and then build the feature on top of it in this package. I'm just not quite sure, whether it makes sense to add this option to dbt-core, because dbt-core won't make use of it at all. The sole purpose would be to enable persistence for descriptions in the dbt-external-tables
dbt package.
Do you have any thoughts on it? Just add descriptions always, or consider (currently not existing) config options?
I've started on a fix setting descriptions on columns in this PR: https://github.com/dbt-labs/dbt-external-tables/pull/252
Another approach would be to add the persist_docs option for sources in dbt-core first and then build the feature on top of it in this package. I'm just not quite sure, whether it makes sense to add this option to dbt-core, because dbt-core won't make use of it at all. The sole purpose would be to enable persistence for descriptions in the dbt-external-tables dbt package.
Do you have any thoughts on it? Just add descriptions always, or consider (currently not existing) config options?
@LoHertel, I agree with your instinct that persisting source table descriptions is specific to the dbt-external-tables package, and would lean toward building the capability here rather than in dbt-core. External source tables seem like an exception to the rule, in that they can be configured within and produced by dbt. All other source tables are simply taken as inputs.
My PR fixes it in the same way as it works on the Provider without additional configuration.
resolved by: #252
@katieclaiborne do you want to get the new beta release a try?
@dataders, new handle, same me!
Apologies for the delay, but I've confirmed that external table column descriptions are persisted in BigQuery on version 0.9.0. Thanks again for this work!
@dataders, awesome, that column descriptions for BigQuery are available now 🚀
This issue addresses table descriptions as well, which have not been introduced by #252. Therefore, I would like to ask to reopen this issue, because it is not fully resolved.
By table description I mean the string, which describes a source table in a yaml property file:
version: 2
sources:
- name: olist
tables:
- name: orders
description: |
Contains the order head information.
Each order could contain multiple line items, which are stored in the order_items table.
external:
location: 'gs://gcs-name/orders.parquet/*'
options:
format: 'PARQUET'
hive_partition_uri_prefix: 'gs://gcs-name/orders.parquet'
require_hive_partition_filter: false
partitions:
- name: year
data_type: INTEGER
columns:
- name: order_id
description: Unique identifier of the order (primary key).
data_type: STRING
It would be great, if the table description is added to the OPTIONS
block at the end of the CREATE EXTERNAL TABLE
statement for the BigQuery dialect:
CREATE EXTERNAL TABLE `project`.`dataset`.`orders` (
order_id STRING OPTIONS(description = "Unique identifier of the order (primary key).")
)
WITH PARTITION COLUMNS (
year INTEGER
)
OPTIONS (
uris = ['gs://gcs-name/orders.parquet/*'],
format = 'PARQUET',
hive_partition_uri_prefix = 'gs://gcs-name/orders.parquet',
require_hive_partition_filter = false,
description = 'Contains the order head information.\nEach order could contain multiple line items, which are stored in a seperate table.'
);
The stage_external_sources
operation would need to check for the existence of a table description in the dbt source properties and append it to the OPTIONS
SQL block.
Would this be a sensible way of implementing it? I could contribute here.
Describe the feature
If source table and column descriptions are configured in dbt, I'd like for them to be persisted in the database when the
stage_external_sources
operation is run.Describe alternatives you've considered
Configuring source table and column descriptions does make them visible in dbt Docs and dbt Explorer, which is helpful.
Additional context
We're working in BigQuery, but I could imagine similar use cases in other databases.
I'm not sure if this is a known limitation, given that the persist_docs config is not implemented for sources. It could also be that we've just missed a step in configuring and staging our external sources!
Who will this benefit?
This feature will be helpful for users discovering sources through the database, rather than dbt Docs or dbt Explorer.