Closed johnson-jay-l closed 1 year ago
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Describe the feature
I am trying to use generate_source() to create source schema.yml files, and persist any description metadata in the Redshift database. We use this both to create the initial schema and to update existing schema.yml files when columns are added or dropped. We spend too much effort maintaining code to "override" the generated descriptions with override files.
Currently, when running generate_source() with generate_columns=True and include_descriptions=True, the description fields in the yaml are placeholders containing empty strings. If a file already exists, we have to use our custom merging logic to update the descriptions with the actual known values. Then we use the dbt docs to publish this info to users. Here is an example with the args used to generate the source schema yaml with descriptions:
dbt run-operation generate_source --args '{"generate_columns": "True", "include_descriptions": "True", "schema_name": "some_schema", "database_name": "some_db"}'
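For illustration, the kind of custom merging logic mentioned above could look roughly like the following. This is only a minimal sketch, not our actual tooling: the function name merge_descriptions is hypothetical, and it assumes both yaml files have already been parsed into dicts with the usual sources / tables / columns layout.

```python
from copy import deepcopy


def merge_descriptions(generated: dict, overrides: dict) -> dict:
    """Return a copy of `generated` with blank descriptions filled from `overrides`.

    Hypothetical helper: both arguments are assumed to be parsed dbt source
    yaml (sources -> tables -> columns), not raw file contents.
    """
    merged = deepcopy(generated)
    # Index the override tables by (source name, table name) for quick lookup.
    known = {
        (src["name"], tbl["name"]): tbl
        for src in overrides.get("sources", [])
        for tbl in src.get("tables", [])
    }
    for src in merged.get("sources", []):
        for tbl in src.get("tables", []):
            old = known.get((src["name"], tbl["name"]))
            if old is None:
                continue  # new table: keep the generated placeholders
            # Only fill in blanks; never overwrite a non-empty description.
            if not tbl.get("description"):
                tbl["description"] = old.get("description", "")
            old_cols = {c["name"]: c for c in old.get("columns", [])}
            for col in tbl.get("columns", []):
                if not col.get("description") and col["name"] in old_cols:
                    col["description"] = old_cols[col["name"]].get("description", "")
    return merged


# Freshly generated yaml: empty-string placeholders, one newly added column.
generated = {"version": 2, "sources": [{"name": "some_schema", "tables": [
    {"name": "orders", "description": "", "columns": [
        {"name": "id", "description": ""},
        {"name": "status", "description": ""},
    ]},
]}]}

# Previously checked-in yaml: known descriptions, one column since dropped.
overrides = {"version": 2, "sources": [{"name": "some_schema", "tables": [
    {"name": "orders", "description": "One row per order", "columns": [
        {"name": "id", "description": "Primary key"},
        {"name": "legacy_flag", "description": "No longer in the table"},
    ]},
]}]}

merged = merge_descriptions(generated, overrides)
```

Because the generated structure drives the loop, dropped columns disappear and new columns keep their empty placeholder, which matches the add/drop behavior described above.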
I would like for generate_source() to also pull table and column metadata from Redshift comment fields, and include it in the generated yaml instead of the empty strings that are currently used as the default descriptions.
Then, separately in dbt-core, I would like to have the persist_docs config extended to support dbt sources, so that the source schema.yml files can include any manually overridden descriptions. Those descriptions would be checked into our dbt git repo, deployed, and written back to the db. (This would need a separate PR in the dbt-core repo, or our own custom logic.)
This would enable a closed-loop workflow that looks like this:
1. generate_source() to pull the latest table/column description metadata from the Redshift cluster and generate the yaml files
2. persist_docs config to write the description data to Redshift comments (requires another PR in that repo)
Describe alternatives you've considered
Scripting out a cli tool with bash + python + macros to maintain manual "override" files for each generated yaml file. The manual files are merged into the generated yaml files before checking them in and rendering them in the dbt docs.
It is too much work to maintain this, and it would be very convenient for generate_source() to do this from one command.
Additional context
I am interested mostly in Redshift. I am not sure what approach other databases take to persist table and column descriptions. But it could probably be done for other databases too.
Here is an example query to get the table and column descriptions from Redshift:
Who will this benefit?
Are you interested in contributing this feature?
I am open to contributing but may need some guidance for testing, and help with the related changes on the dbt-core side to enable persist_docs for sources. But I am not sure what kind of effort persist_docs would take for sources vs. writing our own logic to sync yaml descriptions with our db.
Even without the related dbt-core changes, we would still get a lot of value from the dbt-codegen changes as described above.
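To give a sense of the "own logic" option for syncing yaml descriptions back to the db, here is a rough sketch that turns parsed source yaml into COMMENT ON statements. It is an assumption-laden illustration, not how persist_docs works internally: the function names are hypothetical, schema and table names are assumed to be simple identifiers that need no quoting, and actually executing the statements against Redshift is left out.

```python
from typing import List


def quote_literal(text: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def comment_statements(source_yaml: dict) -> List[str]:
    """Build COMMENT ON statements for every non-empty description.

    Hypothetical helper: `source_yaml` is assumed to be parsed dbt source
    yaml (sources -> tables -> columns).
    """
    statements = []
    for src in source_yaml.get("sources", []):
        # A dbt source's schema defaults to the source name when not set.
        schema = src.get("schema", src["name"])
        for tbl in src.get("tables", []):
            relation = schema + "." + tbl["name"]
            if tbl.get("description"):
                statements.append(
                    f"COMMENT ON TABLE {relation} IS {quote_literal(tbl['description'])};"
                )
            for col in tbl.get("columns", []):
                if col.get("description"):  # skip empty placeholders
                    statements.append(
                        f"COMMENT ON COLUMN {relation}.{col['name']} "
                        f"IS {quote_literal(col['description'])};"
                    )
    return statements


example = {"version": 2, "sources": [{"name": "some_schema", "tables": [
    {"name": "orders", "description": "One row per order", "columns": [
        {"name": "id", "description": "Customer's order key"},
        {"name": "status", "description": ""},
    ]},
]}]}

statements = comment_statements(example)
for statement in statements:
    print(statement)
```

Skipping empty descriptions means a placeholder in the yaml never wipes out a comment that already exists in the database, which seems like the safer default for a two-way sync.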