dbt-labs / dbt-codegen

Macros that generate dbt code
https://hub.getdbt.com/dbt-labs/codegen/latest/
Apache License 2.0

Retrieve column and table comments for Redshift tables when running generate_source() with include_descriptions = True #81

Closed johnson-jay-l closed 1 year ago

johnson-jay-l commented 1 year ago

Describe the feature

I am trying to use generate_source() to create source schema.yml files that carry over any description metadata stored in the Redshift database. We use this both to create the initial schema.yml and to update existing ones when columns are added or dropped. Today we spend too much effort maintaining code that "overrides" the generated descriptions with separate override files.

Currently, when running generate_source() with generate_columns=True and include_descriptions=True, the description fields in the generated YAML are placeholders containing empty strings. If a file already exists, we have to run our custom merging logic to fill the descriptions in with the actual known values, and then publish that information to users via dbt docs. Here is an example invocation that generates the source schema YAML with description placeholders:

dbt run-operation generate_source --args '{"generate_columns": "True", "include_descriptions": "True", "schema_name": "some_schema", "database_name": "some_db"}'

I would like for generate_source() to also pull table and column metadata from Redshift comment fields, and include it in the generated yaml instead of the empty strings that are currently defaulted in for the descriptions.

Then separately, in dbt-core, I would like the persist_docs config extended to support sources, so that any manually overridden descriptions in the source schema.yml files are checked into our dbt git repo, deployed, and written back to the database. (This would require a separate PR in the dbt-core repo, or our own custom logic.)
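For reference, dbt-core already exposes persist_docs for models in dbt_project.yml; the sources stanza below is hypothetical, illustrating the kind of extension being requested:

```yaml
# dbt_project.yml
models:
  my_project:
    +persist_docs:
      relation: true   # write model descriptions as table comments
      columns: true    # write column descriptions as column comments

# Hypothetical extension requested in this issue (not supported today):
sources:
  my_project:
    +persist_docs:
      relation: true
      columns: true
```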

This would enable a closed-loop workflow that looks like this:

  1. Run generate_source() to pull the latest table/column description metadata from the Redshift cluster and generate the yaml files
  2. Modify any table or column description values in the yaml files
  3. Commit -> push -> build. During the build we generate the dbt docs
  4. dbt docs uses the persist_docs config to write the description data to Redshift comments (requires another PR in that repo)
  5. repeat for any future description changes
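For illustration, the step-1 output might look like the sketch below (the table, column, and description values are hypothetical; with the current behavior the description fields would all be empty strings):

```yaml
version: 2

sources:
  - name: some_schema
    database: some_db
    tables:
      - name: orders
        description: "One row per customer order"  # pulled from the Redshift table comment
        columns:
          - name: order_id
            description: "Primary key"             # pulled from the Redshift column comment
```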

Describe alternatives you've considered

Scripting out a CLI tool (bash + Python + macros) that maintains a manual "override" file for each generated YAML file. The override files are merged into the generated YAML files before checking them in and rendering them in dbt docs.

It is too much work to maintain this, and it would be very convenient for generate_source() to do this from one command.
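The merging step of that workaround is roughly equivalent to the sketch below (function and key names are illustrative; the real tool parses the YAML files with a library such as PyYAML, but the sketch operates on already-parsed dicts):

```python
def merge_descriptions(generated, overrides):
    """Copy non-empty descriptions from an override source entry onto a
    freshly generated one, keyed by table name and column name."""
    override_tables = {t["name"]: t for t in overrides.get("tables", [])}
    for table in generated.get("tables", []):
        override = override_tables.get(table["name"])
        if override is None:
            continue
        if override.get("description"):
            table["description"] = override["description"]
        override_cols = {c["name"]: c for c in override.get("columns", [])}
        for col in table.get("columns", []):
            override_desc = override_cols.get(col["name"], {}).get("description")
            if override_desc:
                col["description"] = override_desc
    return generated
```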

Additional context

I am interested mostly in Redshift. I am not sure what approach other databases take to persist table and column descriptions. But it could probably be done for other databases too.

Here is an example query to get the table and column descriptions from Redshift:

WITH tables AS (
    SELECT c.oid,
           ns.nspname AS schema_name,
           c.relname AS table_name,
           d.description AS table_description
    FROM pg_catalog.pg_class AS c
    JOIN pg_catalog.pg_namespace AS ns
      ON c.relnamespace = ns.oid
    LEFT JOIN pg_catalog.pg_description AS d
      ON c.oid = d.objoid
     AND d.objsubid = 0
    WHERE ns.nspname NOT IN ('pg_catalog')
)
SELECT t.oid,
       c.table_schema AS schema_name,
       c.table_name AS table_name,
       t.table_description,
       c.column_name AS column_name,
       c.ordinal_position,
       d.description AS column_description
FROM tables AS t
JOIN information_schema.columns AS c
  ON c.table_schema = t.schema_name
 AND c.table_name = t.table_name
LEFT JOIN pg_catalog.pg_description AS d
  ON d.objoid = t.oid
 AND d.objsubid = c.ordinal_position
 AND d.objsubid > 0
WHERE COALESCE(t.table_description, d.description) IS NOT NULL
ORDER BY t.schema_name,
         t.table_name,
         c.ordinal_position
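Assuming the query above returns one row per (table, column) pair, already sorted by its ORDER BY clause, folding the results into a source definition could look like this sketch (the row shape and function name are my own, not part of dbt-codegen):

```python
def rows_to_source(source_name, rows):
    """Fold (schema, table, table_desc, column, ordinal, column_desc) rows
    into a dbt-style source dict, preserving the query's column order."""
    tables = {}
    for schema, table, table_desc, column, ordinal, column_desc in rows:
        entry = tables.setdefault(table, {
            "name": table,
            "description": table_desc or "",  # fall back to the current empty-string default
            "columns": [],
        })
        entry["columns"].append({
            "name": column,
            "description": column_desc or "",
        })
    return {"name": source_name, "tables": list(tables.values())}
```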

Who will this benefit?

  1. Teams integrating table and column description metadata into an end-to-end deployment workflow that includes dbt docs, using the source schema.yml files as the interface for maintaining database metadata, with the database as the source of truth
  2. Anyone persisting table and column descriptions in the database from the schema.yml files
  3. Anyone defaulting the table and column descriptions in the schema.yml files to the latest values in the database

Are you interested in contributing this feature?

I am open to contributing, but I may need some guidance on testing, as well as help with the related dbt-core changes to enable persist_docs for sources. I am also unsure how the effort of adding persist_docs for sources compares with writing our own logic to sync YAML descriptions with our database.

Even without the related dbt-core changes we would still get a lot of value from the dbt-codegen changes as described above.

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.