databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0
211 stars 112 forks source link

Persist column `meta` data in delta tables #340

Open casperdamen123 opened 1 year ago

casperdamen123 commented 1 year ago

Describe the feature

The meta key/value pairs should be persisted and pushed to the delta table when materializing dbt models.

image

Describe alternatives you've considered

The alternative is to add a custom Python model and leverage PySpark to add this metadata to a specific column df.withMetadata("column_name", {"key_1": "value_2", "key_2": value_2}). However, I would like to keep this kind of details in the .yml configuration files and avoid using additional Python models to only add metadata to columns.

Additional context

I've tried to add it to the description field since I've noticed this is persisted as a comment. However, the key name of the metadata for the column will then always be comment and all the information needs to be inputed in that same key. I would like to have several metadata keys holding different types of information.

image image

Who will this benefit?

All dbt users that leverage the power of metadata on delta tables in Databricks.

benc-db commented 1 year ago

@andrefurlan-db would appreciate your thoughts on this one.

benc-db commented 11 months ago

@casperdamen123 did some research and one option is to specify your metadata as table properties: https://docs.databricks.com/en/delta/custom-metadata.html#store-custom-tags-in-table-properties

they won't be column scoped in your yml file but you could do something like this:

models:
  - name: model_name
    config:
       tblproperties:
         - "key": "value"

where key could be something like columnName.propertyName

casperdamen123 commented 11 months ago

@benc-db that looks promising. This is already implemented in dbt-databricks at the moment or do we need to build this?

benc-db commented 11 months ago

Should work already. Let me know if you hit any issues.

casperdamen123 commented 11 months ago

@benc-db did some testing but it doesn't seem to work just yet.

version: 2

models:
  - name: test
    description: "This is a test"

    config: 
      tblproperties:
         - "test.type": "geometry"

    columns:
      - name: test
        description: "test column"

Results in:

06:12:28            create or replace table `gima_test`.`test`.`test`
06:12:28          
06:12:28          
06:12:28        using delta
06:12:28          
06:12:28          
06:12:28          
06:12:28          
06:12:28          
06:12:28          comment 'This is a test'
06:12:28          
06:12:28          
06:12:28        tblproperties ('{'test.type': 'geometry'}' = '' 
06:12:28    --------------------------^^^
06:12:28        )
06:12:28          as
06:12:28          
06:12:28    select "test" as test

When not using the table.column notation

version: 2

models:

  - name: test
    description: "This is a test"

    config: 
      tblproperties:
         - "some_key": "geometry"

    columns:
      - name: test
        description: "test column"

Results in:

06:14:30            create or replace table `gima_test`.`test`.`test`
06:14:30          
06:14:30          
06:14:30        using delta
06:14:30          
06:14:30          
06:14:30          
06:14:30          
06:14:30          
06:14:30          comment 'This is a test'
06:14:30          
06:14:30          
06:14:30        tblproperties ('{'some_key': 'geometry'}' = '' 
06:14:30    ------------------------------^^^
06:14:30        )
06:14:30          as
06:14:30          
06:14:30    select "test" as test
benc-db commented 11 months ago

hmm, well I know we're using it in successfully in the model files: https://github.com/databricks/dbt-databricks/blob/3609a6ed5375848ea942036b2735fb6efc76101c/tests/integration/set_tblproperties/models/set_tblproperties.sql#L4. Could you try that and let me know if you get the same error?

benc-db commented 11 months ago

Ok, so here's what's going on...it's yaml being weird. if you specify like this:

tblproperties: { "column.meta": "value" }

it should work (as opposed to the yaml expression of a dictionary).

I'll see if I can get to the bottom of this, but in the mean time, can you try that work around?

benc-db commented 11 months ago

- "some_key": "geometry" I think this is the issue...sets the value of tblproperties to be an array with a dictionary as an entry, instead of:

tblproperties:
   "some_key" : "geometry"

my mistake for giving you bad yaml.

benc-db commented 11 months ago

I'm going to update the docs on dbt to include info on how to set tblproperties.

casperdamen123 commented 11 months ago

Tried to use the following as per your example

version: 2

models:

  - name: test
    description: "This is a test"

    config: 
      tblproperties:
         "test.test": "geometry"

    columns:
      - name: test
        description: "test column"

The model run passes but the metadata for the test column is not populated

image

csimplestring commented 8 months ago

Hi @benc-db , the solution you proposed does not work. using tblproperties: { "column.meta": "value" } nothing populated in Data Explorer UI

github-actions[bot] commented 2 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.

casperdamen123 commented 2 months ago

keep alive