databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Databricks truncates datatypes returned via `DESCRIBE EXTENDED` which is used by get_columns_in_relation() #779

Open ShaneMazur opened 1 month ago

ShaneMazur commented 1 month ago

Describe the bug

I can't speak to the full impact of this bug, but I ran into it while using on_schema_change="sync_all_columns".

The truncated data types end up in the queries that build the ALTER statements dbt issues when a column's data type changes.


Current Behaviour

This happens because the following statement truncates long data types:

DESCRIBE EXTENDED <catalog>.<schema>.<table>

Truncated field example using DESCRIBE EXTENDED

struct<_info:struct<fieldA:string,fieldB:string>,fieldC:bigint,fieldD:string>,... 78 more fields>
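A quick way to spot affected columns is to look for the abbreviation marker in the type string. The sketch below is illustrative only: the helper name `is_truncated_type` is hypothetical, and the marker wording is assumed from the single example above, so it may not cover every form the truncation takes.

```python
import re

# Marker wording assumed from the example above ("... 78 more fields").
_TRUNCATION_MARKER = re.compile(r"\.\.\. \d+ more fields?")


def is_truncated_type(data_type: str) -> bool:
    """Return True if a DESCRIBE-style type string looks abbreviated."""
    return _TRUNCATION_MARKER.search(data_type) is not None
```

A type string that contains no such marker is treated as complete.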

Requested Behaviour

Ideally, dbt-databricks would use the query below to get that information instead, since it does not truncate data types:

select
    column_name,
    full_data_type,
    comment
from <catalog>.information_schema.columns
where table_schema = '<schema>' and table_name = '<table>'
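For anyone who wants to try the query from Python, here is a minimal sketch that builds it with bound parameters rather than string interpolation. The function name `build_columns_query` is hypothetical, and the `:name` parameter-marker style is an assumption; adjust it to whatever your SQL driver accepts. The catalog is an identifier and cannot be bound as a parameter, so it is backtick-quoted after a basic check.

```python
def build_columns_query(catalog: str, schema: str, table: str) -> tuple[str, dict[str, str]]:
    """Build the information_schema lookup and its bound parameters."""
    # Identifiers cannot be passed as parameters, so reject anything that
    # could break out of the backtick quoting.
    if not catalog or "`" in catalog:
        raise ValueError(f"unsafe catalog name: {catalog!r}")
    sql = (
        "select column_name, full_data_type, comment\n"
        f"from `{catalog}`.information_schema.columns\n"
        "where table_schema = :schema and table_name = :table"
    )
    # The ':schema' / ':table' marker style is an assumption about the driver.
    return sql, {"schema": schema, "table": table}
```

The returned pair can be handed to a cursor's execute call in whichever form your driver expects.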

Steps To Reproduce

  1. Have a very complex (long datatype) struct field in your dataset
  2. Run any operation in dbt-databricks that looks up the datatype of that field via get_columns_in_relation()
  3. You will observe that the struct field you created has a truncated datatype

Expected Behaviour

  1. Have a very complex (long datatype) struct field in your dataset
  2. Run any operation in dbt-databricks that looks up the datatype of that field via get_columns_in_relation()
  3. You will observe that the struct field you created does not have a truncated datatype

System information

Core:
  - installed: 1.8.5
  - latest:    1.8.5 - Up to date!

Plugins:
  - databricks: 1.8.5 - Up to date!
  - spark:      1.8.0 - Up to date!

Additional context

benc-db commented 1 week ago

Thanks for reporting, will investigate

benc-db commented 1 week ago

Need to reopen the issue. I tried to implement the suggested fix and discovered that there is often sync latency between UC and Delta that causes the information_schema to be out of date. I can fix that by forcing a sync, but only if the table is Delta. The fix is more complicated than what I originally implemented, so I'm reopening this issue.
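To make the trade-off in this comment concrete, here is a rough sketch of one possible control flow. It is not the adapter's implementation. Every name in it is hypothetical, and the three callables stand in for whatever actually runs DESCRIBE EXTENDED, forces the metadata sync, and queries information_schema. For non-Delta tables the sketch returns the described types unchanged, which is a placeholder choice since the thread leaves that case open.

```python
import re
from typing import Callable, Dict

Columns = Dict[str, str]  # column name -> data type string


def resolve_column_types(
    describe: Callable[[], Columns],
    query_information_schema: Callable[[], Columns],
    force_sync: Callable[[], None],
    is_delta: bool,
) -> Columns:
    """Prefer DESCRIBE output; fall back to information_schema only when needed."""
    described = describe()
    # Marker wording assumed from the example in the original report.
    truncated = {
        name
        for name, dtype in described.items()
        if re.search(r"\.\.\. \d+ more fields?", dtype)
    }
    if not truncated or not is_delta:
        # Nothing to repair, or no way to force a sync for this table.
        return dict(described)
    # Sync first so information_schema is not stale, then look up full types.
    force_sync()
    full = query_information_schema()
    return {
        name: full.get(name, dtype) if name in truncated else dtype
        for name, dtype in described.items()
    }
```

Only paying for the sync and the extra query when a truncated type is actually present keeps the common path as cheap as it is today.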