databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0
Liquid cluster columns are updated on every run, even when there is no change #802

Open krifra1234 opened 2 months ago

krifra1234 commented 2 months ago

Describe the bug

When using liquid clustering, the cluster columns in the Delta Lake table are updated every time the dbt model is run, even if the cluster columns have not changed in the config.

Steps To Reproduce

Create a dbt model with the following config: materialized='incremental', incremental_strategy='append', liquid_clustered_by=['columnname']

Run the dbt model multiple times and look for the operation "CLUSTER BY" in the Delta Lake table history. Find the column "Operation Parameters" and you will see something similar to this for every run:

{ "oldClusteringColumns": "columnname", "newClusteringColumns": "columnname" }
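The check described above can also be scripted. The following is a minimal sketch, assuming the table history has already been fetched (for example with DESCRIBE HISTORY) into a list of dictionaries with `operation` and `operationParameters` keys. The function name and the sample rows are hypothetical and only illustrate the shape of the data.

```python
import json


def redundant_cluster_by_ops(history_rows):
    """Return history rows whose CLUSTER BY operation left the columns unchanged."""
    redundant = []
    for row in history_rows:
        if row.get("operation") != "CLUSTER BY":
            continue
        params = row.get("operationParameters") or {}
        # Some clients return the parameters as a JSON string rather than a mapping.
        if isinstance(params, str):
            params = json.loads(params)
        if "newClusteringColumns" not in params:
            continue
        if params.get("oldClusteringColumns") == params["newClusteringColumns"]:
            redundant.append(row)
    return redundant


# Hypothetical history rows, newest first, modeled on the output shown above.
sample_history = [
    {
        "version": 3,
        "operation": "CLUSTER BY",
        "operationParameters": {
            "oldClusteringColumns": "columnname",
            "newClusteringColumns": "columnname",
        },
    },
    {"version": 2, "operation": "WRITE", "operationParameters": {"mode": "Append"}},
    {
        "version": 1,
        "operation": "CLUSTER BY",
        "operationParameters": {
            "oldClusteringColumns": "",
            "newClusteringColumns": "columnname",
        },
    },
]

print([row["version"] for row in redundant_cluster_by_ops(sample_history)])
```

Only version 3 is flagged here, because version 1 actually changed the clustering columns.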

Expected behavior

I would expect the CLUSTER BY operation not to run when the cluster columns are not changed.

System information

The output of dbt --version:

(dbt_1.8.5) PS C:\repo\dbt-bitechno> dbt --version
Core:
  - installed: 1.8.5
  - latest:    1.8.6 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.8.5 - Update available!
  - spark:      1.8.0 - Up to date!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: Microsoft Windows 11 Enterprise

The output of python --version: Python 3.10.14

benc-db commented 1 month ago

Thanks for reporting. Need to rethink this.

sundeep1687 commented 1 month ago

Hi, I am also facing the same issue. Another effect is that after the alter table cluster by, it runs "optimize table" every day. It would be great if you could prioritize this.

benc-db commented 1 month ago

@sundeep1687 you can skip optimize by setting DATABRICKS_SKIP_OPTIMIZE=true

benc-db commented 1 month ago

@krifra1234 is the alter operation slow? The alternative is querying for metadata to decide whether to do it or not, and that is not particularly fast.
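To make the trade-off concrete, here is a minimal sketch of the "compare before altering" alternative. It is not how dbt-databricks implements anything. It assumes the current clustering columns have already been read from table metadata, which is the extra query mentioned above, and that column names can be compared case-insensitively. The function names and the relation name are hypothetical.

```python
def normalize_columns(columns):
    """Normalize a column name or list of names for comparison."""
    if columns is None:
        return []
    if isinstance(columns, str):
        columns = [columns]
    # Assumption: names are case-insensitive and may be wrapped in backticks.
    return [c.strip().strip("`").lower() for c in columns if c and c.strip()]


def cluster_by_statement(relation, current, desired):
    """Return an ALTER TABLE ... CLUSTER BY statement, or None if nothing changed."""
    current_cols = normalize_columns(current)
    desired_cols = normalize_columns(desired)
    if current_cols == desired_cols:
        return None  # No change, so no CLUSTER BY entry is added to the history.
    if not desired_cols:
        return f"ALTER TABLE {relation} CLUSTER BY NONE"
    column_list = ", ".join(f"`{c}`" for c in desired_cols)
    return f"ALTER TABLE {relation} CLUSTER BY ({column_list})"


# Unchanged config: nothing to run.
print(cluster_by_statement("main.sales.orders", ["columnname"], ["columnname"]))
# Changed config: a statement is produced.
print(cluster_by_statement("main.sales.orders", ["columnname"], ["other_column"]))
```

The comparison itself is cheap. The cost discussed in this thread is the metadata lookup needed to obtain `current` on every run.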

sundeep1687 commented 1 month ago

The alter operation is almost instant, but the optimize afterwards takes a long time. Do you mean something like this in the config? If you have an example, please share.

{{ config(
    materialized="incremental",
    incremental_strategy='replace_where',
    incremental_predicates=[lookback_predicate],
    liquid_clustered_by='request_date_local',
    DATABRICKS_SKIP_OPTIMIZE=true,
    tags=["cvs"]
) }}

benc-db commented 1 month ago

I mean set it as an environment variable. It might also work as a dbt variable, but not as config.
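As a minimal sketch of the environment-variable approach, the snippet below copies the current environment, sets DATABRICKS_SKIP_OPTIMIZE=true, and passes it to a dbt invocation. The helper names and the model name are hypothetical. The same effect can be achieved by exporting the variable in the shell before calling dbt.

```python
import os
import subprocess


def env_with_skip_optimize(base_env=None):
    """Return a copy of the environment with DATABRICKS_SKIP_OPTIMIZE set to true."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATABRICKS_SKIP_OPTIMIZE"] = "true"
    return env


def run_dbt(select):
    """Run `dbt run --select <select>` with the explicit optimize step skipped."""
    return subprocess.run(
        ["dbt", "run", "--select", select],
        env=env_with_skip_optimize(),
        check=True,
    )


# Example call (requires dbt on PATH; "my_incremental_model" is a hypothetical name):
# run_dbt("my_incremental_model")
```

The variable is set for the dbt process only, so the model config itself stays unchanged.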

saadmansoorhbo commented 1 month ago

Hi @benc-db - Can you please confirm that setting DATABRICKS_SKIP_OPTIMIZE=true as a dbt var will not interfere with the table property autoOptimize.optimizeWrite=true? I understand that autoOptimize kicks in at write time and will probably work as-is, but I wanted to be 100% sure. We would like to avoid an explicit optimize on the table if autoOptimize is working.

benc-db commented 1 month ago

That variable only affects whether we call optimize explicitly.