databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0
226 stars 119 forks source link

Runs on 1.8 are erroring due to `apply_liquid_clustered_cols` logic #805

Closed jeremyyeo closed 1 month ago

jeremyyeo commented 1 month ago

Describe the bug

Opening on behalf of customer.

Due to the introduction of https://github.com/databricks/dbt-databricks/blob/49b9c80492274e922b41d54abe6ae5eb3888e047/dbt/include/databricks/macros/relations/liquid_clustering.sql#L24 in 1.8 - this broke some customers runs of tables that they had created outside of dbt.

Steps To Reproduce

  1. Create a table with partition outside of dbt:
create or replace table dev.dbt_jyeo.my_inc (id int, updated_at date, is_active int) partitioned by (updated_at);
  1. Add model to dbt:
-- models/my_inc.sql
{{ config(materialized='incremental') }}

select 1 id, '1970-01-01'::date as updated_at, 1 as is_active
# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: "1.0.0"

models:
  my_dbt_project:
    +materialized: table

No schema yaml file.

  1. Build with 1.8:
$ dbt --debug build
...
23:06:48  On model.my_dbt_project.my_inc: /* {"app": "dbt", "dbt_version": "1.8.6", "dbt_databricks_version": "1.8.5", "databricks_sql_connector_version": "3.1.2", "profile_name": "all", "target_name": "db", "node_id": "model.my_dbt_project.my_inc"} */
-- back compat for old kwarg name

  merge into `dev`.`dbt_jyeo`.`my_inc` as DBT_INTERNAL_DEST
      using `my_inc__dbt_tmp` as DBT_INTERNAL_SOURCE
      on FALSE
      when matched then update set *
      when not matched then insert *

23:06:48  Databricks adapter: Cursor(session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, command-id=Unknown) - Created cursor
23:06:51  SQL status: OK in 2.870 seconds
23:06:51  Databricks adapter: Cursor(session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, command-id=01ef7a00-82f1-1d18-ba94-f9d2628e7765) - Closing cursor
23:06:51  Databricks adapter: DatabricksDBTConnection(id=5553398992, session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, name=model.my_dbt_project.my_inc, idle-time=5.748247861862183s, acquire-count=1, language=sql, thread-identifier=(82697, 6162509824), compute-name=) - Checking idleness
23:06:51  Databricks adapter: DatabricksDBTConnection(id=5553398992, session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, name=model.my_dbt_project.my_inc, idle-time=5.748767137527466s, acquire-count=1, language=sql, thread-identifier=(82697, 6162509824), compute-name=) - Retrieving connection
23:06:51  Using databricks connection "model.my_dbt_project.my_inc"
23:06:51  On model.my_dbt_project.my_inc: /* {"app": "dbt", "dbt_version": "1.8.6", "dbt_databricks_version": "1.8.5", "databricks_sql_connector_version": "3.1.2", "profile_name": "all", "target_name": "db", "node_id": "model.my_dbt_project.my_inc"} */
ALTER table `dev`.`dbt_jyeo`.`my_inc` CLUSTER BY NONE
23:06:51  Databricks adapter: Cursor(session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, command-id=Unknown) - Created cursor
23:06:52  Databricks adapter: Cursor(session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, command-id=Unknown) - Closing cursor
23:06:52  Databricks adapter: Exception while trying to execute query
/* {"app": "dbt", "dbt_version": "1.8.6", "dbt_databricks_version": "1.8.5", "databricks_sql_connector_version": "3.1.2", "profile_name": "all", "target_name": "db", "node_id": "model.my_dbt_project.my_inc"} */
ALTER table `dev`.`dbt_jyeo`.`my_inc` CLUSTER BY NONE
: [DELTA_ALTER_TABLE_CLUSTER_BY_ON_PARTITIONED_TABLE_NOT_ALLOWED] ALTER TABLE CLUSTER BY cannot be applied to a partitioned table.
Error properties: diagnostic-info=org.apache.hive.service.cli.HiveSQLException: Error running query: [DELTA_ALTER_TABLE_CLUSTER_BY_ON_PARTITIONED_TABLE_NOT_ALLOWED] com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_ALTER_TABLE_CLUSTER_BY_ON_PARTITIONED_TABLE_NOT_ALLOWED] ALTER TABLE CLUSTER BY cannot be applied to a partitioned table.
        at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:49)
        at 
<TRUNCATED>
Caused by: com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_ALTER_TABLE_CLUSTER_BY_ON_PARTITIONED_TABLE_NOT_ALLOWED] ALTER TABLE CLUSTER BY cannot be applied to a partitioned table.
        at com.databricks.sql.transaction.tahoe.DeltaErrorsEdge.alterTableClusterByOnPartitionedTableException(DeltaErrorsEdge.scala:677)
        at com.databricks.sql.transaction.tahoe.DeltaErrorsEdge.alterTableClusterByOnPartitionedTableException$(DeltaErrorsEdge.scala:676)
        at <TRUNCATED>
        ... 43 more
, operation-id=01ef7a00-84ac-175c-bd54-05270a4423d9
23:06:52  Databricks adapter: DatabricksDBTConnection(id=5553398992, session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, name=model.my_dbt_project.my_inc, idle-time=5.245208740234375e-06s, acquire-count=0, language=sql, thread-identifier=(82697, 6162509824), compute-name=) - Released connection
23:06:52  Runtime Error in model my_inc (models/my_inc.sql)
  [DELTA_ALTER_TABLE_CLUSTER_BY_ON_PARTITIONED_TABLE_NOT_ALLOWED] ALTER TABLE CLUSTER BY cannot be applied to a partitioned table.
23:06:52  Databricks adapter: DatabricksDBTConnection(id=5553398992, session-id=01ef7a00-7fe1-1449-a50a-89837d2c59a8, name=model.my_dbt_project.my_inc, idle-time=1.1920928955078125e-06s, acquire-count=0, language=sql, thread-identifier=(82697, 6162509824), compute-name=) - Released connection
23:06:52  Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'b69d65ab-24cb-484d-86da-b1df0232e357', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x14c840850>]}
23:06:52  1 of 1 ERROR creating sql incremental model dbt_jyeo.my_inc .................... [ERROR in 7.16s]

^ After the merge, we try to ALTER ... CLUSTER BY NONE which causes the error.

Back on 1.7 - we didn't attempt to do that and therefore no errors:

$ dbt --debug run
23:08:54  On model.my_dbt_project.my_inc: /* {"app": "dbt", "dbt_version": "1.7.18", "dbt_databricks_version": "1.7.17", "databricks_sql_connector_version": "2.9.6", "profile_name": "all", "target_name": "db", "node_id": "model.my_dbt_project.my_inc"} */
-- back compat for old kwarg name

  merge into `dev`.`dbt_jyeo`.`my_inc` as DBT_INTERNAL_DEST
      using `my_inc__dbt_tmp` as DBT_INTERNAL_SOURCE
      on FALSE
      when matched then update set *
      when not matched then insert *

23:08:54  Databricks adapter: Cursor(session-id=01ef7a00-b8ed-1703-9a62-5a2d12ff093a, command-id=Unknown) - Created cursor
23:08:57  SQL status: OK in 2.9200000762939453 seconds
23:08:57  Databricks adapter: Cursor(session-id=01ef7a00-b8ed-1703-9a62-5a2d12ff093a, command-id=01ef7a00-ce07-188a-a0e0-b5e5c4e3cd65) - Closing cursor
23:08:57  Timing info for model.my_dbt_project.my_inc (execute): 11:08:52.415205 => 11:08:57.355033
23:08:57  Databricks adapter: DatabricksDBTConnection(id=5580609680, session-id=01ef7a00-b8ed-1703-9a62-5a2d12ff093a, name=model.my_dbt_project.my_inc, idle-time=1.9073486328125e-06s, acquire-count=0, language=sql, thread-identifier=(83658, 6166458368), compute-name=) - Released connection
23:08:57  Databricks adapter: DatabricksDBTConnection(id=5580609680, session-id=01ef7a00-b8ed-1703-9a62-5a2d12ff093a, name=model.my_dbt_project.my_inc, idle-time=9.5367431640625e-07s, acquire-count=0, language=sql, thread-identifier=(83658, 6166458368), compute-name=) - Released connection
23:08:57  Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '857325f9-af02-497c-849b-db17a658cf14', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x13ed2ce10>]}
23:08:57  1 of 1 OK created sql incremental model dbt_jyeo.my_inc ........................ [OK in 4.95s]

Expected behavior

Perhaps not error when customers upgrade from 1.7 > 1.8 (Versionless in dbt Cloud) by not sending the ALTER ... CLUSTER BY NONE query if the table has a partition (even if the partition was not declared in a schema yaml file partition_by config?).

Screenshots and log output

As above.

System information

The output of dbt --version:

Core:
  - installed: 1.8.6
  - latest:    1.8.6 - Up to date!
Plugins:
  - databricks: 1.8.5 - Update available!

The operating system you're using: macOS

The output of python --version: 3.11.9

Additional context

Additionally, the customer mentioned that it would be useful to have had some documentation that:

dbt will run ALTER TABLE CLUSTER BY NONE on every model in databricks irrespective of it having any cluster config or not

^ I don't really know where this would go personally but I guess it wasn't too clear that this was happening if you just read release notes here: https://github.com/databricks/dbt-databricks/releases/tag/v1.8.2

One would have to deep dive into the actual code changes.

benc-db commented 1 month ago

Thanks for reporting...liquid clustering optimization needs a rethink.

benc-db commented 1 month ago

One of the issues here is that the table is created and partitioned outside of dbt. If the table's partition existed inside of dbt, the ALTER statement would not be attempted.