databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Unable to execute multiple SQL statements on one entry in hooks config #831

Open geo-martino opened 1 month ago

geo-martino commented 1 month ago

Describe the bug

I am attempting to execute multiple SQL statements in various hooks, and I am unable to execute multiple statements separated by ; within a single entry.

I am generating multiple GRANT statements with a macro. I've included an abridged version of the macro and its usage below, showing how it is used in both the on-run-start and on-run-end hooks in the dbt_project file as a single entry.

I find that if a macro generates multiple separate SQL statements, the command fails; if a macro generates just one SQL statement, it executes with no error. I see this exact behaviour when configuring the on-run-start, on-run-end, pre-hook, and post-hook hooks.

As a workaround, I have set the macros to execute each statement when called by using the run_query macro within the grant_access macro.

Steps To Reproduce

A failing macro which generates multiple SQL statements:

{% macro grant_access(table_name) -%}
  {%- set group_name = "group" -%}
  {%- set catalog_identifier = "`catalog_name`" -%}
  {%- set schema_identifier = catalog_identifier ~ ".`schema_name`" -%}
  {%- set table_identifier = schema_identifier ~ ".`" ~ table_name ~ "`" -%}

  {%- if execute -%}
      GRANT USE CATALOG ON CATALOG {{ catalog_identifier }} TO `{{ group_name }}`;
      GRANT USE SCHEMA ON SCHEMA {{ schema_identifier }} TO `{{ group_name }}`;
      GRANT SELECT ON TABLE {{ table_identifier }} TO `{{ group_name }}`;
      GRANT MODIFY ON TABLE {{ table_identifier }} TO `{{ group_name }}`;
  {%- endif -%}
{%- endmacro %}

A successful macro which generates only one SQL statement:

{% macro grant_access_catalog(table_name) -%}
  {%- set group_name = "group" -%}
  {%- set catalog_identifier = "`catalog_name`" -%}
  {%- set schema_identifier = catalog_identifier ~ ".`schema_name`" -%}
  {%- set table_identifier = schema_identifier ~ ".`" ~ table_name ~ "`" -%}

  {%- if execute -%}
      GRANT USE CATALOG ON CATALOG {{ catalog_identifier }} TO `{{ group_name }}`;
  {%- endif -%}
{%- endmacro %}

dbt_project.yml:

on-run-start:
  - "{{ grant_access('start_table') }}"
on-run-end:
  - "{{ grant_access_catalog('end_table') }}"

Whenever I execute dbt run, the hook generated by the multi-statement macro fails.

Expected behavior

The statements should execute successfully. The dbt documentation shows that it is possible to build and execute multiple statements with Jinja within one entry on these config keys.

Screenshots and log output

18:32:10.323706 [debug] [MainThread]: On master: /* {"app": "dbt", "dbt_version": "1.8.7", "dbt_databricks_version": "1.8.7", "databricks_sql_connector_version": "3.1.2", "profile_name": "cibc", "target_name": "local", "connection_name": "master"} */

        GRANT USE CATALOG ON CATALOG `catalog_name` TO `deployments`;
        GRANT USE SCHEMA ON SCHEMA `catalog_name`.`schema_name` TO `deployments`;
        GRANT SELECT ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`;
        GRANT MODIFY ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`;
18:32:10.324107 [debug] [MainThread]: Databricks adapter: Cursor(session-id=4a9c2a98-f79f-46cb-9b3c-cad9bea26f00, command-id=Unknown) - Created cursor
18:32:10.548658 [debug] [MainThread]: Databricks adapter: Cursor(session-id=4a9c2a98-f79f-46cb-9b3c-cad9bea26f00, command-id=Unknown) - Closing cursor
18:32:10.550652 [debug] [MainThread]: Databricks adapter: Exception while trying to execute query
/* {"app": "dbt", "dbt_version": "1.8.7", "dbt_databricks_version": "1.8.7", "databricks_sql_connector_version": "3.1.2", "profile_name": "cibc", "target_name": "local", "connection_name": "master"} */

        GRANT USE CATALOG ON CATALOG `catalog_name` TO `deployments`;
        GRANT USE SCHEMA ON SCHEMA `catalog_name`.`schema_name` TO `deployments`;
        GRANT SELECT ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`;
        GRANT MODIFY ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`;
: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'USE'. SQLSTATE: 42601 (line 3, pos 14)

== SQL ==
/* {"app": "dbt", "dbt_version": "1.8.7", "dbt_databricks_version": "1.8.7", "databricks_sql_connector_version": "3.1.2", "profile_name": "cibc", "target_name": "local", "connection_name": "master"} */

        GRANT USE CATALOG ON CATALOG `catalog_name` TO `deployments`;
--------------^^^
        GRANT USE SCHEMA ON SCHEMA `catalog_name`.`schema_name` TO `deployments`;
        GRANT SELECT ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`;
        GRANT MODIFY ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`

Error properties: diagnostic-info=org.apache.hive.service.cli.HiveSQLException: Error running query: [PARSE_SYNTAX_ERROR] org.apache.spark.sql.catalyst.parser.ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'USE'. SQLSTATE: 42601 (line 3, pos 14)

== SQL ==
/* {"app": "dbt", "dbt_version": "1.8.7", "dbt_databricks_version": "1.8.7", "databricks_sql_connector_version": "3.1.2", "profile_name": "cibc", "target_name": "local", "connection_name": "master"} */

        GRANT USE CATALOG ON CATALOG `catalog_name` TO `deployments`;
--------------^^^
        GRANT USE SCHEMA ON SCHEMA `catalog_name`.`schema_name` TO `deployments`;
        GRANT SELECT ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`;
        GRANT MODIFY ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`

    at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:49)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$execute$1(SparkExecuteStatementOperation.scala:786)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
    at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:624)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$5(SparkExecuteStatementOperation.scala:469)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.execution.SQLExecution$.withRootExecution(SQLExecution.scala:704)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:469)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:276)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:272)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
    at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:27)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags(AttributionContextTracing.scala:95)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags$(AttributionContextTracing.scala:76)
    at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:27)
    at com.databricks.spark.util.PublicDBLogging.withAttributionTags0(DatabricksSparkUsageLogger.scala:74)
    at com.databricks.spark.util.DatabricksSparkUsageLogger.withAttributionTags(DatabricksSparkUsageLogger.scala:174)
    at com.databricks.spark.util.UsageLogging.$anonfun$withAttributionTags$1(UsageLogger.scala:617)
    at com.databricks.spark.util.UsageLogging$.withAttributionTags(UsageLogger.scala:729)
    at com.databricks.spark.util.UsageLogging$.withAttributionTags(UsageLogger.scala:738)
    at com.databricks.spark.util.UsageLogging.withAttributionTags(UsageLogger.scala:617)
    at com.databricks.spark.util.UsageLogging.withAttributionTags$(UsageLogger.scala:615)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withAttributionTags(SparkExecuteStatementOperation.scala:71)
    at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.$anonfun$withLocalProperties$11(ThriftLocalProperties.scala:195)
    at com.databricks.spark.util.IdentityClaim$.withClaim(IdentityClaim.scala:48)
    at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties(ThriftLocalProperties.scala:190)
    at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties$(ThriftLocalProperties.scala:71)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:71)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:446)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:432)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:482)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.sql.catalyst.parser.ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'USE'. SQLSTATE: 42601 (line 3, pos 14)

== SQL ==
/* {"app": "dbt", "dbt_version": "1.8.7", "dbt_databricks_version": "1.8.7", "databricks_sql_connector_version": "3.1.2", "profile_name": "cibc", "target_name": "local", "connection_name": "master"} */

        GRANT USE CATALOG ON CATALOG `catalog_name` TO `deployments`;
--------------^^^
        GRANT USE SCHEMA ON SCHEMA `catalog_name`.`schema_name` TO `deployments`;
        GRANT SELECT ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`;
        GRANT MODIFY ON TABLE `catalog_name`.`schema_name`.`start_table` TO `deployments`

    at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:308)
    at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:114)
    at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:137)
    at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(AbstractSqlParser.scala:106)
    at com.databricks.sql.parser.DatabricksSqlParser.$anonfun$parsePlan$1(DatabricksSqlParser.scala:80)
    at com.databricks.sql.parser.DatabricksSqlParser.parse(DatabricksSqlParser.scala:101)
    at com.databricks.sql.parser.DatabricksSqlParser.parsePlan(DatabricksSqlParser.scala:77)
    at com.databricks.sql.QueryRuntimePredictionUtils$.$anonfun$getParsedPlanWithTracking$2(QueryRuntimePrediction.scala:383)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:452)
    at com.databricks.sql.QueryRuntimePredictionUtils$.$anonfun$getParsedPlanWithTracking$1(QueryRuntimePrediction.scala:382)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionPhase(SQLExecution.scala:143)
    at com.databricks.sql.QueryRuntimePredictionUtils$.getParsedPlanWithTracking(QueryRuntimePrediction.scala:382)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$analyzeQuery$2(SparkExecuteStatementOperation.scala:608)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1180)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$analyzeQuery$1(SparkExecuteStatementOperation.scala:590)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getOrCreateDF(SparkExecuteStatementOperation.scala:580)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.analyzeQuery(SparkExecuteStatementOperation.scala:590)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$execute$5(SparkExecuteStatementOperation.scala:663)
    at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:537)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$execute$1(SparkExecuteStatementOperation.scala:663)
    ... 43 more

System information

The output of dbt --version:

Core:
  - installed: 1.8.7
  - latest:    1.8.7 - Up to date!

Plugins:
  - databricks: 1.8.7 - Up to date!
  - spark:      1.8.0 - Up to date!

The operating system you're using: macOS Sequoia 15.0.1 (24A348)

The output of python --version: Python 3.12.7

Additional context

I am not sure whether this is an issue with dbt-databricks or with dbt-core. I am posting here first because, according to the dbt-core documentation noted above, this should be supported; my guess is therefore that the issue stems from the adapter, but I am happy to forward this issue to dbt-core if I am wrong.

benc-db commented 1 month ago

This is because our path for executing SQL does not allow executing multiple statements separated by ;. This is a Databricks limitation, rather than a dbt limitation. I believe that even the places in the Databricks UI that allow you to execute multiple ;-separated statements do so by splitting them up and executing them one by one. We can keep this bug open, though, as I can see that this limitation is painful enough for the grants operation that we should rewrite the code to split and execute statements one by one; however, when I did this for MVs/STs it was pretty challenging, so it will probably not be fixed in the short term.
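The split-and-execute approach described here can be sketched in a few lines. This is a minimal illustration only, using Python's sqlite3 cursor as a stand-in for a Databricks cursor (the `run_multi` helper and the demo table are hypothetical, not part of the adapter):

```python
import sqlite3

def run_multi(cursor, sql: str) -> None:
    """Split a multi-statement string on ';' and execute each piece
    separately, since the underlying execute path accepts only one
    statement at a time. Naive: assumes no literal ';' appears inside
    string values or comments."""
    for statement in sql.split(";"):
        statement = statement.strip()
        if statement:
            cursor.execute(statement)

# Demo with sqlite3 standing in for a Databricks cursor:
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
run_multi(cur, """
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1);
    INSERT INTO t VALUES (2);
""")
print(cur.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```

The splitting is the hard part in practice: a robust version has to respect string literals and comments, which is part of why this is challenging to implement in the adapter.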

benc-db commented 1 month ago

The flip side though is that you can put them into hooks as separate entries:

"on-run-start": [
                "<first grant>",
                "<second grant>"
            ]
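Applied to the abridged example above, that would mean breaking grant_access into single-statement macros and listing them as separate entries (a sketch only; the macro names here are illustrative, not from the original project):

```yaml
on-run-start:
  - "{{ grant_use_catalog() }}"
  - "{{ grant_use_schema() }}"
  - "{{ grant_select('start_table') }}"
  - "{{ grant_modify('start_table') }}"
```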
geo-martino commented 4 weeks ago

Understood. This does quite limit our ability to logically group multiple related statements (e.g. GRANT or ALTER statements) into one macro for ease of use within config files. Separating the grant statements into their own entries in the hooks config would be excessive and make the config quite messy, and in many cases it is not possible at all, as the exact queries to execute in a given hook are generated dynamically based on the properties of the model itself.

For now, we have a workaround that executes the queries at compile time and returns no statements to the hook itself, as below.

{% macro grant_access(table_name) -%}
  {%- set group_name = "group" -%}
  {%- set catalog_identifier = "`catalog_name`" -%}
  {%- set schema_identifier = catalog_identifier ~ ".`schema_name`" -%}
  {%- set table_identifier = schema_identifier ~ ".`" ~ table_name ~ "`" -%}

  {%- if execute -%}
    {%- set queries %}
      GRANT USE CATALOG ON CATALOG {{ catalog_identifier }} TO `{{ group_name }}`;
      GRANT USE SCHEMA ON SCHEMA {{ schema_identifier }} TO `{{ group_name }}`;
      GRANT SELECT ON TABLE {{ table_identifier }} TO `{{ group_name }}`;
      GRANT MODIFY ON TABLE {{ table_identifier }} TO `{{ group_name }}`;
    {%- endset -%}

    {{ run_queries(queries) }}
  {%- endif -%}
{%- endmacro %}

{% macro run_queries(queries) -%}
  {% for query in (queries | trim | trim(';')).split(';') -%}
    {%- do run_query(query | trim) -%}
  {%- endfor %}
{%- endmacro %}
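The splitting logic in the run_queries macro above translates directly to Python; a sketch of the equivalent (with the same caveat the Jinja version carries: a literal ; inside a quoted string value would also be split):

```python
def split_statements(queries: str) -> list[str]:
    # Equivalent of the run_queries macro above: trim whitespace,
    # strip surrounding ';', split on ';', and drop empty pieces.
    return [q.strip() for q in queries.strip().strip(";").split(";") if q.strip()]

sql = """
GRANT SELECT ON TABLE `c`.`s`.`t` TO `g`;
GRANT MODIFY ON TABLE `c`.`s`.`t` TO `g`;
"""
print(split_statements(sql))  # two single-statement strings
```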

The drawback is that, because execution happens when the macro is compiled, the logs appear to hang before the hook runs; the hook itself then appears to complete instantaneously, since the macro returned no statement for it to execute.

We will continue to use this workaround in the meantime, but it would be great to have this looked into in the future.