astronomer / astro-provider-databricks

Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows
Apache License 2.0

Support sending parameterized SQL queries to Databricks Jobs #26

Open jlaneve opened 1 year ago


Problem statement

Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks via the DatabricksNotebookOperator. While this unlocks all Databricks Python-based development (and, to some extent, SQL through spark.sql calls), it does not let users take advantage of Databricks SQL, which limits the flows users can create.

To solve this, we should add support for sql_task tasks.

sql_task tasks let a Databricks job refer to query objects that have been created in the Databricks SQL editor. These queries can be parameterized by the user at runtime.
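To make the shape of such a task concrete, here is a sketch of the payload a sql_task entry contributes to a job's `tasks` array, with field names following the public Databricks Jobs API 2.1 docs; the `query_id`, `warehouse_id`, and parameter values are placeholders, not real identifiers:

```python
# Sketch of a sql_task entry in a Databricks Jobs API 2.1 job spec.
# All IDs below are placeholders; "{{ ds }}" shows where an Airflow
# template could supply a runtime parameter value.
sql_task_payload = {
    "task_key": "sales_summary_query",
    "sql_task": {
        "query": {"query_id": "my-query-id"},    # ID of a saved Databricks SQL query
        "warehouse_id": "my-warehouse-id",       # SQL warehouse to execute on
        "parameters": {"run_date": "{{ ds }}"},  # runtime query parameters
    },
}
```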


Solving this issue would involve two steps:

1. Create a DatabricksSqlQueryOperator that expects a query ID instead of a raw SQL query. If run outside of a DatabricksWorkflowTaskGroup, this operator would be able to launch and monitor a SQL task on its own.
2. Add a convert_to_databricks_workflow_task method to convert the SQL operator task into a workflow task.
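A minimal sketch of how the two pieces could fit together, assuming the operator and method names proposed above; this is not the provider's real API, and a real implementation would subclass Airflow's BaseOperator and use the Databricks hook, both omitted here for brevity:

```python
# Hypothetical sketch only: class and method names are assumptions from
# this issue, not existing provider code. A real operator would extend
# airflow.models.BaseOperator and call Databricks via a hook.
class DatabricksSqlQueryOperator:
    def __init__(self, task_id, query_id, warehouse_id, parameters=None):
        self.task_id = task_id            # Airflow task ID, reused as the job task_key
        self.query_id = query_id          # ID of a saved Databricks SQL query
        self.warehouse_id = warehouse_id  # SQL warehouse to run the query on
        self.parameters = parameters or {}

    def convert_to_databricks_workflow_task(self):
        """Build the sql_task entry contributed to the parent workflow's job spec."""
        return {
            "task_key": self.task_id,
            "sql_task": {
                "query": {"query_id": self.query_id},
                "warehouse_id": self.warehouse_id,
                "parameters": self.parameters,
            },
        }
```

Standalone execution (step one) would then submit this same spec as a one-off job, while inside a DatabricksWorkflowTaskGroup (step two) the task group would collect these entries into a single workflow.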

For this task to be considered complete, a SQL query task should be added to the example DAG and run through CI/CD.