databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Draft: #756 - implement python workflow submissions #762

Closed · kdazzle closed this 1 month ago

kdazzle commented 3 months ago

WIP - Stubs out implementation for #756

This pretty much implements what a workflow job submission type would look like, though I'm sure I'm missing something. Tests haven't been added yet.

Sample

Outside of the new submission type, models are the same. Here is what one could look like:

# my_model.py
import pyspark.sql.types as T
import pyspark.sql.functions as F

def model(dbt, session):
    dbt.config(
        materialized='incremental',
        submission_method='workflow_job'  # new submission type added by this PR
    )

    output_schema = T.StructType([
        T.StructField("id", T.StringType(), True),
        T.StructField("odometer_meters", T.DoubleType(), True),
        T.StructField("timestamp", T.TimestampType(), True),
    ])
    # `session` is the SparkSession dbt passes into the model; return an
    # empty DataFrame with the declared schema as a stub
    return session.createDataFrame(data=session.sparkContext.emptyRDD(), schema=output_schema)
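
Running it shouldn't change either: the intent is that `dbt run --select my_model` (or a plain `dbt build`) works exactly as it does today, and only how the adapter submits the Python code to Databricks differs.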

The config for a model could look like (forgive my jsonification...yaml data structures still freak me out):

models:
  - name: my_model
    workflow_job_config:
      email_notifications: {
        on_failure: ["reynoldxin@databricks.com"]
      }
      max_retries: 2
      timeout_seconds: 18000
      existing_job_id: 12341234  # not part of Databricks API (+ optional)
      additional_task_settings: {  # not part of Databricks API (+ optional)
        "task_key": "my_dbt_task"
      }
      post_hook_tasks: [{  # not part of Databricks API (+ optional)
        "depends_on": [{"task_key": "my_dbt_task"}],
        "task_key": "OPTIMIZE_AND_VACUUM",
        "notebook_task": {
          "notebook_path": "/my_notebook_path",
          "source": "WORKSPACE"
        }
      }]
      grants:  # not part of Databricks API (+ optional)
        view: [
          {"group_name": "marketing-team"}
        ]
        run: [
          {"user_name": "alighodsi@databricks.com"}
        ]
        manage: []
    job_cluster_config:
      spark_version: "15.3.x-scala2.12"
      node_type_id: "rd-fleet.2xlarge"
      runtime_engine: "STANDARD"
      data_security_mode: "SINGLE_USER"
      autoscale: {
        "min_workers": 1,
        "max_workers": 4
      }

Explanation

For the dbt-specific configs I added (on top of the Databricks API attributes), I tried to strike a balance between the dbt convention of requiring minimal configuration and exposing the full flexibility of the Databricks API. Attribute names split the difference between Databricks API naming and dbt naming conventions. Happy to change the approach on any of this.
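
To illustrate the minimal end of that spectrum: the intent is that a model can opt in with nothing beyond the submission method and override only what it needs, with everything else falling back to defaults. A rough sketch (not verified against the current implementation, and the defaults themselves are still open):

models:
  - name: my_model
    workflow_job_config:
      # every key here is optional; omit the whole block to take adapter defaults
      max_retries: 2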


benc-db commented 1 month ago

@kdazzle can you rebase/target your PR against 1.9.latest? I have a couple of things that I need to wrap up, but I'm planning to take some version of this into the 1.9 release.
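
(For reference, that's typically `git fetch upstream` followed by `git rebase upstream/1.9.latest` on the PR branch and a force-push, assuming `upstream` points at databricks/dbt-databricks; the PR's base branch can then be switched to 1.9.latest from the GitHub UI.)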

benc-db commented 1 month ago

Based on the unit test failures, it looks like some of the syntax you're using doesn't work with Python 3.8.
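
The usual culprits are newer typing syntax. A hypothetical example of the kind of change that keeps 3.8 happy (not necessarily the actual offending lines in this PR):

from typing import Optional

# PEP 604 unions like `dict | None` in annotations raise a TypeError at import
# time on Python <= 3.9 (unless `from __future__ import annotations` is used),
# so the 3.8-safe spelling goes through typing.Optional instead.
def get_existing_job_id(config: Optional[dict]) -> Optional[int]:
    return (config or {}).get("existing_job_id")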

benc-db commented 1 month ago

Going to pull/push to origin to run the existing functional tests. We should add one for this new code. Let me know if you need help with that.
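
As a starting point, a functional test could look roughly like the sketch below. It follows the usual dbt pytest pattern (a class-scoped `models` fixture plus the `project` fixture from dbt's test framework); the model body, file name, and class name are placeholders:

import pytest
from dbt.tests.util import run_dbt

# Hypothetical sketch of a functional test for the workflow_job submission
# method; assumes the standard dbt pytest fixtures used by this repo's tests.
workflow_model_py = """
def model(dbt, session):
    dbt.config(materialized="table", submission_method="workflow_job")
    return session.createDataFrame([(1,)], "id int")
"""

class TestPythonWorkflowSubmission:
    @pytest.fixture(scope="class")
    def models(self):
        return {"workflow_model.py": workflow_model_py}

    def test_workflow_job_run(self, project):
        results = run_dbt(["run"])
        assert len(results) == 1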

benc-db commented 1 month ago

Going to merge in 1.9.latest changes (which is basically only 1.8 changes), ensure tests still pass, then merge.