databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Run dbt python models as full-fledged jobs (not one-time job submission) #756

Open kdazzle opened 2 months ago

kdazzle commented 2 months ago

Describe the feature

I'd like to leverage Databricks' job running functionality but still use dbt to manage the orchestration and development workflow. So, from the Databricks side, instead of running my python jobs using the one-time submission API, I would want a full-fledged Workflow Job to be created for certain dbt python models. This would be a new submission_method. I think it will fill in some gaps in the dbt-databricks integration and make it easier for heavy Databricks users to migrate pipelines into dbt/dbt cloud.
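As a rough sketch of what this could look like from the model author's side (illustrative only: `workflow_job` is the new submission_method this issue proposes, not an option the adapter supports today):

```python
# models/my_model.py -- hypothetical sketch; "workflow_job" is the proposed
# submission_method from this issue, not an existing adapter option.
def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="workflow_job",  # proposed: create/update a persistent Workflow job
                                           # instead of a one-time run submission
    )
    return dbt.ref("upstream_model")
```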

One of the reasons is that dbt Cloud's job management features are limited compared to Databricks. For example: dbt doesn't natively offer retries or timeout warnings, runs time out after a maximum of 24 hours, and its notifications need a lot of work, whereas the Databricks run history is great. These are all things Databricks has nailed down pretty well over the years.

Another reason is that, even if dbt had feature parity, there are a bunch of people who are using these features from the Databricks side and who are happy with them. I'd like to leverage dbt for the dev lifecycle/orchestration improvements and continue to leverage Databricks for other things. Allowing models to materialize as Workflows accomplishes both of those goals, and makes it easier for people to move their workloads into dbt.

Implementation

See #762

TL;DR: creates a Databricks Workflow named `my-database-my-schema_my-model__dbt` for each model, which gets triggered in the usual dbt manner (`dbt run --select=my_model`).

It also allows `post_hook_tasks` to be added to the workflow.
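For illustration, a hedged sketch of how `post_hook_tasks` might be configured. The exact schema is whatever #762 defines; the task dictionary keys below are assumptions modeled on the Databricks Jobs API task spec, and the notebook path is a placeholder.

```python
# models/my_model.py -- hypothetical sketch; the shape of `post_hook_tasks`
# is defined in #762, and the keys below are assumptions based on the
# Databricks Jobs API task schema.
def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="workflow_job",  # proposed in this issue
        post_hook_tasks=[
            {
                "task_key": "post_model_checks",
                "notebook_task": {"notebook_path": "/Shared/post_model_checks"},
            }
        ],
    )
    return dbt.ref("upstream_model")
```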

Describe alternatives you've considered

The current setup, via the dbt-databricks `all_purpose_cluster`/`job_cluster` submission methods. But these are somewhat limiting for the reasons described above.

I've also considered running dbt Cloud jobs via Workflows (docs). However, I'd still like to run all of my jobs through dbt Cloud/dbt Core. This seems like it would make my setup more complicated: do I create a separate dbt Cloud job for every workflow that I want to run? How would this work in development/CI? What if I want to run several models at once - do I run them each separately in Databricks?


Who will this benefit?

Anyone using Python models in dbt-databricks.

Are you interested in contributing this feature?

Let us know if you want to write some code, and how we can help.

Yes - I'd be happy to help. I've got some time I could set aside for this in late August/September.

benc-db commented 2 months ago

What is the advantage of doing it this way, over using Databricks Workflow with dbt task type?

kdazzle commented 2 months ago

Hey @benc-db, good point - I forgot about the dbt task type. What I'm proposing is to be able to integrate a Databricks Workflow job into the rest of the DAG, even if other models are being run via dbt Cloud. In which case, if a model were created with the dbt task type, that would be great too (assuming you can use more than just a SQL Warehouse).

My impression is that the dbt task type isn't meant for an individual model - it's more for the whole workflow. But I haven't used that task type, tbh.
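For context, a rough sketch (not from this thread) of the dbt-task alternative benc-db mentions, using the Databricks Python SDK. It runs a whole dbt invocation inside one Workflow task rather than materializing an individual model; the warehouse ID, cluster ID, and project path are placeholders.

```python
# Hedged sketch of a Workflow with a dbt task, via the Databricks Python SDK.
# IDs and paths below are placeholders, not values from this issue.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()  # auth from environment/config profile

job = w.jobs.create(
    name="dbt_run_my_model",
    tasks=[
        jobs.Task(
            task_key="run_dbt",
            dbt_task=jobs.DbtTask(
                commands=["dbt deps", "dbt run --select my_model"],
                project_directory="",               # root of the repo checkout
                warehouse_id="<sql-warehouse-id>",  # SQL warehouse the dbt task targets
            ),
            existing_cluster_id="<cluster-id>",     # cluster that runs the dbt CLI itself
            libraries=[
                compute.Library(pypi=compute.PythonPyPiLibrary(package="dbt-databricks"))
            ],
        )
    ],
)
print(f"Created job {job.job_id}")
```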

Having to use Databricks Workflows for all dbt models wouldn't be ideal, though, as we wouldn't be able to hook into features from dbt Cloud, among other reasons.