databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Running dbt-databricks on a job cluster #575

Open leo-schick opened 7 months ago

leo-schick commented 7 months ago

Describe the feature

It is possible to run dbt SQL models inside a job cluster when:

I would like to see:

Describe alternatives you've considered

I ran several tests with token-based authentication, but it looks like job clusters expose a different Spark endpoint. Token-based authentication does not work on a job cluster.

Something to note

Python models currently do not work with this approach.

Who will this benefit?

Databricks license costs are reduced, because no general-purpose cluster is needed to run dbt inside Databricks.

gaoshihang commented 4 months ago

Hi @leo-schick, could you please give an example of how to run SQL models on a Databricks job cluster? Thanks!

leo-schick commented 4 months ago

@gaoshihang I have now written an article on Medium: "How to run dbt on a Databricks job cluster"

moritzmeister commented 3 months ago

@leo-schick feel free to upvote our feature request for a more native way to run dbt on job clusters https://ideas.databricks.com/ideas/DBE-I-1415

leo-schick commented 3 months ago

@moritzmeister I do not have access to this page

chrismbeach commented 3 months ago

@leo-schick - I'm very interested in this topic right now, also. I have a case where we have a number of models that need to be run for a given task, and until now we've just been eating the cost of running an all-purpose cluster and directing model exec to there, but I'm trying to switch to job clusters right now. I have a test case working using the submission_method: job_cluster config - but of course that's triggering a cluster per model as you mentioned.
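For readers unfamiliar with the config mentioned above, here is a minimal sketch of a dbt Python model that sets submission_method to job_cluster. The model name upstream_model and every value inside job_cluster_config (Spark version, node type, worker count) are placeholders assumed for illustration, not settings taken from this thread. With this config, each Python model is submitted to its own job cluster, which is the cluster-per-model behaviour described above.

```python
# Hypothetical file: models/my_python_model.py
# All cluster values below are placeholders; adjust them to your workspace.

def model(dbt, session):
    dbt.config(
        materialized="table",
        # Submit this model as a one-off run on a job cluster
        # instead of an all-purpose cluster.
        submission_method="job_cluster",
        # Cluster spec for the job cluster (placeholder values).
        job_cluster_config={
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    )

    # "upstream_model" is an assumed model name used only for this sketch.
    df = dbt.ref("upstream_model")
    return df
```

Because the config lives in the model itself, every Python model configured this way starts its own cluster, so startup time is paid once per model.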

I tried setting up a shared job cluster, capturing the cluster id and passing it through to the models to use - but hit a bizarre access issue where it tells me (despite my account having 'can manage' on the cluster in question):

Error creating an execution context.
   b'{"error":"WorkspaceAclExceptions.WorkspacePermissionDeniedException: my.account@my.com does not have Attach permissions on 0613-041925-akezdk3x. Please contact the owner or an administrator for access."}\n'

Which seems like a pretty misleading response... any thoughts?

@moritzmeister - likewise, can you share access to that link? I'll happily upvote also!

leo-schick commented 3 months ago

@chrismbeach have you tried the approach I described in my Medium post, "How to run dbt on a Databricks job cluster"?

You can find my helper Notebooks here: https://github.com/leo-schick/databricks-dbt-helper

chrismbeach commented 3 months ago

Thanks @leo-schick - I've not - since it's dbt python models I need to run :( Per your summary in https://github.com/databricks/dbt-databricks/issues/586 that doesn't appear viable atm, due to (seemingly unreasonable) access restrictions?

moritzmeister commented 3 months ago

Hey @leo-schick, hey @chrismbeach, you should get access to the page if you have access to Databricks support. I think you need a support contract with them for that.

I also talked with Databricks support about this. This was their response:

"There is currently no plan to be able to run Python dbt models on SQL warehouses. For such scenarios, if there is any use case that cannot be run on SQL warehouses, there is an option to run it on the all-purpose clusters."

To summarise - There is no plan/roadmap for running python/pyspark dbt models on SQL warehouses and to give an option of using Job Clusters with dbt models. The reason, as I mentioned earlier, is "dbt-databricks is optimized to work best against Databricks SQL warehouses as local development is typically carried out by users using Databricks SQL", and there is currently no plan to run python/pyspark dbt models on it.

Not really satisfying.