Closed: CorsettiS closed this 3 months ago
Hi, I'm wondering if this was closed prematurely without the feature being fully completed.
I'm re-opening this issue, but please feel free to enlighten me otherwise.
@pankajkoti it was indeed closed prematurely. I contacted Databricks support to discuss it, only to be left frustrated by the conversation. Essentially, Databricks does not support interacting with catalogs on job clusters through dbt-databricks at all. There is a workaround, which is to use dbt-spark itself, but the level of customization required just to run the dbt task is very high. I was able to make it work, but the whole process felt hacky and not very resilient, so until Databricks provides proper support for dbt (which they said they are planning to do in the future) I will close this issue for now.
Thanks for the details and context @CorsettiS 🙂🙌
Description
I am currently working at a company that uses an in-house solution to run dbt-like transformations on Databricks, and I am trying to switch to dbt & Cosmos. The issue is that the current solution offers an extra level of control: the ability to spin up a cluster with custom resources, run the transformation on that cluster, and stop the cluster once it finishes. The time spent spinning up clusters for individual SQL tasks is not a big concern, since the money saved is tremendous compared with deploying dbt models on an all-purpose cluster or on a Databricks SQL warehouse, which is what dbt supports at the moment.
General Idea
Implementation ideas
Functionality
Airflow's DatabricksSubmitRunOperator allows submitting dbt_task commands. The issue I saw is that the PR that added this feature hard-codes it to work only with git sync, which is the opposite of the direction Cosmos takes. I can submit a PR to the Airflow repo later to change that, but for now the solution I can think of is calling the Databricks API directly, which allows us to specify the project path. The drawback I see is that if the user relies on the `manifest.json` to trigger jobs, this approach would not work by default. An alternative would be to run each task's query as a Databricks sql_task API request, which would be a more universal approach, since Cosmos is essentially doing dbt's job by rendering the query in the first place.
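For illustration, here is a minimal sketch of submitting a dbt_task on an ephemeral job cluster via the Databricks Jobs API 2.1 `runs/submit` endpoint. This is not an existing Cosmos feature; the host, token, project path, and cluster values are placeholders, and the exact payload shape should be checked against the Jobs API docs:

```python
# Rough sketch: one-off run of a dbt_task on a job cluster that is created
# for this run and terminated when it finishes. All values are placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-pat>"

payload = {
    "run_name": "cosmos-dbt-run",
    "tasks": [
        {
            "task_key": "my_task_1",
            "dbt_task": {
                "commands": ["dbt run --select my_model"],
                "project_directory": "/Workspace/Repos/<...>/my_dbt_project",
            },
            # dbt needs to be installed on the job cluster.
            "libraries": [{"pypi": {"package": "dbt-databricks"}}],
            # The ephemeral cluster spec: this is where per-task resources go.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])
```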
Adding a new field (`cluster_resources`) to this operator
The custom operator would have an extra field where we can map each dbt task name to the cluster config it should use. One of the keys can be called `default`, and if no task-specific cluster config is given, the task should use the default resources.

```python
DbtDatabricksClusterOperator(
    ...,
    cluster_resources={
        "default": {"new_cluster": {...}},
        "my_task_1": {"new_cluster": {...}},
        ...
    },
)
```
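To make the fallback behavior concrete, here is a minimal sketch of how the operator could resolve a cluster spec per task. `resolve_cluster_config` is a hypothetical helper, not an existing Cosmos or Airflow API:

```python
# Hypothetical resolution logic for the proposed cluster_resources mapping:
# fall back to the "default" entry when a task has no dedicated config.
from typing import Any


def resolve_cluster_config(
    cluster_resources: dict[str, dict[str, Any]], task_name: str
) -> dict[str, Any]:
    """Return the new_cluster spec for task_name, or the default spec."""
    config = cluster_resources.get(task_name, cluster_resources["default"])
    return config["new_cluster"]


# Example, mirroring the mapping shown above:
cluster_resources = {
    "default": {"new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 1}},
    "my_task_1": {"new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 4}},
}
assert resolve_cluster_config(cluster_resources, "my_task_1")["num_workers"] == 4
assert resolve_cluster_config(cluster_resources, "unmapped_task")["num_workers"] == 1
```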
Observation
This is the most straightforward idea I could think of, but it has been a long time since I touched the project and I am more than open to suggestions before I even start working on it. I thought about other ideas, such as deploying the models to different warehouses based on tags (`low`, `medium`, `high`), or spinning up the cluster with a custom task and shutting it down at the end of a successful run, but all of them had implementation problems. What do you think @tatiana? Any feedback or recommendations before I start implementing? The project has changed a lot since the last time I dealt with it, so maybe there is a better way I am not aware of.
Other useful links
specifying compute resources for nodes in the dbt_project.yaml
Use case/motivation
- There is a huge price difference depending on how you deploy your dbt models on Databricks, and unfortunately dbt does not support the most economical option: creating a cluster per task that is terminated once the task is done.
- When you use an all-purpose data warehouse for dbt transformations, you have to deal with dbt nodes competing for resources.
- It is easier to track resource consumption per dbt node when you map one task to one cluster.
Related issues
No response
Are you willing to submit a PR?