CorsettiS commented 4 months ago

Description

I am currently working in a company that is using an in-house solution to run dbt-like transformations on databricks and I am trying to switch to dbt & cosmos. The issue arrives in the fact that the current solution has an extra level of control which is the ability to spin up a cluster with custom resources, run the transformation in that cluster and, once concluded, the cluster stops. The time wasted spinning up clusters for singular SQL tasks is not a big concern, since the money saved on the process is tremendous in comparison with deploying dbt models in an all-purpose cluster or in the databricks SQL warehouse, which is what dbt supports at the moment.

General Idea

For dbt-databricks DAGs, there should be an extra operator with the option to pass a json file as a parameter mapping cluster resources for individual nodes
The dbt node and its corresponding tests should run in the same cluster

Implementation ideas

Functionality

Allow cosmos to call databricks API for some custom Operator (for example, DbtDatabricksClusterOperator)

airflow DatabricksSubmitRunOperator allows to submit dbt_task commands from it. The issue I saw is that the PR that added this feature hard-codes it to only work with git sync, which is the opposite direction of cosmos. I can submit a PR later on airflow repo to change that, but for now as a solution I can think of is directly calling databricks API which allows us to specify the project path. The drawback I can think about is that in the case the user is using the manifest.json to trigger jobs, this approach would not work by default. An alternative would be to run a task's query as a databricks sql_task API request, which would be a more universal approach since cosmos is essentially doing dbt's job by rendering the query in the first place.

add an extra field (cluster_resources) to this operator

The custom operator would have an extra field where we can map the dbt task name with the cluster configs it should have. one of the fields can be called default and if no task cluster config is given, the task should use the default resources.

DbtDatabricksClusterOperator(..., cluster_resources = { "default" : { "new_cluster" : {...}}, "my_task_1" : {"new_cluster":{...}}, ... } )

Observation

This is the most straightforward idea I could think of, but it has been a long time I do not touch the project and I am more than open for suggestions before I even start working on it. I thought about other ideas, such as maybe deploying the models to different warehouses based on tags (low, medium, high), spinning up the cluster with a custom task and shutting it down at the end of a successful task and so on, but all of them had implementation problems. What do you think @tatiana ? Any feedback before I start implementing? or recommendations? The project changed a lot since the last time I dealt with it, so maybe there is a better way I am not aware of.

Other useful links

specifying compute resources for nodes in the dbt_project.yaml

Use case/motivation

There is a huge price difference depending on how you deploy your dbt model on databricks, and unfortunately dbt does not support the most economical way possible : creating a cluster per task that is terminated once the task is done
when you have an all-purpose data warehouse for dbt transformations, you will have to deal with dbt nodes competing for resources
It is easier to track resource consumption per dbt node when you map one task to one cluster

Related issues

No response

Are you willing to submit a PR?

[X] Yes, I am willing to submit a PR!

pankajkoti commented 3 months ago

Hi, I'm wondering if this was closed prematurely without the feature being fully completed.

I'm re-opening this issue, but please feel free to enlighten me otherwise.

CorsettiS commented 3 months ago

@pankajkoti it was indeed closed prematurely. I have contacted databricks support to discuss about it just to be left frustrated with the conversation though. so basically databricks does not support at all interacting with catalogs in job clusters with dbt-databricks. There is a work around that, which is by using dbt-spark itself. The problem comes that the level of customization required is very high just to be able to run the dbt task itself. I was able to make it work but the whole process felt hacky and not very resilient, so until databricks gives a proper support for dbt (which they said they are planning to in the future) I will close this issue for now