databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Expecting the all-purpose cluster to stop being utilised when job cluster detail is configured #576

Open tade0726 opened 7 months ago

tade0726 commented 7 months ago

Describe the bug

When I try the feature of running Python models on a job cluster instead of the all-purpose cluster, both clusters are triggered: the all-purpose cluster starts first, and only then does the job cluster follow.

Steps To Reproduce

  1. Configure the all-purpose cluster details in profiles.yml
  2. Configure the job cluster in dbt_project.yml, following the instructions at https://docs.getdbt.com/docs/build/python-models#specific-data-platforms (a sketch of the equivalent per-model configuration is shown below)
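
For reference, a minimal sketch of the equivalent per-model configuration via `dbt.config` inside the Python model; the cluster values and the upstream model name are placeholders, not taken from this issue:

```python
# some_python_model.py, a hypothetical model; cluster values are placeholders
def model(dbt, session):
    dbt.config(
        submission_method="job_cluster",  # run the Python part on a job cluster
        job_cluster_config={
            "spark_version": "13.3.x-scala2.12",  # placeholder
            "node_type_id": "i3.xlarge",          # placeholder
            "num_workers": 2,
        },
    )
    # return a DataFrame; dbt materializes it as a table
    return dbt.ref("upstream_model")  # hypothetical upstream model
```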

Expected behavior

Once the job cluster detail is configured, the all-purpose cluster should not be triggered.

Screenshots and log output

System information

The output of dbt --version:

Core:
  - installed: 1.7.3
  - latest:    1.7.7 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.7.2 - Update available!
  - spark:      1.7.1 - Up to date!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:

macOS (should be irrelevant)

The output of python --version:

Python 3.10.13

Additional context


benc-db commented 7 months ago

This is expected behavior, as python models are integrated into the rest of your dbt project using SQL (for example, on an incremental model, the merge behavior is conducted in SQL), and that SQL would be executed on the AP Cluster. We are investigating ways for python model behavior to be more 'spark-like', but for now I would say this is an enhancement request, rather than a bug, as it is consistent with the structure imposed by dbt-core.
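
For illustration, a rough sketch of what that split looks like for an incremental Python model (names and values are placeholders, not from this issue): the function body runs as Python on the configured cluster, while the incremental merge that dbt generates is SQL and runs on the all-purpose cluster from profiles.yml.

```python
import pyspark.sql.functions as F

def model(dbt, session):
    # This Python body executes on the cluster chosen for the Python model
    # (the job cluster, when submission_method="job_cluster" is configured).
    dbt.config(
        materialized="incremental",
        unique_key="id",  # placeholder key
    )
    src = dbt.ref("events")  # hypothetical upstream model
    if dbt.is_incremental:
        src = src.filter(F.col("loaded_at") > F.lit("2024-01-01"))  # placeholder filter
    return src

# dbt writes the returned DataFrame to a staging table and then issues the
# incremental MERGE as SQL; that SQL statement is what runs on the
# all-purpose cluster defined in profiles.yml.
```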

tade0726 commented 7 months ago

Thanks, @benc-db. That clears up my doubts.

leo-schick commented 7 months ago

@benc-db Would it be possible to use a simpler approach when running a Python model on a job cluster, like the following:

  1. dbt creates a new notebook for the Python model
  2. the new notebook is executed from dbt with the Python call dbutils.notebook.run("....") (see Run a Databricks notebook from another notebook) in its own process

I am not sure, but it looks to me like the strict separation between the execution (the dbt Python code) and the model execution (putting the model into an isolated space) is a bit oversized on Databricks job clusters, because the job will nevertheless run on Spark on the master node. But maybe I am not getting the full picture of this issue...
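
For concreteness, a minimal sketch of the kind of call being described; the notebook path, timeout, and arguments are placeholders, and `dbutils` is only defined inside a Databricks notebook or job context:

```python
# Hypothetical driver code running on the job cluster; dbutils is provided
# by the Databricks runtime, so this is not runnable outside Databricks.
result = dbutils.notebook.run(
    "/Shared/dbt_python_models/my_model",  # placeholder notebook path
    3600,                                  # timeout in seconds (placeholder)
    {"invocation_id": "example"},          # optional string arguments (placeholder)
)
print(result)  # whatever the child notebook returns via dbutils.notebook.exit(...)
```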

github-actions[bot] commented 1 month ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.