databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Python Model on Job Cluster #586

Open leo-schick opened 8 months ago

leo-schick commented 8 months ago

Describe the feature

Currently, there is no way (at least none that I know of) to run a python model inside a job cluster. It is possible to run SQL models inside a job cluster using dbt-spark; I described this roughly in #575. But when one wants to run a python model, this does not work. For that, I think there should be a new submission method, e.g. session, which runs python models in the same notebook session (maybe in an extra thread) rather than through a new notebook, since creating new notebooks from within a job cluster does not seem to be possible - see below.
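A rough sketch of what such a session submission method might look like (hypothetical: it executes the compiled model code against the already-running Spark session in a worker thread; all names here are illustrative and not part of the adapter):

import threading
import traceback

from pyspark.sql import SparkSession


def run_model_in_session(compiled_code: str) -> None:
    """Hypothetical 'session' submission: execute the compiled model code
    in the already-running Spark session instead of uploading a notebook."""
    spark = SparkSession.builder.getOrCreate()
    errors = []

    def target():
        try:
            # dbt compiles a python model to a script defining model(dbt, session);
            # how dbt's context object would be wired in is left open here.
            exec(compiled_code, {"spark": spark})
        except Exception:
            errors.append(traceback.format_exc())

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join()
    if errors:
        raise RuntimeError(errors[0])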

Describe alternatives you've considered

I tried the two submission methods all_purpose_cluster and job_cluster, but neither works:

Submission method all_purpose_cluster

When using the submission method all_purpose_cluster, it fails because additional connection information (a cluster_id) is required:

14:07:36  Completed with 1 error and 0 warnings:
14:07:36  
14:07:36    Databricks cluster_id is required for all_purpose_cluster submission method with running with notebook.
14:07:36  

When I provide this via the profiles.yml config like this:

default:
      type: spark
      method: session
      schema: dbt
      host: $DATABRICKS_HOST
      auth_type: oauth
      client_id: $AZURE_CLIENT_ID
      client_secret: $AZURE_CLIENT_SECRET
      threads: 14

It then tells you that the access token is not valid:

14:53:09  
14:53:09    Runtime Error in model DimensionDepartmentTree (models/Finance/DimensionDepartmentTree.py)
  Error creating an execution context.
   b'<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>\n<title>Error 403 Invalid access token.</title>\n</head>\n<body><h2>HTTP ERROR 403</h2>\n<p>Problem accessing /api/1.2/contexts/create. Reason:\n<pre>    Invalid access token.</pre></p>\n</body>\n</html>\n'
14:53:09  

So, all_purpose_cluster submission does not work, since authentication is (somehow) not possible.
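For reference, the target cluster can also be pinned per model from inside the python model itself; a hedged sketch (the submission_method and cluster_id config keys are assumptions based on the error message above and the dbt docs; all values are placeholders):

# models/Finance/DimensionDepartmentTree.py - illustrative only
def model(dbt, session):
    dbt.config(
        submission_method="all_purpose_cluster",  # assumed config key
        cluster_id="0123-456789-abcdefgh",        # placeholder; the id the error asks for
    )
    return session.sql("select 1 as id")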

Submission method job_cluster

I tried the submission method job_cluster (which is somewhat absurd: starting a job cluster from within a job cluster for this purpose). It does not work either:

The call to the API endpoint /api/2.0/workspace/mkdirs fails:

15:33:35  Unhandled error while executing target/run/databricks_dwh/models/Finance/DimensionDepartmentTree.py
HTTPSConnectionPool(host='na', port=443): Max retries exceeded with url: /api/2.0/workspace/mkdirs (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fbde6170a90>: Failed to establish a new connection: [Errno -5] No address associated with hostname'))

This is probably because it tries to create the python notebook in the workspace. When I set create_notebook=False, I run into invalid access token issues again. Now the error message for the endpoint is more straightforward:

15:45:06    Runtime Error in model DimensionDepartmentTree (models/Finance/DimensionDepartmentTree.py)
  Error creating work_dir for python notebooks
   b'<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>\n<title>Error 403 Invalid access token.</title>\n</head>\n<body><h2>HTTP ERROR 403</h2>\n<p>Problem accessing /api/2.0/workspace/mkdirs. Reason:\n<pre>    Invalid access token.</pre></p>\n</body>\n</html>\n'
15:45:06  
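For context, this is roughly how a job_cluster submission is configured per model; a hedged sketch (the create_notebook and job_cluster_config keys are as described in the dbt-databricks docs; the cluster values are placeholders):

# Illustrative python model pinned to a fresh job cluster; all values are placeholders.
def model(dbt, session):
    dbt.config(
        submission_method="job_cluster",
        create_notebook=False,  # upload a plain python file instead of a notebook
        job_cluster_config={
            "spark_version": "14.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    )
    return session.table("finance.department_tree").limit(10)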

Who will this benefit?

Everybody who wants to run python models on a job cluster, e.g. to reduce execution costs.

Are you interested in contributing this feature?

Yes, if I get assistance along the way, since I am not that deep into the dbt repositories.

benc-db commented 8 months ago

Python models are in fact the only models that are currently supported for running on job clusters, though it's part of an ongoing internal debate how much we want to support that. Why are you setting your type as spark instead of databricks?

kmarq commented 8 months ago

> Python models are in fact the only models that are currently supported for running on job clusters, though it's part of an ongoing internal debate how much we want to support that. Why are you setting your type as spark instead of databricks?

Please continue supporting this. It is one of the only ways to have some control over scaling each model onto an appropriately sized cluster. If everything used interactive clusters, we'd end up having to maintain several different ones with different sizes configured. It's already unfortunate that SQL Warehouses can't scale vertically; at least the new ability to direct models to different warehouses helps.

leo-schick commented 8 months ago

@benc-db I use dbt-spark because dbt-databricks does not allow session-based authentication (as far as I can see in the documentation). Session profiles just use pyspark without authentication, which makes it possible to run dbt with SQL models inside a job cluster. (That is a great money saver for us, because we do not need an all-purpose cluster to run dbt itself.) Take a look at #575 as well.

The idea behind this issue is that I would now like to run python models in the same job cluster, without having to create a second/third/… job cluster for each python model (which breaks, see above), and instead run the python model inside the job cluster notebook. Isolation could still be maintained by running it in its own thread. I understand that, as the documentation says, a python model run should be separated from the runner itself as far as possible. But in fact, most of the time the real work does not happen in the python notebook itself but on the Spark cluster (the runner). So I would advocate for running the python models - somehow - in the dbt runner (e.g. in a separate thread).

PS: most of my python models could be transformed into SQL models if Spark/Databricks SQL supported ways to:

benc-db commented 8 months ago

So are the errors you are reporting here from dbt-databricks or dbt-spark? dbt-databricks does support python on job clusters, though apparently not with your desired approach of reusing the same job cluster. If you have a pitch for how to accomplish what you'd like (take a look at what exists here: https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/python_submissions.py), I'm happy to hear it.
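For orientation, the submitters in that file drive the Databricks workspace and jobs REST APIs; below is a simplified sketch of the flow using plain requests, with endpoints taken from the logs above and the public REST API reference - it is not the adapter's actual code, and host/token are placeholders:

import base64

import requests

HOST = "https://example.azuredatabricks.net"   # placeholder workspace URL
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credentials


def submit_python_model(notebook_path: str, code: str, cluster_spec: dict) -> int:
    # 1. Ensure the target directory exists (the call that returns 403 above).
    requests.post(
        f"{HOST}/api/2.0/workspace/mkdirs",
        headers=HEADERS,
        json={"path": "/dbt_python_models"},
    ).raise_for_status()

    # 2. Upload the compiled model code as a source notebook.
    requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers=HEADERS,
        json={
            "path": notebook_path,
            "content": base64.b64encode(code.encode()).decode(),
            "language": "PYTHON",
            "format": "SOURCE",
            "overwrite": True,
        },
    ).raise_for_status()

    # 3. Submit a one-time run that spins up a fresh job cluster.
    response = requests.post(
        f"{HOST}/api/2.1/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "dbt python model",
            "tasks": [{
                "task_key": "model",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": cluster_spec,
            }],
        },
    )
    response.raise_for_status()
    return response.json()["run_id"]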

leo-schick commented 8 months ago

The error currently pops up in dbt-spark. I would like to use dbt-databricks instead, but that requires #575 to be done.

I think the issue is that when you run the dbt runner inside a job cluster, one - somehow - cannot interact with the Databricks API (which is what python_submissions.py currently does). It seems that every time a call against the Databricks API is executed, I get the error message Error 403 Invalid access token (see the samples above), no matter whether we use OAuth or token-based authentication.

I think there are two ways of continuing from here:

  1. Contact Databricks and figure out why we can't use the Databricks API inside a job cluster.
  2. Change python_submissions.py to be independent of the Databricks API.

For the latter, I suggest the following (needs PoC testing):

To handle the execution in a separate thread, I think the mara-pipelines module is a good example of how to do this:
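For illustration, a minimal thread-based sketch of this kind of isolation (a simplification under my own assumptions, not mara-pipelines' actual implementation; all names are illustrative):

import threading


def run_isolated(fn, *args, timeout=None):
    """Run fn in its own thread and re-raise any exception in the caller."""
    result, errors = [], []

    def target():
        try:
            result.append(fn(*args))
        except BaseException as exc:  # capture to re-raise in the calling thread
            errors.append(exc)

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise TimeoutError("model execution timed out")
    if errors:
        raise errors[0]
    return result[0]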

I do not have deep insight into how the dbt runner works. Probably there is already thread/process handling inside the dbt runner, which would save us from implementing our own threading.

The concept I sketched above could potentially be implemented in dbt-spark as well, since as far as I can see it does not require the Databricks API or other custom Databricks resources. See https://github.com/dbt-labs/dbt-spark/discussions/407 as well.

github-actions[bot] commented 2 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.