Open leo-schick opened 8 months ago
Python models are in fact the only models that are currently supported for running on job clusters, though it's part of an ongoing internal debate how much we want to support that. Why are you setting your type as spark instead of databricks?
Please continue supporting this. It is one of the only ways to have some control over the scaling of each model to a cluster. If everything used interactive clusters, we'd end up having to configure several different ones with different sizes. It's already unfortunate that SQL Warehouses can't scale vertically; at least the new ability to direct models to different warehouses helps.
@benc-db I use dbt-spark because dbt-databricks does not allow session-based authentication (as far as I can see in the documentation). Session profiles use plain pyspark without authentication, which makes it possible to run dbt with SQL models inside a job cluster. (This is a great money saver for us because we do not need an all-purpose cluster just to run dbt itself.) Take a look at #575 as well.
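For reference, a session-based dbt-spark profile looks roughly like this (a minimal sketch; the profile name and schema are placeholders):

```yaml
my_project:
  target: job_cluster
  outputs:
    job_cluster:
      type: spark
      method: session       # reuse the pyspark session already running on the cluster
      schema: analytics     # placeholder schema name
      host: NA              # not used by the session method, but required by dbt-core
```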
The idea behind this issue is that I would now like to run python models in the same job cluster, without having to create a second/third/… job cluster for each python model (which breaks, see above), but rather run the python model inside the job cluster notebook. The isolation could still be kept by running it inside its own thread. I understand that a python model run should be separated from the runner itself as far as possible, as the documentation says. But in fact, the real work does not (most of the time) happen in the python notebook itself but on the Spark cluster (the runner). So I would advocate for running the python models - somehow - in the dbt runner (e.g. in a separate thread).
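To illustrate the thread idea outside of dbt, a minimal sketch (the function, table name and model body are placeholders, not dbt code): the runner only coordinates, the heavy lifting happens on the Spark cluster, so a dedicated thread per python model keeps it isolated from the runner's own work.

```python
import threading

from pyspark.sql import SparkSession


def run_model_in_thread(model_fn, target_table: str) -> threading.Thread:
    """Run a python model body in its own thread against the already-running Spark session."""

    def _run():
        spark = SparkSession.builder.getOrCreate()  # session already running on the job cluster
        model_fn(spark).write.mode("overwrite").saveAsTable(target_table)

    thread = threading.Thread(target=_run)
    thread.start()
    return thread


# Placeholder model body: produces a small DataFrame and writes it to a table.
worker = run_model_in_thread(lambda spark: spark.range(10), "analytics.example_python_model")
worker.join()
```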
PS: most of my python models could be transformed into SQL models if Spark/Databricks SQL supported ways to:
So are the errors you are reporting here with dbt-databricks or dbt-spark? dbt-databricks does support python on job clusters, though apparently not in your desired approach of reusing the same job cluster. If you have a pitch for how to accomplish what you'd like (take a look at what exists here: https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/python_submissions.py), I'm happy to hear it.
The errors currently pop up in dbt-spark. I would like to use dbt-databricks instead, but that requires #575 to be done.
I think the issue is that, when you run the dbt runner inside a job cluster, one - somehow - cannot interact with the Databricks API (which is what `python_submissions.py` currently does). It seems that every time a call against the Databricks API is executed, I get the error message `Error 403 Invalid access token` (see samples above), no matter whether we use OAuth or token-based authentication.
I think there are two ways of continuing from here: either fix the authentication problem so that calls against the Databricks API work from inside a job cluster, or rewrite `python_submissions.py` to be independent of the Databricks API.

For the latter, I suggest the following (needs PoC testing):
- write the compiled python model code to a local path like `/tmp/dbt-pymodels/<project_name>/<run_id>/...`
- execute it via `pyspark` in a dedicated thread (I don't know if that would work); a rough sketch follows at the end of this comment

To handle the execution in a separate thread, I think the mara-pipelines module is a good example of how to do this: a `TaskProcess` is created, invoked and added to a local array in the runner, see here. I do not have deep insight into how the dbt runner works; probably there is already thread/process handling inside the dbt runner, which would save the need for an own thread implementation.
The concept I sketched above could potentially be implemented in dbt-spark, since it does not require the Databricks API or other custom Databricks resources as far as I can see. See https://github.com/dbt-labs/dbt-spark/discussions/407 as well.
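A rough sketch of what such a local submission helper could look like (illustrative only: the class name, the `submit` signature, and the assumption that the compiled model code expects a `spark` variable in scope are not part of any existing dbt-spark or dbt-databricks API):

```python
import os
import threading

from pyspark.sql import SparkSession


class SessionPythonJobHelper:
    """Illustrative helper: runs a compiled python model locally on the current
    job cluster, without any call to the Databricks REST API."""

    def __init__(self, project_name: str, run_id: str, root: str = "/tmp/dbt-pymodels"):
        # Mirrors the suggested layout /tmp/dbt-pymodels/<project_name>/<run_id>/...
        self.work_dir = os.path.join(root, project_name, run_id)
        os.makedirs(self.work_dir, exist_ok=True)

    def submit(self, model_name: str, compiled_code: str) -> None:
        # 1. Persist the compiled model code to the run's working directory.
        path = os.path.join(self.work_dir, f"{model_name}.py")
        with open(path, "w") as fh:
            fh.write(compiled_code)

        # 2. Execute it in a dedicated thread against the cluster's own Spark session.
        errors: list[BaseException] = []

        def _run() -> None:
            try:
                spark = SparkSession.builder.getOrCreate()
                with open(path) as src:
                    exec(src.read(), {"spark": spark})  # shared Spark session in scope
            except BaseException as exc:  # surface failures to the calling thread
                errors.append(exc)

        worker = threading.Thread(target=_run, name=f"dbt-py-{model_name}")
        worker.start()
        worker.join()
        if errors:
            raise errors[0]
```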
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.
Describe the feature
Currently, there is no way (at least that I know of) to run a python model inside a job cluster. It is possible to run SQL models inside a job cluster using dbt-spark; I described this roughly in #575. But when one wants to run a python model, this does not work. For that, I think there should be a new submission method, e.g. `session`, which runs python models in the same notebook session (maybe in an extra thread), not through a new notebook, since in a job cluster the creation of new notebooks seems not to be possible - see below.
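To make the proposal concrete, a python model could then opt in with a config along these lines (a sketch only: `session` is the proposed, not yet existing, method value, and the model body is a placeholder):

```python
def model(dbt, session):
    # Proposed: run this model inside the already-running notebook/Spark session
    # instead of submitting a separate notebook or job run.
    dbt.config(submission_method="session")

    df = dbt.ref("my_upstream_model")  # placeholder upstream model
    return df.where("amount > 0")      # placeholder transformation
```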
Describe alternatives you've considered

I tried the two submission methods `all_purpose_cluster` and `job_cluster`, but both do not work:

Submission method `all_purpose_cluster`

When using the submission method `all_purpose_cluster`, it does not work since it requires authentication information. When providing this through the profiles.yml config, it tells you that the access token is not valid (`Error 403 Invalid access token`). So, all-purpose cluster submission does not work since authentication is (somehow) not possible.
Submission method `job_cluster`

I also tried the submission method `job_cluster` (which is a bit ridiculous: starting a job cluster from a job cluster for this purpose). It does not work either: the call to the API endpoint `/api/2.0/workspace/mkdirs` fails. This is probably because it tries to create the python notebook on the cluster. When I set `create_notebook=False`, I again run into invalid access token issues; now the error message for the endpoint is more straightforward.
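For context, a model config mirroring the two settings mentioned above could look roughly like this (an illustrative sketch; the upstream model name is a placeholder):

```python
def model(dbt, session):
    # Mirrors the attempt described above: job cluster submission,
    # with create_notebook=False to skip the workspace notebook creation.
    dbt.config(
        submission_method="job_cluster",
        create_notebook=False,
    )
    return dbt.ref("my_upstream_model")  # placeholder upstream model
```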
Who will this benefit?

Everybody who wants to run python models on a job cluster, e.g. to reduce execution cost.
Are you interested in contributing this feature?
Yes, if I get assistance along the way, since I am not that deep into the dbt repositories.