Closed: neozenith closed this issue 2 years ago.
Thanks for the awesome bug report @neozenith! The Workflows UC support is being looked into.
You've set the standard for bug reports. Thanks @neozenith !!
@neozenith This should work, and I've verified it working. A couple of questions to try to debug this:

- What cluster type and cluster config are you using?
- Which access mode?
- Which DBR version?
- Who is the owner / run-as user?
All this complexity is why we default to running against a SQL warehouse together with a small SingleNode cluster.
Hi @KoningJasper

Thank you for following up! As per the dot points you mentioned, I tried the following without success:
- Cluster type: Job
- Cluster config: Single Node, m5d.large, 0 workers
- Access Mode: Single User (can't use Shared on a Single Node cluster, so did not test)
- DBR: 11.2
- Owner / Run As: josh.peak@mycompany.com (mycompany.com stands in for the domain of the corporate client I am engaged with)
profiles.yml:

```yaml
dbt_poc:
  target: poc
  outputs:
    poc:
      type: databricks
      catalog: hive_metastore
      schema: experiments
      connect_retries: 5
      connect_timeout: 180
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
      threads: 8
```
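For comparison, here is the SQL Warehouse variant of this same profile that does work for me locally (a sketch; the endpoint ID and token are placeholders):

```yaml
# Working local variant: http_path points at a SQL Warehouse endpoint
# and the token is my PAT (both values are placeholders here).
dbt_poc:
  target: poc
  outputs:
    poc:
      type: databricks
      catalog: hive_metastore
      schema: experiments
      host: mycompany.cloud.databricks.com
      http_path: "/sql/1.0/endpoints/<endpoint_id>"
      token: "<personal-access-token>"
      threads: 8
```

With the cluster endpoint profile above, `dbt debug` inside the Workflow gives: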
```
+ dbt debug
22:51:28 Running with dbt=1.2.1
dbt version: 1.2.1
python version: 3.9.5
python path: /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python
os info: Linux-5.4.0-1084-aws-x86_64-with-glibc2.31
Using profiles.yml file at /tmp/tmp-dbt-run-768464/<REDACTED-REPO-PATH>/profiles/poc/profiles.yml
Using dbt_project.yml file at /tmp/tmp-dbt-run-768464/<REDACTED-REPO-PATH>/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: mycompany.cloud.databricks.com
  http_path: sql/protocolv1/o/<REDACTED-ORG-ID>/<REDACTED-CLUSTER-ID>
  catalog: hive_metastore
  schema: experiments
  Connection test: [ERROR]

1 check failed:
dbt was unable to connect to the specified database.
The database returned the following error:
  >Runtime Error
   Database Error
    failed to connect

Check your database credentials and try again. For more information, visit:
https://docs.getdbt.com/docs/configure-your-profile
```
As mentioned earlier, my user is assigned to a data engineer group whose permissions let this same dbt project run successfully against Unity Catalog using SQL Warehouses, so my user has the correct permissions. I am just trying to run it through a job cluster using the cluster endpoint (as documented) instead of SQL Warehouses.

I believe you that you managed to get it working on your end. I guess the task now is to narrow down the difference between my configuration and yours.
@neozenith Ah, that explains it. You cannot run dbt (or any other BI workload) against job clusters; Thrift isn't available on them. Could you try with an interactive cluster instead?

We should reject this with a clearer message though. I'll follow up on that.
I had the same result with an interactive all-purpose cluster too. I was actually using that first, together with the web terminal, to debug the missing profiles.yml and dbt_project.yml, until we enabled the Files in Repos feature.

I have raised a separate issue with our Partners contact that the Files in Repos feature can't be enabled on Standard or Premium plans; it is only because the client is on an Enterprise plan that the feature toggle appeared. I digress though.

I will double-check my cluster specs for the all-purpose interactive cluster later tonight, but I think it had the same specs. So the interactive cluster is trying to use the Thrift connections... I'll look deeper at those docs too, to see if I have missed a configuration.
@neozenith Hmmm, that is interesting. I got it to run with this dummy project: https://github.com/KoningJasper/dbt-something. I used the following job and cluster specs:
```json
{
  "name": "dbt-uc",
  "email_notifications": {
    "no_alert_for_skipped_runs": false
  },
  "webhook_notifications": {},
  "timeout_seconds": 0,
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "dbt-uc",
      "dbt_task": {
        "project_directory": "",
        "commands": [
          "dbt deps",
          "dbt run"
        ]
      },
      "existing_cluster_id": "0927-120709-po31157w",
      "libraries": [
        {
          "pypi": {
            "package": "dbt-databricks>=1.0.0,<2.0.0"
          }
        }
      ],
      "timeout_seconds": 0,
      "email_notifications": {}
    }
  ],
  "git_source": {
    "git_url": "https://github.com/KoningJasper/dbt-something.git",
    "git_provider": "gitHub",
    "git_branch": "main"
  },
  "format": "MULTI_TASK"
}
```
```json
{
  "num_workers": 0,
  "cluster_name": "jasper.koning@databricks.com's Cluster",
  "spark_version": "11.1.x-scala2.12",
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*, 4]"
  },
  "aws_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK",
    "zone_id": "auto",
    "spot_bid_price_percent": 100,
    "ebs_volume_count": 0
  },
  "node_type_id": "i3.xlarge",
  "driver_node_type_id": "i3.xlarge",
  "ssh_public_keys": [],
  "custom_tags": {
    "ResourceClass": "SingleNode"
  },
  "spark_env_vars": {},
  "autotermination_minutes": 120,
  "enable_elastic_disk": true,
  "cluster_source": "UI",
  "init_scripts": [],
  "single_user_name": "jasper.koning@databricks.com",
  "data_security_mode": "SINGLE_USER",
  "runtime_engine": "PHOTON",
  "cluster_id": "0927-120709-po31157w"
}
```
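If it helps to reproduce this exactly, the specs can be registered with the (legacy) Databricks CLI, roughly like this (the file names are hypothetical):

```sh
# Hypothetical file names; the payloads are the two JSON blobs above.
# The cluster spec would need the output-only fields (cluster_id,
# cluster_source) stripped before creation.
databricks clusters create --json-file dbt-uc-cluster.json
databricks jobs create --json-file dbt-uc-job.json
```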
@neozenith wait... Why is your catalog `hive_metastore`?
I will number these for ease of following up:

1. Our catalog is `hive_metastore` since we had existing databases/schemas, and that is where they were migrated when converting to Unity Catalog. This should not be an issue, since it worked totally fine with SQL Warehouses but not with the dbt Job.
2. Could it matter that I am running `dbt debug` instead of running `dbt deps` and `dbt run`?

Thanks for the reproducible project and job spec 👍

The project I am using is just the example from `dbt init` so far.
@neozenith It should default to the metastore; could you try it with that setting removed? I get this output:
```
+ dbt debug
11:31:58 Running with dbt=1.2.1
dbt version: 1.2.1
python version: 3.9.5
python path: /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python
os info: Linux-5.4.0-1084-aws-x86_64-with-glibc2.31
Using profiles.yml file at /tmp/tmp-dbt-run-140543184104505/profiles.yml
Using dbt_project.yml file at /tmp/tmp-dbt-run-140543184104505/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: e2-dogfood.staging.cloud.databricks.com
  http_path: sql/protocolv1/o/6051921418418893/0927-120709-po31157w
  catalog: lineage
  schema: jasper
  Connection test: [OK connection ok]

All checks passed!
```
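Concretely, for your profile that would mean dropping the `catalog` line, something like this (a sketch based on the profile you posted above):

```yaml
poc:
  type: databricks
  # catalog: hive_metastore  <- removed, so it falls back to the default metastore
  schema: experiments
  host: "{{ env_var('DBT_HOST') }}"
  http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
  token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
```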
OK, so I tried again with an all-purpose cluster using DBR 11.1, with no luck.

It turns out `hive_metastore` is not yet upgraded to a Unity Catalog database. 😞 So that is likely part of my issue.

**Experiment 1**

So I tried another database in the default created catalog `main`:

```yaml
catalog: main
database: experiments
```

Still nothing, possibly because my user does not have `CREATE` permissions to create databases.

**Experiment 2**

So I tried a database that does exist:

```yaml
catalog: main
database: default
```

Also no luck... which is super weird, because this works fine on our SQL Warehouses... which leads me to think our SQL Warehouses may have more privileges than they should.
Otherwise I am using the same `profiles.yml` that you do.

My next step is to get a notebook attached to this cluster with my user and triage what permissions it has against the catalogs and databases. It comes down to how my user and the catalog permissions are configured; I will post those triage steps once I figure them out. I'll also talk to our workspace admin in the morning, as they may be able to shed some light.
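Roughly the sort of thing I plan to run from that notebook (a sketch; the schema names are just the ones from the experiments above):

```sql
-- Permission triage from a notebook attached to the same cluster,
-- run as the same user the dbt task runs as.
SHOW CATALOGS;
SHOW SCHEMAS IN main;
SHOW GRANTS `josh.peak@mycompany.com` ON SCHEMA main.default;
```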
It's looking like the outcome of this issue might be me opening a PR against that guide to add my troubleshooting steps.
@neozenith I just tried it on another workspace where I'm not admin, and it works. 😕 The `tmp` database already exists and I have `CREATE` and `USAGE` permissions on it.
```
+ dbt debug
12:52:42 Running with dbt=1.2.1
dbt version: 1.2.1
python version: 3.9.5
python path: /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python
os info: Linux-5.4.0-1090-azure-x86_64-with-glibc2.31
Using profiles.yml file at /tmp/tmp-dbt-run-734958738/profiles.yml
Using dbt_project.yml file at /tmp/tmp-dbt-run-734958738/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: <think>.azuredatabricks.net
  http_path: sql/protocolv1/o/<org-id>/<c-id>
  catalog: main
  schema: tmp
  Connection test: [OK connection ok]

All checks passed!
```
Hi @neozenith, did you have any chance to debug this further and maybe learn some more? I'm very interested in finding out the root cause of this.
No further progress, sorry. The workspace admins have milestones they need to get out this week, so I am going to spend some time with them next week to continue the triage.
Managed to triage this a bit further; thank you for your patience.
This is the critical triage step:
```sql
SHOW GRANTS `josh.peak@mycompany.com` ON SCHEMA hive_metastore.experiments
```
I ran this on a standard cluster that was failing and also on our SQL Warehouses that were working.
On the SQL Warehouse I could see I was associated with a `GROUP` that had the `USAGE` and `CREATE` permissions.
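For reference, those grants would have been set up with something like the following (a sketch; `data-engineers` is a hypothetical group name, not our real one):

```sql
-- Hypothetical group name; privileges matching what SHOW GRANTS
-- reported for my user on the SQL Warehouse.
GRANT USAGE ON SCHEMA hive_metastore.experiments TO `data-engineers`;
GRANT CREATE ON SCHEMA hive_metastore.experiments TO `data-engineers`;
```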
On the standard cluster I got the following error message:
```
Error in SQL statement: SparkException: Trying to perform permission action on Hive Metastore /CATALOG/`hive_metastore` but Table Access Control is not enabled on this cluster.
```
So the next step is to enable Table Access Control for our clusters: https://docs.databricks.com/security/access-control/table-acls/table-acl.html
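Per those docs, that roughly means adding the table-ACL Spark confs to the cluster spec, something like this (a sketch, not our exact config):

```json
{
  "spark_conf": {
    "spark.databricks.acl.dfAclsEnabled": "true",
    "spark.databricks.repl.allowedLanguages": "python,sql"
  }
}
```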
Pretty sure this is the missing step: the user is not actually linked to its permission group on the cluster, whereas we have that linkage set up correctly on the SQL Warehouse endpoint.
Happy to close this issue for now.
If I get the time, I'll open a PR to update the guide with some triage steps once we get this working, and reference this as the original issue.
## Context

Following the instructions outlined for running `dbt-core` as a `dbt Task` Workflow here: https://github.com/databricks/dbt-databricks/blob/main/docs/databricks-workflows.md

## Triage Steps

### ✅ Local Dev + SQL Warehouses works
I can get this to work locally when I specify the `http_path` to be a SQL Warehouse endpoint and I use my PAT injected into the `profiles.yml`. This works because my PAT and the SQL Warehouses are assigned to the correct permission groups in our workspace.

### 😭 dbt Workflows does not work when...
This is because the cluster generates credentials generically for `DBT_HOST`, `DBT_ORG_ID`, `DBT_CLUSTER_ID`, and `DBT_ACCESS_TOKEN`. It seems that access token does not have the correct permission group associated, and there is no way to add the association.
I tried hard-coding a PAT that I know works, but instead of using the `/sql/1.0/endpoints/<endpoint_id>` path I tried it against the cluster endpoint `"sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"`. This also did not work.
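Roughly, the profile fragment for that test looked like this (a sketch; the IDs and token are placeholders):

```yaml
# Known-good PAT (works against the SQL Warehouse),
# pointed at the job cluster endpoint instead.
http_path: "sql/protocolv1/o/<org_id>/<cluster_id>"
token: "<personal-access-token>"
```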
I could go back to having the workflow use the SQL Warehouse endpoint and a PAT, but that defeats the purpose of spinning up a Spark cluster for a job only to send the work to another (more expensive) Spark cluster, right?
## Triage Summary

Here is a summary table of the test cases:

| `http_path` | Token | Result |
| --- | --- | --- |
| `/sql/1.0/endpoints/<endpoint_id>` (SQL Warehouse) | My PAT | ✅ Works |
| `sql/protocolv1/o/<org_id>/<cluster_id>` (cluster) | Generated `DBT_ACCESS_TOKEN` | 😭 Fails |
| `sql/protocolv1/o/<org_id>/<cluster_id>` (cluster) | My PAT | 😭 Fails |

Definitions:

- `DBT_ACCESS_TOKEN`: the access token the dbt task generates for the job run
- `/sql/1.0/endpoints/<endpoint_id>`: a SQL Warehouse endpoint
- `"sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"`: the cluster endpoint
## Further digging

It seems the `dbt-databricks` team have left a `TODO` with a similar conundrum, so it seems we are all on the same page: https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/connections.py#L408-L420
## Expected Behaviour

The correct permission groups are associated when the `DBT_ACCESS_TOKEN` is generated.

## Extra
I do realise that my expectations lie with the Databricks team and not with this library, but it also makes sense to track this issue in public so that others can follow along, as it will undoubtedly link to some code changes too.

I have an active use case I am piloting with a client, so I am very keen to act as a beta tester in our dev environment, where there is low risk and we can provide real-world feedback.