databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Generic DBT_ACCESS_TOKEN in dbt Workflow does not have access to catalogs #171

Closed · neozenith closed this 2 years ago

neozenith commented 2 years ago

Context

Triage Steps

Following the instructions outlined for running dbt-core as a dbt Task Workflow here: https://github.com/databricks/dbt-databricks/blob/main/docs/databricks-workflows.md

✅ Local Dev + SQL Warehouses works

I can get this to work locally when I specify the http_path to be a SQL Warehouse endpoint and inject my PAT into the profiles.yml. This works because my PAT and the SQL Warehouses are assigned to the correct permission groups in our workspace.
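
For reference, a minimal sketch of the locally working profiles.yml might look like this (placeholder values rather than the real workspace details; MY_DATABRICKS_PAT is a hypothetical env var name for my PAT):

# Sketch only: local profile pointing http_path at a SQL Warehouse endpoint
dbt_poc:
  target: local
  outputs:
    local:
      type: databricks
      catalog: hive_metastore
      schema: experiments
      host: mycompany.cloud.databricks.com
      # SQL Warehouse endpoint; works with my PAT
      http_path: /sql/1.0/endpoints/<endpoint_id>
      # PAT injected via an env var (hypothetical name)
      token: "{{ env_var('MY_DATABRICKS_PAT') }}"
      threads: 8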

😭 dbt Workflows does not work when...

This is because the cluster generates credentials generically for:

I tried hard-coding a PAT that I know works, but instead of using /sql/1.0/endpoints/<endpoint_id> I pointed it at the cluster endpoint "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}".

This also did not work.
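
For clarity, here is a sketch of the two http_path values I tested (the IDs and env vars are as used elsewhere in this thread; the comments are mine):

# Variant A: SQL Warehouse endpoint - works locally with my PAT
http_path: /sql/1.0/endpoints/<endpoint_id>
---
# Variant B: cluster endpoint - did not work, even with that same PAT
http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"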

I could go back to having the workflow use the SQL Warehouse endpoint and a PAT, but that defeats the purpose of spinning up a Spark cluster for the job only to send the work to another (more expensive) Spark cluster, right?

Triage Summary

Here is a summary table of test cases

| token type | endpoint type | outcome |
| --- | --- | --- |
| PAT | SQL Warehouse | ✅ |
| Generic Access Token | Cluster Endpoint | 😭 |
| PAT | Cluster Endpoint | 😭 |

Definitions:

Further digging:

It seems the dbt-databricks team has left a TODO about a similar conundrum, so we are all on the same page.

https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/connections.py#L408-L420

                # TODO: what is the error when a user specifies a catalog they don't have access to
                conn: DatabricksSQLConnection = dbsql.connect(
                    server_hostname=creds.host,

Expected Behaviour

Extra

I do realise that my expectations lie with the Databricks team and not with this library. However, it also makes sense to track this issue in public so that others can follow along, as it will undoubtedly link to some code changes too.

I have an active use case I am piloting with a client, so I am very keen to act as a beta tester. It would be in our dev environment, so the risk is low and we can provide real-world feedback.

superdupershant commented 2 years ago

Thanks for the awesome bug report @neozenith! Unity Catalog support for Workflows is being looked into.

bilalaslamseattle commented 2 years ago

You've set the standard for bug reports. Thanks @neozenith !!

KoningJasper commented 2 years ago

@neozenith This should work, and I've verified it myself. A couple of questions to help debug this:

  1. What kind of cluster is it: Job cluster or Interactive cluster?
  2. What kind of cluster access mode are you using? UC only works on certain modes: https://docs.databricks.com/data-governance/unity-catalog/index.html#cluster-security-mode
  3. What DBR version is this cluster running? It needs to be at least 11.1
  4. Who is the owner of the job / run As?

All this complexity is why we default to running against a SQL warehouse together with a small SingleNode cluster.

neozenith commented 2 years ago

Hi @KoningJasper

Thank you for following up! As per the points you mentioned, I tried the following without success:

Configuration

- Cluster type: Job
- Cluster config: Single Node, m5d.large, 0 workers
- Access Mode: Single User (can't use Shared on a Single Node cluster, so did not test)
- DBR: 11.2
- Owner / Run As: josh.peak@mycompany.com (domain replaced; the real one is the corporate client I am engaged with)

profiles.yml

dbt_poc:
  target: poc
  outputs:
    poc:
      type: databricks
      catalog: hive_metastore
      schema: experiments
      connect_retries: 5
      connect_timeout: 180
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
      threads: 8

Output

+ dbt debug
22:51:28  Running with dbt=1.2.1
dbt version: 1.2.1
python version: 3.9.5
python path: /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python
os info: Linux-5.4.0-1084-aws-x86_64-with-glibc2.31
Using profiles.yml file at /tmp/tmp-dbt-run-768464/<REDACTED-REPO-PATH>/profiles/poc/profiles.yml
Using dbt_project.yml file at /tmp/tmp-dbt-run-768464/<REDACTED-REPO-PATH>/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: mycompany.cloud.databricks.com
  http_path: sql/protocolv1/o/<REDACTED-ORG-ID>/<REDACTED-CLUSTER-ID>
  catalog: hive_metastore
  schema: experiments
  Connection test: [ERROR]

1 check failed:
dbt was unable to connect to the specified database.
The database returned the following error:

  >Runtime Error
  Database Error
    failed to connect

Check your database credentials and try again. For more information, visit:
https://docs.getdbt.com/docs/configure-your-profile

Miscellaneous Triage

As mentioned earlier, my user is assigned to a data engineer group with the relevant permissions, and this same dbt project works successfully against Unity Catalog using SQL Warehouses.

So my user has the correct permissions.

I am just trying to run it through a job cluster using the Cluster Endpoint (as documented) instead of SQL Warehouses.

I believe that you managed to get it to work on your end. I guess the next step is to narrow down the difference between my configuration and yours.

KoningJasper commented 2 years ago

@neozenith Ah, that explains it. You cannot run dbt (or any other BI workload) against job clusters; Thrift isn't available on them. Could you try an interactive cluster instead?

We should reject this with a clearer message though. I'll follow up on that.

neozenith commented 2 years ago

I had the same result with an interactive all-purpose cluster too.

I was actually using that first, along with the Web Terminal, to debug the missing profiles.yml and dbt_project.yml until we enabled the Files in Repos feature.

I have raised a separate issue with our Partners contact that we can't enable the Files in Repos feature on Standard or Premium plans. It is only because the client is on an Enterprise plan that the feature toggle appeared. I digress though.

I will double-check the specs of the all-purpose interactive cluster later tonight, but I think it had the same configuration.

So the interactive cluster is trying to use the Thrift connection... I'll look deeper at those docs too to see if I have missed a configuration.

KoningJasper commented 2 years ago

@neozenith Hmmm, that is interesting. I got it to run with this dummy project: https://github.com/KoningJasper/dbt-something. I used the following job and cluster specs:

{
        "name": "dbt-uc",
        "email_notifications": {
            "no_alert_for_skipped_runs": false
        },
        "webhook_notifications": {},
        "timeout_seconds": 0,
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "dbt-uc",
                "dbt_task": {
                    "project_directory": "",
                    "commands": [
                        "dbt deps",
                        "dbt run"
                    ]
                },
                "existing_cluster_id": "0927-120709-po31157w",
                "libraries": [
                    {
                        "pypi": {
                            "package": "dbt-databricks>=1.0.0,<2.0.0"
                        }
                    }
                ],
                "timeout_seconds": 0,
                "email_notifications": {}
            }
        ],
        "git_source": {
            "git_url": "https://github.com/KoningJasper/dbt-something.git",
            "git_provider": "gitHub",
            "git_branch": "main"
        },
        "format": "MULTI_TASK"
    }

{
    "num_workers": 0,
    "cluster_name": "jasper.koning@databricks.com's Cluster",
    "spark_version": "11.1.x-scala2.12",
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*, 4]"
    },
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
        "ebs_volume_count": 0
    },
    "node_type_id": "i3.xlarge",
    "driver_node_type_id": "i3.xlarge",
    "ssh_public_keys": [],
    "custom_tags": {
        "ResourceClass": "SingleNode"
    },
    "spark_env_vars": {},
    "autotermination_minutes": 120,
    "enable_elastic_disk": true,
    "cluster_source": "UI",
    "init_scripts": [],
    "single_user_name": "jasper.koning@databricks.com",
    "data_security_mode": "SINGLE_USER",
    "runtime_engine": "PHOTON",
    "cluster_id": "0927-120709-po31157w"
}

KoningJasper commented 2 years ago

@neozenith wait... Why is your catalog hive_metastore?

neozenith commented 2 years ago

I will number these for ease of following up:

  1. Just finished dinner so going to have another crack at the all purpose cluster.
  2. I'll give that sample project a go.
  3. I noticed you used DBR 11.1 whereas I was using DBR 11.2... so that might be an issue?
  4. We are using hive_metastore since we had existing databases/schemas and that is where they got migrated to when converting to Unity Catalog. This should not be an issue, since it worked totally fine with SQL Warehouses but not with the dbt job.
  5. In your sample project and job, what is the output you get from dbt debug instead of running dbt deps and dbt run?

Thanks for the reproducible project and job spec 👍 The project I am using is just the example from dbt init so far.

KoningJasper commented 2 years ago

@neozenith It should default to the metastore; could you try it with that setting removed (a sketch of the profile without it follows the output below)? I get this output:

+ dbt debug
11:31:58  Running with dbt=1.2.1
dbt version: 1.2.1
python version: 3.9.5
python path: /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python
os info: Linux-5.4.0-1084-aws-x86_64-with-glibc2.31
Using profiles.yml file at /tmp/tmp-dbt-run-140543184104505/profiles.yml
Using dbt_project.yml file at /tmp/tmp-dbt-run-140543184104505/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: e2-dogfood.staging.cloud.databricks.com
  http_path: sql/protocolv1/o/6051921418418893/0927-120709-po31157w
  catalog: lineage
  schema: jasper
  Connection test: [OK connection ok]

All checks passed!
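
To be concrete, a sketch of the profile from earlier with the catalog line removed (everything else unchanged) would be:

dbt_poc:
  target: poc
  outputs:
    poc:
      type: databricks
      # catalog omitted so the connection falls back to the default metastore
      schema: experiments
      connect_retries: 5
      connect_timeout: 180
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
      threads: 8
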
neozenith commented 2 years ago

OK, so I tried again with an all-purpose cluster using DBR 11.1, with no luck.

It turns out hive_metastore has not yet been upgraded to Unity Catalog. 😞 So that is likely part of my issue.

Experiment 1

So I tried another database in the default-created catalog main:

Experiment 2

So I tried a database that does exist:

Which is super weird, because this works fine on our SQL Warehouses... which leads me to think our SQL Warehouses may have more privileges than they should.

Triage Summary

My next step is to get a notebook attached to this cluster with my user and triage what permissions it has against the catalogs and databases.

It comes down to how my user and the catalog permissions are configured. I will post those triage steps once I figure them out. I'll also have to talk to our workspace admin in the morning, as they may be able to shed some light.

It is looking like the outcome of this issue might be me opening a PR against that guide to add my troubleshooting steps.

KoningJasper commented 2 years ago

@neozenith I just tried it on another workspace where I'm not an admin, and it works. 😕 The tmp database already exists and I have CREATE and USAGE permissions on it.

+ dbt debug
12:52:42  Running with dbt=1.2.1
dbt version: 1.2.1
python version: 3.9.5
python path: /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python
os info: Linux-5.4.0-1090-azure-x86_64-with-glibc2.31
Using profiles.yml file at /tmp/tmp-dbt-run-734958738/profiles.yml
Using dbt_project.yml file at /tmp/tmp-dbt-run-734958738/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: <think>.azuredatabricks.net
  http_path: sql/protocolv1/o/<org-id>/<c-id>
  catalog: main
  schema: tmp
  Connection test: [OK connection ok]

All checks passed!

KoningJasper commented 2 years ago

Hi @neozenith, did you have any chance to debug this further and maybe learn some more? I'm very interested in finding out the root cause of this.

neozenith commented 2 years ago

No further progress, sorry. The workspace admins have milestones they need to get out this week, so I am going to spend some time with them next week to continue the triage.

neozenith commented 2 years ago

I managed to triage this a bit further. Thank you for your patience.

Triage

This is the critical triage step:

SHOW GRANTS `josh.peak@mycompany.com` ON SCHEMA hive_metastore.experiments

I ran this on a standard cluster that was failing and also on our SQL Warehouses that were working.

On the SQL Warehouse I could see I was associated with a GROUP that had the USAGE and CREATE permissions.

On the standard cluster I got the following error message:

Error in SQL statement: SparkException: Trying to perform permission action on Hive Metastore /CATALOG/`hive_metastore` but Table Access Control is not enabled on this cluster. 

Next Steps

Enable Table Access Control for our clusters.

https://docs.databricks.com/security/access-control/table-acls/table-acl.html
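
If I am reading that page correctly, on the cluster side this boils down to a couple of Spark config entries. A sketch of the relevant settings, with the key names taken from that doc (shown as YAML for brevity; the same keys go into the cluster JSON spec's spark_conf map, so please verify against the page):

spark_conf:
  # Settings the table ACL docs describe for enabling table access control on a cluster
  spark.databricks.acl.dfAclsEnabled: "true"
  spark.databricks.repl.allowedLanguages: "python,sql"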

I'm pretty sure the missing step is that the user is not actually linked to its permission group on the cluster, whereas we have that linkage set up correctly on the SQL Warehouse endpoint.

Happy to close this issue for now.

If I get the time, I'll open a PR to update the guide with some triage steps once we get this working, and just reference this as the original issue.