databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

dbx does not use credential passthrough #864

Open mathurk1 opened 2 months ago

mathurk1 commented 2 months ago

Expected Behavior

I am working with Azure Databricks. I have a cluster with credential passthrough enabled, which lets me read data stored in ADLS Gen2 using my own identity. I can simply log into the Databricks workspace, attach a notebook to the cluster, and query the Delta tables in ADLS Gen2 without any additional setup.
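For reference, the equivalent read in a notebook attached to that cluster looks roughly like this and works with no credential setup at all (container, account, and path are placeholders):

    # Runs in a notebook on the passthrough cluster; the read is authorized
    # with my own Azure AD identity, no service principal or key configured.
    df = (
        spark.read.format("delta")
        .load("abfss://containername@storageaccount.dfs.core.windows.net/path/to/table")
    )
    display(df.limit(10))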

I would expect that when I submit dbx execute --cluster-id cluster123 --job jobABC to the same cluster, it would be able to read those datasets from ADLS Gen2 using my identity as well.

Thanks!

Current Behavior

Currently, the job fails when I run it with dbx execute against that same cluster, with the following error:

Py4JJavaError: An error occurred while calling o469.load.
: com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen2 Token
        at com.databricks.backend.daemon.data.client.adl.AdlGen2UpgradeCredentialContextTokenProvider.$anonfun$getToken$1(AdlGen2UpgradeCredentialContextTokenProvider.scala:37)
        at scala.Option.getOrElse(Option.scala:189)
        at com.databricks.backend.daemon.data.client.adl.AdlGen2UpgradeCredentialContextTokenProvider.getToken(AdlGen2UpgradeCredentialContextTokenProvider.scala:31)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAccessToken(AbfsClient.java:1371)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:306)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:238)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:211)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:464)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:209)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:1213)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:1194)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:437)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1107)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:901)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:891)

From my understanding, the error means it expects a service principal or storage account keys to be configured.
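For completeness, the kind of setup the error seems to expect is the standard Hadoop ABFS OAuth configuration for a service principal, roughly like the sketch below (client id, tenant, secret scope, and storage account are all placeholders); having to do this would defeat the point of credential passthrough:

    # Rough sketch of service-principal auth for ABFS; all names are placeholders
    # and the secret scope/key are hypothetical.
    account = "storageaccount.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
    spark.conf.set(
        f"fs.azure.account.oauth.provider.type.{account}",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    )
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<application-id>")
    spark.conf.set(
        f"fs.azure.account.oauth2.client.secret.{account}",
        dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
    )
    spark.conf.set(
        f"fs.azure.account.oauth2.client.endpoint.{account}",
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    )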

Steps to Reproduce (for bugs)

  1. clone the charming-aurora repo - https://github.com/gstaubli/dbx-charming-aurora
  2. run dbx configure --token to link the project with the Databricks workspace
  3. add a new job to the conf/deployment.yml file:
      - name: "my-test-job"
        spark_python_task:
          python_file: "file://charming_aurora/tasks/sample_etl_task.py"
          parameters: [ "--conf-file", "file:fuse://conf/tasks/sample_etl_config.yml" ]
  4. update the sample ETL task to read an ADLS Delta table - https://github.com/gstaubli/dbx-charming-aurora/blob/main/charming_aurora/tasks/sample_etl_task.py
    def _write_data(self):
        # modified to read the ADLS Delta table instead of writing
        # (f is pyspark.sql.functions, imported at the top of the module)
        df = (
            self.spark.read.format("delta")
            .load(
                "abfss://containername@storageaccount.dfs.core.windows.net/path/to/table"
            )
            .filter(f.col("date") == "2024-01-01")
        )
        print(df.count())
  5. submit the job - dbx execute --cluster-id=cluster-id-with-credential-passthrough --job my-test-job (a quick passthrough sanity check is sketched right after this list)
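A quick sanity check I run from a notebook on the same cluster to confirm passthrough is actually enabled there (assuming the cluster exposes the spark.databricks.passthrough.enabled setting):

    # Should print "true" on a cluster configured with credential passthrough.
    print(spark.conf.get("spark.databricks.passthrough.enabled", "false"))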

Context

I specifically want to dbx execute against my interactive cluster, not create a job cluster.

Your Environment