databricks / databricks-sdk-go

Databricks SDK for Go
https://docs.databricks.com/dev-tools/sdk-go.html
Apache License 2.0

Unable to use Workload Identity Federation from Azure DevOps pipeline #1025

Open Pim-Mostert opened 3 months ago

Pim-Mostert commented 3 months ago

Describe the issue

I want to deploy a Databricks Asset Bundle from an Azure Pipeline using the Databricks CLI. While authentication for the CLI itself seems to work, the actual deployment does not. It appears that the underlying Terraform provider is not able to authenticate.

The issue in particular appears to arise from our DevOps service connection. The service connection is configured for Workload Identity Federation. When I try an old service connection that authenticates using client credentials, the deployment succeeds.

I suspect the bug may be fixed by simply upgrading the version of Terraform that the Databricks CLI uses under the hood. Currently it uses Terraform 1.5.5. Newer versions of Terraform seem to support the Workload Identity Federation flow. See https://developer.hashicorp.com/terraform/language/settings/backends/azurerm, but note that the 1.5.x version of that same page makes no mention of Workload Identity Federation.

Relevant documentation:

Configuration

# azure-pipelines.yml
variables:
  databricksHost: "https://adb-XXX.azuredatabricks.net"

pool:
  vmImage: "ubuntu-latest"

jobs:
  - job: databricks_asset_bundle
    displayName: "Deploy Databricks Asset Bundle"
    steps:
      - bash: |
          # Install Databricks CLI - see https://learn.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-azure-devops
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

          # Verify installation
          databricks --version

          # Create databricks config file
          # Use $HOME here: a "~" inside double quotes is not expanded by bash
          file="$HOME/.databrickscfg"

          if [ -f "$file" ] ; then
              rm "$file"
          fi

          echo "[DEFAULT]" >> "$file"
          echo "host = $databricksHost" >> "$file"
        displayName: Setup Databricks CLI
      - task: AzureCLI@2
        displayName: Deploy Asset Bundle
        inputs:
          azureSubscription: "my-workload-identity-federation-service-connection"
          addSpnToEnvironment: true
          scriptType: "bash"
          scriptLocation: "inlineScript"
          inlineScript: |
            # As described in https://devblogs.microsoft.com/devops/public-preview-of-workload-identity-federation-for-azure-pipelines/
            export ARM_CLIENT_ID=$servicePrincipalId
            export ARM_OIDC_TOKEN=$idToken
            export ARM_TENANT_ID=$tenantId
            export ARM_SUBSCRIPTION_ID=$(az account show --query id -o tsv)
            export ARM_USE_OIDC=true

            # Databricks authentication itself works fine
            echo ------------- List experiments -------------
            databricks experiments list-experiments

            # But bundle deployment does not
            echo ------------- Deploy bundle -------------
            databricks bundle deploy --log-level=debug --target dev

I have tried various combinations of the ARM_ environment variables above, but I couldn't find a working combination.
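When cycling through combinations like this, it can help to print which ARM_ variables the deploy step actually sees, without leaking their values into the pipeline log. A minimal sketch, assuming a POSIX-style shell; the function name `report_arm_env` is made up for illustration and is not part of any Databricks or Azure tooling:

```shell
# Hypothetical helper: report which ARM_* variables are set, never printing values.
report_arm_env() {
  for arm_name in ARM_CLIENT_ID ARM_TENANT_ID ARM_SUBSCRIPTION_ID \
                  ARM_OIDC_TOKEN ARM_USE_OIDC ARM_CLIENT_SECRET; do
    # Look up the variable named by $arm_name; the names are fixed, so eval is safe here.
    eval "arm_value=\${$arm_name:-}"
    if [ -n "$arm_value" ]; then
      echo "$arm_name=set"
    else
      echo "$arm_name=unset"
    fi
  done
  unset arm_name arm_value
}
```

Calling `report_arm_env` right before `databricks bundle deploy` in the inline script would show whether, for example, `ARM_OIDC_TOKEN` made it into the environment.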

What did work was using a service principal service connection, in combination with:

          addSpnToEnvironment: true
          inlineScript: |
            export ARM_CLIENT_ID=$servicePrincipalId
            export ARM_TENANT_ID=$tenantId
            export ARM_SUBSCRIPTION_ID=$(az account show --query id -o tsv)
            export ARM_CLIENT_SECRET=$servicePrincipalKey

# databricks.yml
bundle:
  name: my_project

variables:
  service_principle:
    description: Service principal used by the DevOps agent
    default: my-service-principle-id

run_as:
  service_principal_name: ${var.service_principle}

# Example resources to deploy
resources:
  experiments:
    my_experiment:
      name: "/Workspace/Users/${var.service_principle}/my_experiment"

targets:
  dev:
    mode: production
    default: true
    workspace:
      host: https://adb-XXX.azuredatabricks.net

Steps to reproduce the behavior

  1. Create a DevOps service connection with Workload Identity Federation
  2. Create an Azure Pipeline with the azure-pipelines.yml above (replace placeholders), using the service connection from 1)
  3. Create a Databricks Asset Bundle with the databricks.yml above (replace placeholders)
  4. Trigger pipeline
  5. Observe error

Expected Behavior

The deployment of the asset bundle should succeed.

Actual Behavior

The following error appears in the pipeline's log:

------------- Deploy bundle -------------
2024/08/27 08:40:59 [DEBUG] GET https://releases.hashicorp.com/terraform/1.5.5/index.json
2024/08/27 08:40:59 [DEBUG] GET https://releases.hashicorp.com/terraform/1.5.5/terraform_1.5.5_SHA256SUMS.72D7468F.sig
2024/08/27 08:40:59 [DEBUG] GET https://releases.hashicorp.com/terraform/1.5.5/terraform_1.5.5_SHA256SUMS
2024/08/27 08:40:59 [DEBUG] GET https://releases.hashicorp.com/terraform/1.5.5/terraform_1.5.5_linux_amd64.zip
Uploading bundle files to /Users/***/.bundle/my_project/dev/files...
Deploying resources...
Updating deployment state...
Deployment complete!
Error: terraform apply: exit status 1

Error: cannot create mlflow experiment: failed during request visitor: default auth: azure-cli: cannot get access token: ERROR: Please run 'az login' to setup account.
. Config: host=https://adb-XXX.azuredatabricks.net,/ azure_client_id=***, azure_tenant_id=XXX. Env: DATABRICKS_HOST, ARM_CLIENT_ID, ARM_TENANT_ID

  with databricks_mlflow_experiment.main,
  on bundle.tf.json line 17, in resource.databricks_mlflow_experiment.main:
  17:       }

Note that the listing of experiments works fine:

------------- List experiments -------------
[
   (expected list of experiments, redacted)
  {
      ...
  },
  ...
]

OS and CLI version

Output by the Azure pipeline:

azure-cli                         2.63.0

core                              2.63.0
telemetry                          1.1.0

Extensions:
azure-devops                       1.0.1

Dependencies:
msal                              1.30.0
azure-mgmt-resource               23.1.1

Databricks CLI: v0.227.0

OS: Ubuntu (Microsoft-hosted agent, latest version)

Is this a regression?

I don't know, I'm new to Databricks.

Debug Logs

See attachment. debug_logs.txt

andrewnester commented 3 months ago

The Databricks CLI depends on the Databricks Go SDK, which recently added support for OIDC; see:

https://github.com/databricks/databricks-sdk-go/pull/965/files#diff-fe44f09c4d5977b5f5eaea29170b6a0748819c9d02271746a20d81a5f3efca17

The configuration you need to provide, though, is ACTIONS_ID_TOKEN_REQUEST_URL and ACTIONS_ID_TOKEN_REQUEST_TOKEN.

Please try changing your GitHub Actions setup to use these variables and see if it works.

Pim-Mostert commented 3 months ago

@andrewnester Thanks for your reply. I'm not using GitHub Actions though, but Azure DevOps Pipelines. It appears your solution applies specifically to GitHub Actions (see e.g. https://library.tf/providers/microsoft/azuredevops/latest/docs/guides/authenticating_service_principal_using_an_oidc_token). For Azure Pipelines, the above page mentions the variables ARM_TENANT_ID, ARM_CLIENT_ID, and ARM_OIDC_TOKEN. These are indeed the ones I have tried, and they do not work.
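The difference between the two CI systems comes down to which variables are present in the environment. A rough sketch of that distinction, assuming a POSIX-style shell; the function name `detect_oidc_source` and the labels it prints are invented for illustration and do not correspond to anything in the Go SDK:

```shell
# Hypothetical helper: guess which OIDC source the environment offers,
# based only on the variable names mentioned in this thread.
detect_oidc_source() {
  if [ -n "${ACTIONS_ID_TOKEN_REQUEST_URL:-}" ] && [ -n "${ACTIONS_ID_TOKEN_REQUEST_TOKEN:-}" ]; then
    # GitHub Actions exposes a URL and a bearer token for requesting an ID token.
    echo "github-actions"
  elif [ -n "${ARM_OIDC_TOKEN:-}" ]; then
    # In the pipeline above, ARM_OIDC_TOKEN is copied from $idToken.
    echo "azure-pipelines"
  else
    echo "none"
  fi
}
```

In the Azure pipeline from the issue description this would print `azure-pipelines`, which is the case the SDK does not handle at the time of this thread.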

andrewnester commented 3 months ago

Ah, indeed, I see. In this case, the Go SDK we rely on for authentication does not yet support OIDC for Azure Pipelines. I'm moving this issue to the Go SDK as a feature request.

andrewnester commented 3 months ago

It also seems to be related to this feature request: https://github.com/databricks/databricks-sdk-go/issues/495

andrewnester commented 3 months ago

@Pim-Mostert What is surprising is that the CLI commands work for you. Could you run this command with the --log-level TRACE flag and share the output?

databricks experiments list-experiments --log-level TRACE

Pim-Mostert commented 3 months ago

@andrewnester Sure:

2024-08-27T11:01:55.1358321Z ------------- List experiments -------------
2024-08-27T11:01:55.1480761Z 11:01:55  INFO start pid=1874 version=0.227.0 args="databricks, experiments, list-experiments, --log-level, TRACE"
2024-08-27T11:01:55.1486300Z 11:01:55 DEBUG Found bundle root at /home/vsts/work/1/s (file /home/vsts/work/1/s/databricks.yml) pid=1874
2024-08-27T11:01:55.1489766Z 11:01:55 DEBUG Apply pid=1874 mutator=load
2024-08-27T11:01:55.1494173Z 11:01:55  INFO Phase: load pid=1874 mutator=load
2024-08-27T11:01:55.1497169Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq
2024-08-27T11:01:55.1501801Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=EntryPoint
2024-08-27T11:01:55.1508168Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=scripts.preinit
2024-08-27T11:01:55.1512861Z 11:01:55 DEBUG No script defined for preinit, skipping pid=1874 mutator=load mutator=seq mutator=scripts.preinit
2024-08-27T11:01:55.1516555Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=ProcessRootIncludes
2024-08-27T11:01:55.1520538Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq
2024-08-27T11:01:55.1524682Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=VerifyCliVersion
2024-08-27T11:01:55.1528227Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=EnvironmentsToTargets
2024-08-27T11:01:55.1531938Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=InitializeVariables
2024-08-27T11:01:55.1541062Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=DefineDefaultTarget(default)
2024-08-27T11:01:55.1544626Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=LoadGitDetails
2024-08-27T11:01:55.1550812Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=PythonMutator(load)
2024-08-27T11:01:55.1554405Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=validate:unique_resource_keys
2024-08-27T11:01:55.1558241Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=SelectDefaultTarget
2024-08-27T11:01:55.1561794Z 11:01:55 DEBUG Apply pid=1874 mutator=load mutator=seq mutator=SelectDefaultTarget mutator=SelectTarget(dev)
2024-08-27T11:01:55.1566466Z 11:01:55 TRACE Loading config via environment pid=1874 sdk=true
2024-08-27T11:01:55.1569451Z 11:01:55 TRACE Loading config via resolve-profile-from-host pid=1874 sdk=true
2024-08-27T11:01:55.1573469Z 11:01:55 TRACE Attempting to configure auth: pat pid=1874 sdk=true
2024-08-27T11:01:55.1575987Z 11:01:55 TRACE Attempting to configure auth: basic pid=1874 sdk=true
2024-08-27T11:01:55.1579964Z 11:01:55 TRACE Attempting to configure auth: oauth-m2m pid=1874 sdk=true
2024-08-27T11:01:55.1582931Z 11:01:55 TRACE Attempting to configure auth: databricks-cli pid=1874 sdk=true
2024-08-27T11:01:55.1586293Z 11:01:55 DEBUG Running command: /usr/local/bin/databricks auth token --host https://adb-XXX.azuredatabricks.net pid=1874 sdk=true
2024-08-27T11:01:55.1708113Z 11:01:55 TRACE Attempting to configure auth: metadata-service pid=1874 sdk=true
2024-08-27T11:01:55.1708871Z 11:01:55 TRACE Attempting to configure auth: github-oidc-azure pid=1874 sdk=true
2024-08-27T11:01:55.1709975Z 11:01:55 DEBUG Missing cfg.ActionsIDTokenRequestURL, likely not calling from a Github action pid=1874 sdk=true
2024-08-27T11:01:55.1710641Z 11:01:55 TRACE Attempting to configure auth: azure-msi pid=1874 sdk=true
2024-08-27T11:01:55.1711667Z 11:01:55 TRACE Attempting to configure auth: azure-client-secret pid=1874 sdk=true
2024-08-27T11:01:55.1712348Z 11:01:55 TRACE Attempting to configure auth: azure-cli pid=1874 sdk=true
2024-08-27T11:01:55.1713620Z 11:01:55 DEBUG Running command: az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d --output json --tenant XXX pid=1874 sdk=true
2024-08-27T11:01:55.8769849Z 11:01:55  INFO Refreshed OAuth token for 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d for tenant XXX from Azure CLI, which expires on 2024-08-27 12:01:54.000000 pid=1874 sdk=true
2024-08-27T11:01:55.8777969Z 11:01:55 DEBUG Running command: az account get-access-token --resource https://management.core.windows.net/ --output json --tenant XXX pid=1874 sdk=true
2024-08-27T11:01:56.3681975Z 11:01:56  INFO Refreshed OAuth token for https://management.core.windows.net/ for tenant XXX from Azure CLI, which expires on 2024-08-27 12:01:51.000000 pid=1874 sdk=true
2024-08-27T11:01:56.3688434Z 11:01:56  INFO Using Azure CLI authentication with AAD tokens pid=1874 sdk=true
2024-08-27T11:01:57.6325404Z 11:01:57 DEBUG GET /api/2.0/mlflow/experiments/list
2024-08-27T11:01:57.6326186Z < HTTP/2.0 200 OK
2024-08-27T11:01:57.6326679Z < {
2024-08-27T11:01:57.6326983Z <   "experiments": [
2024-08-27T11:01:57.6327560Z <     {
                                    REDACTED
2024-08-27T11:01:57.6433239Z <     "... (5 additional elements)"
2024-08-27T11:01:57.6433371Z <   ]
2024-08-27T11:01:57.6433481Z < } pid=1874 sdk=true
2024-08-27T11:01:57.6433605Z [
2024-08-27T11:01:57.6433699Z   {
2024-08-27T11:01:57.6433962Z     "artifact_location":XXX
2024-08-27T11:01:57.6434160Z     "creation_time": XXX
2024-08-27T11:01:57.6434310Z     "experiment_id": XXX
2024-08-27T11:01:57.6434477Z     "last_update_time": XXX
2024-08-27T11:01:57.6434691Z     "lifecycle_stage": XXX
2024-08-27T11:01:57.6435001Z     "name": XXX
2024-08-27T11:01:57.6435201Z     "tags": [
2024-08-27T11:01:57.6435303Z       {
2024-08-27T11:01:57.6435453Z         "key": "mlflow.experiment.sourceName",
2024-08-27T11:01:57.6435772Z         "value": XXX
2024-08-27T11:01:57.6435968Z       },
2024-08-27T11:01:57.6436080Z       {
2024-08-27T11:01:57.6436193Z         "key": "mlflow.ownerId",
2024-08-27T11:01:57.6436341Z         "value": "587479253565293"
2024-08-27T11:01:57.6436459Z       },
2024-08-27T11:01:57.6436639Z       {
2024-08-27T11:01:57.6436756Z         "key": "mlflow.ownerEmail",
2024-08-27T11:01:57.6436923Z         "value": "XXX"
2024-08-27T11:01:57.6437069Z       },
2024-08-27T11:01:57.6437166Z       {
2024-08-27T11:01:57.6437299Z         "key": "mlflow.experimentType",
2024-08-27T11:01:57.6437433Z         "value": "NOTEBOOK"
2024-08-27T11:01:57.6437559Z       }
2024-08-27T11:01:57.6437654Z     ]
2024-08-27T11:01:57.6437763Z   },
                                ... REDACTED
2024-08-27T11:01:57.6483974Z ]
2024-08-27T11:01:57.6484127Z 11:01:57  INFO completed execution pid=1874 exit_code=0

andrewnester commented 3 months ago

Ah, I see. CLI auth works because it eventually falls back to the azure-cli auth type rather than OIDC, so it might not be what you expect anyway.
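To see that fallback order without reading the whole TRACE output, the "Attempting to configure auth" lines can be filtered out of the log. A small sketch, assuming the log format shown above; `list_auth_attempts` is a made-up name:

```shell
# Hypothetical helper: read a TRACE log on stdin and print each auth type
# the SDK attempted, one per line, in the order they appear.
list_auth_attempts() {
  sed -n 's/.*Attempting to configure auth: \([a-z0-9-]*\).*/\1/p'
}
```

For example, `databricks experiments list-experiments --log-level TRACE 2>&1 | list_auth_attempts`. In the log above, the last type listed (azure-cli) is the one that ended up being used.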

So to summarise:

  1. To be able to use OIDC in Azure Pipelines, a change needs to be made in the Go SDK, hence this ticket.
  2. The issue that the CLI authenticates with the azure-cli type but bundles fail to do so is a separate one. It might be caused by an omission on the bundles side, where we don't pass all the necessary env variables. If this is an issue for you, please feel free to open a separate ticket for it in the Databricks CLI repo.
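Until either item is resolved, one possible stopgap (not proposed or tested in this thread) is to fetch the token with the Azure CLI yourself and hand it to the tooling as `DATABRICKS_TOKEN`. The sketch below reuses the `az account get-access-token` call and resource ID visible in the TRACE log; it assumes the workspace accepts a Microsoft Entra ID access token where a personal access token is expected, and `export_databricks_token` is a made-up name:

```shell
# Resource ID of the Azure Databricks application, as it appears in the TRACE log.
DATABRICKS_RESOURCE_ID="2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

# Hypothetical helper: fetch an access token via the Azure CLI and export it
# as DATABRICKS_TOKEN. Returns non-zero if az fails or prints nothing.
export_databricks_token() {
  token="$(az account get-access-token \
    --resource "$DATABRICKS_RESOURCE_ID" \
    --query accessToken --output tsv)" || return 1
  [ -n "$token" ] || return 1
  export DATABRICKS_TOKEN="$token"
  unset token
}
```

It would be called inside the AzureCLI@2 inline script, before `databricks bundle deploy`. Judging by the expiry times in the log, such a token lasts about an hour, so this only suits short deployments.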

Thank you!

Pim-Mostert commented 3 months ago

It's not an issue for me right now, but I expect it will be in the near future (when my company disables the old service connection). I've opened a new issue: https://github.com/databricks/cli/issues/1722

Please let me know if you need more information.

Thanks!