databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] v1.10+ Azure MSI auth through Go SDK is broken with user-assigned identity with permissions on subscription and Account Administrator in Databricks Account #2057

Closed aihw-jimsolomos closed 1 year ago

aihw-jimsolomos commented 1 year ago

Configuration

provider "azurerm" {
  features {}
}

data "azurerm_databricks_workspace" "default" {
  name                = var.WORKSPACENAME
  resource_group_name = var.WORKSPACERESOURCEGROUPNAME
}

provider "databricks" {
  host                        = data.azurerm_databricks_workspace.default.workspace_url
  azure_workspace_resource_id = data.azurerm_databricks_workspace.default.id
  # ARM_USE_MSI environment variable is recommended
  azure_use_msi = true

}
provider "databricks" {
  alias      = "mws"
  host       = var.DATABRICKS_AZ_ACCOUNT_URL
  account_id = var.DATABRICKS_AZ_ACCOUNT_ID
  # ARM_USE_MSI environment variable is recommended
  azure_use_msi = true
}

module "assignmetastore" {
  source                      = "./modules/metastores"
  workspace_id                = data.azurerm_databricks_workspace.default.workspace_id
  METASTORE_NAME              = var.METASTORE_NAME
  METASTORE_STORAGE_ROOT      = var.METASTORE_STORAGE_ROOT
  METASTORE_OWNER             = var.METASTORE_OWNER
  STORAGE_ACCESS_CONNECTOR_ID = var.STORAGE_ACCESS_CONNECTOR_ID

}

# Content of modules/metastores/main.tf
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.45.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "1.11.0"

    }
  }
}

resource "databricks_metastore" "this" {
  name          = var.METASTORE_NAME
  storage_root  = var.METASTORE_STORAGE_ROOT
  owner         = var.METASTORE_OWNER
  force_destroy = true

}

resource "databricks_metastore_data_access" "first" {
  metastore_id = databricks_metastore.this.id
  name         = regex("[^\\/]+$", var.STORAGE_ACCESS_CONNECTOR_ID)
  azure_managed_identity {
    access_connector_id = var.STORAGE_ACCESS_CONNECTOR_ID
  }

  depends_on = [
    databricks_metastore.this
  ]

  is_default = true
}

resource "databricks_metastore_assignment" "this" {
  metastore_id = databricks_metastore.this.id
  workspace_id = var.workspace_id

  depends_on = [
    databricks_metastore_data_access.first
  ]

}

Expected Behavior

Prior to 1.10.0 (1.9.2, for example), it appears that you could reference groups at the account level. In my current example I have created a metastore admins group that I want to be the metastore owner. This variable is passed to the resource, but the plan fails with the error below.

Actual Behavior

Since 1.10.0 this has failed. Note that this example is only a slight modification of the examples from the documentation.

Steps to Reproduce

  1. Create an account-level group within Databricks.
  2. Try to create a metastore with the above code, referencing the existing group.

Terraform and provider versions

Azure and Terraform are the latest version

terraform 1.3.9 databricks 1.11.0 azureRM 3.45.0

Debug Output

-Ms-Routing-Request-Id: AUSTRALIAEAST:20230301T213219Z:c8fe5feb-d0d4-4d54-ab3e-c5ed02691976

{"properties":{"privateEndpointConnections":[{"id":"/subscriptions/deleted/resourceGroups/dev-disability-databricks-rg/providers/Microsoft.Databricks/workspaces/dev-disability-databricksResearchEnv/privateEndpointConnections/dev-disability-databricksResearchEnv-private-endpoint","name":"dev-disability-databricksResearchEnv-private-endpoint","type":"Microsoft.Databricks/workspaces/privateEndpointConnections","properties":{"privateEndpoint":{"id":"/subscriptions/deleted/resourceGroups/dev-disability-databricks-rg/providers/Microsoft.Network/privateEndpoints/dev-disability-databricksResearchEnv-private-endpoint"},"groupIds":["databricks_ui_api"],"privateLinkServiceConnectionState":{"status":"Approved","description":"Auto-approved","actionsRequired":"None"},"provisioningState":"Succeeded"}}],"publicNetworkAccess":"Enabled","requiredNsgRules":"NoAzureDatabricksRules","managedResourceGroupId":"/subscriptions/deleted/resourceGroups/dev-disability-databricksResearchEnv-rg","parameters":{"customPrivateSubnetName":{"type":"String","value":"DatabricksProductSubnetPrivate"},"customPublicSubnetName":{"type":"String","value":"DatabricksProductSubnetPublic"},"customVirtualNetworkId":{"type":"String","value":"/subscriptions/deleted/resourceGroups/dev-network-rg/providers/Microsoft.Network/virtualNetworks/dev-vnet"},"enableFedRampCertification":{"type":"Bool","value":false},"enableNoPublicIp":{"type":"Bool","value":true},"natGatewayName":{"type":"String","value":"nat-gateway"},"prepareEncryption":{"type":"Bool","value":false},"publicIpName":{"type":"String","value":"nat-gw-public-ip"},"requireInfrastructureEncryption":{"type":"Bool","value":false},"resourceTags":{"type":"Object","value":{"application":"databricks","databricks-environment":"true","Owner":"Data Management and Analytics ","Project":"Data Management and Analytics","Environment":"dev","Name":"dev"}},"storageAccountName":{"type":"String","value":"dbstoragef5n4m3fvilzcu"},"storageAccountSkuName":{"type":"String","value":"Standard_GRS"},"vnetAddressPrefix":{"type":"String","value":"10.139"}},"provisioningState":"Succeeded","authorizations":[{"principalId":"9a74af6f-d153-4348-988a-e2672920bee9","roleDefinitionId":"8e3af657-a8ff-443c-a75c-2fe8c4bcb635"}],"createdBy":{"oid":"5ac85ca7-2fde-4827-a661-f9a93ae6b516","applicationId":"a5e17c8e-f882-4e04-bd42-64c16af26df8"},"updatedBy":{"oid":"5ac85ca7-2fde-4827-a661-f9a93ae6b516","applicationId":"a5e17c8e-f882-4e04-bd42-64c16af26df8"},"workspaceId":"1353120338516096","workspaceUrl":"adb-1353120338516096.16.azuredatabricks.net","createdDateTime":"2023-01-23T07:43:34.4083755Z"},"id":"/subscriptions/deleted/resourceGroups/dev-disability-databricks-rg/providers/Microsoft.Databricks/workspaces/dev-disability-databricksResearchEnv","name":"dev-disability-databricksResearchEnv","type":"Microsoft.Databricks/workspaces","sku":{"name":"premium"},"location":"australiaeast","tags":{"Owner":"Data Management and Analytics ","Project":"Data Management and Analytics","Environment":"dev","Name":"dev"}}: timestamp=2023-03-01T21:32:19.690Z
data.azurerm_databricks_workspace.default: Read complete after 1s [id=/subscriptions/deleted/resourceGroups/dev-disability-databricks-rg/providers/Microsoft.Databricks/workspaces/dev-disability-databricksResearchEnv]
2023-03-01T21:32:19.693Z [DEBUG] created provider logger: level=debug
2023-03-01T21:32:19.693Z [INFO] provider: configuring client automatic mTLS
2023-03-01T21:32:19.705Z [DEBUG] provider: starting plugin: path=.terraform/providers/registry.terraform.io/databricks/databricks/1.11.0/linux_amd64/terraform-provider-databricks_v1.11.0 args=[.terraform/providers/registry.terraform.io/databricks/databricks/1.11.0/linux_amd64/terraform-provider-databricks_v1.11.0]
2023-03-01T21:32:19.705Z [DEBUG] provider: plugin started: path=.terraform/providers/registry.terraform.io/databricks/databricks/1.11.0/linux_amd64/terraform-provider-databricks_v1.11.0 pid=1784
2023-03-01T21:32:19.705Z [DEBUG] provider: waiting for RPC address: path=.terraform/providers/registry.terraform.io/databricks/databricks/1.11.0/linux_amd64/terraform-provider-databricks_v1.11.0
2023-03-01T21:32:19.717Z [DEBUG] provider.terraform-provider-databricks_v1.11.0: Databricks Terraform Provider
2023-03-01T21:32:19.717Z [DEBUG] provider.terraform-provider-databricks_v1.11.0:
2023-03-01T21:32:19.717Z [DEBUG] provider.terraform-provider-databricks_v1.11.0: Version 1.11.0
2023-03-01T21:32:19.717Z [DEBUG] provider.terraform-provider-databricks_v1.11.0:
2023-03-01T21:32:19.717Z [DEBUG] provider.terraform-provider-databricks_v1.11.0: https://registry.terraform.io/providers/databricks/databricks/latest/docs
2023-03-01T21:32:19.717Z [DEBUG] provider.terraform-provider-databricks_v1.11.0:
2023-03-01T21:32:19.720Z [INFO] provider.terraform-provider-databricks_v1.11.0: configuring server automatic mTLS: timestamp=2023-03-01T21:32:19.718Z
2023-03-01T21:32:19.757Z [DEBUG] provider.terraform-provider-databricks_v1.11.0: plugin address: address=/tmp/plugin2621197164 network=unix timestamp=2023-03-01T21:32:19.757Z
2023-03-01T21:32:19.758Z [DEBUG] provider: using plugin: version=5
2023-03-01T21:32:19.795Z [WARN] ValidateProviderConfig from "provider[\"registry.terraform.io/databricks/databricks\"]" changed the config value, but that value is unused
2023-03-01T21:32:19.803Z [INFO] provider.terraform-provider-databricks_v1.11.0: Explicit and implicit attributes: azure_client_id, azure_client_secret, azure_tenant_id, azure_workspace_resource_id, host: timestamp=2023-03-01T21:32:19.802Z
2023-03-01T21:32:19.811Z [INFO] ReferenceTransformer: reference not found: "var.METASTORE_OWNER"
2023-03-01T21:32:19.811Z [INFO] ReferenceTransformer: reference not found: "var.METASTORE_NAME"
2023-03-01T21:32:19.811Z [INFO] ReferenceTransformer: reference not found: "var.METASTORE_STORAGE_ROOT"
2023-03-01T21:32:19.811Z [DEBUG] ReferenceTransformer: "module.assignmetastore.databricks_metastore.this" references: []
module.assignmetastore.databricks_metastore.this: Refreshing state... [id=473daebd-abc8-4989-9840-b959cd17a4d4]
2023-03-01T21:32:19.828Z [DEBUG] provider.terraform-provider-databricks_v1.11.0: Generating AAD token via Azure MSI: timestamp=2023-03-01T21:32:19.828Z
2023-03-01T21:32:19.845Z [ERROR] provider.terraform-provider-databricks_v1.11.0: Response contains error diagnostic: tf_rpc=ReadResource @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:55 @module=sdk.proto diagnostic_severity=ERROR diagnostic_summary="cannot read metastore: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}" tf_resource_type=databricks_metastore diagnostic_detail= tf_proto_version=5.3 tf_provider_addr=registry.terraform.io/databricks/databricks tf_req_id=f96ff394-f9f8-4869-dff4-8e76998ea7aa timestamp=2023-03-01T21:32:19.845Z
2023-03-01T21:32:19.846Z [ERROR] vertex "module.assignmetastore.databricks_metastore.this" error: cannot read metastore: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T21:32:19.846Z [ERROR] vertex "module.assignmetastore.databricks_metastore.this (expand)" error: cannot read metastore: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T21:32:19.848Z [INFO] backend/local: plan operation completed
╷
│ Error: cannot read metastore: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
│
│   with module.assignmetastore.databricks_metastore.this,
│   on modules/metastores/main.tf line 17, in resource "databricks_metastore" "this":
│   17: resource "databricks_metastore" "this" {
│
╵

Important Factoids

running in AustraliaEast

aihw-jimsolomos commented 1 year ago

2052 -- similar issue.

nfx commented 1 year ago

Please also state:

aihw-jimsolomos commented 1 year ago

MSI type: user-assigned. It is the DevOps agent service connection user, which has the permissions within the subscription. The SP is also assigned "Account administrator" within the Databricks account.

Environment variables

DATABRICKS_AZ_ACCOUNT_URL = "https://accounts.azuredatabricks.net"
DATABRICKS_AZ_ACCOUNT_ID  = "I can email this if required"
METASTORE_NAME         = "primary"
METASTORE_STORAGE_ROOT = "abfss://unitycatalog@<can email if required>.dfs.core.windows.net/"
METASTORE_OWNER        = "metastoreadmins"
WORKSPACENAME = "dev-disability-databricksResearchEnv"
WORKSPACERESOURCEGROUPNAME = "dev-disability-databricks-rg"
STORAGE_ACCESS_CONNECTOR_ID = "/subscriptions/<email me if required>/resourceGroups/dev-storage-rg/providers/Microsoft.Databricks/accessConnectors/dev-unitycatalogAccessConnector"

Type of compute is ubuntu 20.04 -- current version here https://github.com/actions/runner-images/blob/ubuntu22/20230219.1/images/linux/Ubuntu2204-Readme.md

nfx commented 1 year ago

@aihw-jimsolomos , good. Thanks for the detail! I'll take a look

btw, you can hard-code "https://accounts.azuredatabricks.net" as the host for account-level provider and use DATABRICKS_ACCOUNT_ID environment variable for it to be picked up automatically.
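A minimal sketch of what this suggestion could look like, assuming the account ID is exported as DATABRICKS_ACCOUNT_ID in the shell or pipeline that runs Terraform. The alias "mws" and the auth setting are simply carried over from the configuration in the original report:

```hcl
# Account-level provider with the host hard-coded.
# account_id is intentionally omitted: the provider is expected to pick it up
# from the DATABRICKS_ACCOUNT_ID environment variable.
provider "databricks" {
  alias = "mws"
  host  = "https://accounts.azuredatabricks.net"

  # Auth setting unchanged from the original configuration above.
  azure_use_msi = true
}
```

This removes the need for the DATABRICKS_AZ_ACCOUNT_URL and DATABRICKS_AZ_ACCOUNT_ID input variables on the account-level provider.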

aihw-jimsolomos commented 1 year ago

I was able to locally reproduce this issue on Windows (as my employer doesn't provide Linux machines)

When configuring the databricks provider, you can pass in the user-provisioned client details like so. As far as I can tell, this runs the provider using the managed identity (user-assigned).

provider "databricks" {
  host                        = data.azurerm_databricks_workspace.default.workspace_url
  azure_workspace_resource_id = data.azurerm_databricks_workspace.default.id
  # ARM_USE_MSI environment variable is recommended
  azure_use_msi       = true
  azure_client_id     = "<SUPER_SECRET_SECRET>"
  azure_client_secret = "<SUPER_SECRET_SECRET>"
  azure_tenant_id     = "<SUPER_SECRET_SECRET>"

}
provider "databricks" {
  alias      = "mws"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.DATABRICKS_AZ_ACCOUNT_ID
  # ARM_USE_MSI environment variable is recommended
  azure_use_msi       = true
  azure_client_id     = "<SUPER_SECRET_SECRET>"
  azure_client_secret = "<SUPER_SECRET_SECRET>"
  azure_tenant_id     = "<SUPER_SECRET_SECRET>"

}

This should make it possible to debug without deploying a whole pipeline system. My troubleshooting skills are fairly weak, as I am very new to Terraform. If I get some time next week I will try to teach myself more.

aihw-jimsolomos commented 1 year ago

Hi @nfx I think I have found the issue with the move to Go SDK

I used Fiddler to examine the difference between 1.9.2 and 1.10 (1.15.1 in this case).

The "Identity not found" error is coming from the "/metadata/identity/oauth2/token" service that is hosted on the virtual machine.

What appears to have happened is that in 1.9.2 the authentication provider used the old ADAL endpoint, https://login.microsoftonline.com/<tenant>/oauth2/token.

That call was sending something like

POST https://login.microsoftonline.com/c2d40835-0130-4bcf-8be3-7ba19466d3b3/oauth2/token HTTP/1.1
Host: login.microsoftonline.com
User-Agent: Go/go1.18.10 (amd64-windows) go-autorest/adal/v1.0.0
Content-Length: 177
Content-Type: application/x-www-form-urlencoded
Cookie: fpc=<deleted>; x-ms-gateway-slice=estsfd; stsservicecookie=estsfd
Accept-Encoding: gzip

client_id=<SUPERSECRET>&client_secret=<SUPERSECRET>&grant_type=client_credentials&resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d <-------- Databricks resource ID


From this we can see that the process has been passed the client secret.

However, with the upgraded Go SDK we can see that the process is using a different set of APIs, which query the local "/metadata/identity/oauth2/token" service rather than login.microsoftonline.com.

This API uses the local metadata service that is tied directly to the VM, rather than using credentials supplied to the process (for example on the command line).

Now, this wouldn't actually be a problem if I were using a VM that had the managed identity assigned to it. What I am doing, however, is running my pipelines on Microsoft-managed Azure DevOps agents. These agents may have a service connection, but they don't get the managed service identity.

It actually turns out that you are required to use a self-hosted agent on an Azure VM in order to use managed service identity. https://learn.microsoft.com/en-us/azure/devops/pipelines/library/connect-to-azure?view=azure-devops#create-an-azure-resource-manager-service-connection-to-a-vm-with-a-managed-service-identity

So, long story short, it was just luck that it worked previously, because of how the old-style login was being used.

To test this theory, I will set up a managed VM. Will keep you posted.

@camilo-s you might be in the same boat?

hungnguyen10897 commented 1 year ago

Hi,

We're also facing the same issue with Databricks provider version >1.9.2

We run Terraform pipelines also from ADO Agents hosted on our AKS cluster (self-hosted agents). The cluster is assigned a User Assigned Identity with Subscription contributor and Databricks Account Admin role (through aad-pod-identity).

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "1.14.3"
    }
  }
  required_version = ">= 1.3.0"
}

provider "databricks" {
  host          = "https://accounts.azuredatabricks.net"
  account_id    = "<DATABRICKS_ACCOUNT_ID>"
  azure_use_msi = true
  # auth_type = "azure-msi"
}

data "databricks_user" "example" {
  user_name = "example_user"
}

output "test" {
  value = data.databricks_user.example.id
}

Error:

│ Error: default auth: azure-cli: cannot get access token: ERROR: Please run 'az login' to setup account.
│ . Config: host=https://accounts.azuredatabricks.net, account_id=<DATABRICKS_ACCOUNT_ID>, azure_use_msi=true
│
│   with data.databricks_user.example,
│   on user.tf line 1, in data "databricks_user" "example":
│    1: data "databricks_user" "example" {
and if I uncomment auth_type = "azure-msi", error:

│ Error: default auth: cannot configure default credentials. Config: host=https://accounts.azuredatabricks.net, account_id=<DATABRICKS_ACCOUNT_ID>, azure_use_msi=true
│
│   with data.databricks_user.hung,
│   on user.tf line 1, in data "databricks_user" "example":
│    1: data "databricks_user" "example" {
│
aihw-jimsolomos commented 1 year ago

@hungnguyen10897 These look to be two fairly different problems.

I would recommend opening a separate bug report. In my case I am confident that MSI should never have worked, whereas in yours it should work, and our error messages are fairly different.

aihw-jimsolomos commented 1 year ago

@nfx I am going to close this, as I have realised that my Microsoft hosted agent was always using client secret rather than MSI.

auth_type = "azure-client-secret"
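For anyone landing here from a Microsoft-hosted agent, a minimal sketch of pinning the account-level provider to client-secret auth instead of MSI. The three azure_* input variables are hypothetical names introduced for illustration; var.DATABRICKS_AZ_ACCOUNT_ID is assumed to be declared as in the original configuration. In a pipeline the credentials would more likely come from the ARM_CLIENT_ID, ARM_CLIENT_SECRET and ARM_TENANT_ID environment variables, in which case the three azure_* arguments can be dropped:

```hcl
# Hypothetical input variables holding the service principal credentials.
variable "azure_client_id" { type = string }
variable "azure_tenant_id" { type = string }
variable "azure_client_secret" {
  type      = string
  sensitive = true # keep the secret out of plan output
}

provider "databricks" {
  alias      = "mws"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.DATABRICKS_AZ_ACCOUNT_ID

  # Pin the auth method explicitly so the provider does not try MSI,
  # which is not available on Microsoft-hosted agents.
  auth_type           = "azure-client-secret"
  azure_client_id     = var.azure_client_id
  azure_client_secret = var.azure_client_secret
  azure_tenant_id     = var.azure_tenant_id
}
```

Note that azure_use_msi is deliberately left out here, since it would conflict with the intent of authenticating via client secret.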

But it was good to discover why the SDK upgrade "broke" this.