databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] Issue with `databricks_permissions` resource since `1.10.0` #2052

Closed. thaiphv closed this issue 1 year ago

thaiphv commented 1 year ago

Configuration

We have many databricks_permissions resources in our templates. They all look similar to the sample below:

resource "databricks_notebook" "userlogs" {
  content_base64 = base64encode(<<-EOT
    logs = spark.read.option("header", True).csv("/mnt/UserLogs/userlogs_notebook.csv")
    display(logs)
    EOT
  )
  path     = "/Workspace/Audit_Logs/UserLogs_Notebook"
  language = "PYTHON"
}

resource "databricks_permissions" "userlogs" {
  directory_path = "/Workspace/Audit_Logs"

  access_control {
    group_name       = "workspace_admins"
    permission_level = "CAN_EDIT"
  }

  depends_on = [
    databricks_notebook.userlogs,
  ]
}

Expected Behavior

Before 1.10.0, the templates could be applied without any errors, and all the resources were created successfully.

Actual Behavior

Since 1.10.0, Terraform fails to apply and shows the following errors (please ignore the log prefixes; they come from my CI/CD system):

14:11:48   Info     |       module.workspace_team_data_science.databricks_permissions.cluster_usage[0]: Refreshing state... [id=/clusters/0708-013148-c1p0ylun]
14:11:48   Info     |       module.workspace_project_breach_reporting_data_publication.databricks_permissions.cluster_usage[0]: Refreshing state... [id=/clusters/0706-073051-s5qacrut]
14:11:48   Info     |       module.workspace_project_breach_reporting_data_publication.aws_ssm_parameter.cluster_id: Refreshing state... [id=/preprod/datalake/databricks/cluster-id/project-stratpolbrreportingdatapubl]
14:11:48   Info     |       module.workspace_project_audit_logs.aws_ssm_parameter.cluster_id: Refreshing state... [id=/preprod/datalake/databricks/cluster-id/project-auditlogs]
14:11:48   Info     |       module.s3_mount.databricks_job.s3_mount: Refreshing state... [id=270265880191491]
14:12:03   Error    |       Error: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
14:12:03   Error    |       with module.workspace_team_strategic_intelligence.databricks_permissions.cluster_usage[0],
14:12:03   Error    |       on modules/workspace/databricks_clusters.tf line 61, in resource "databricks_permissions" "cluster_usage":
14:12:03   Error    |       61: resource "databricks_permissions" "cluster_usage" {
14:12:03   Error    |       Error: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
14:12:03   Error    |       with module.workspace_team_wealth_management.databricks_permissions.cluster_usage[0],
14:12:03   Error    |       on modules/workspace/databricks_clusters.tf line 61, in resource "databricks_permissions" "cluster_usage":
14:12:03   Error    |       61: resource "databricks_permissions" "cluster_usage" {
14:12:03   Error    |       Error: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
14:12:03   Error    |       with module.workspace_team_nlp_registered_liquidators.databricks_permissions.cluster_usage[0],
14:12:03   Error    |       on modules/workspace/databricks_clusters.tf line 61, in resource "databricks_permissions" "cluster_usage":
14:12:03   Error    |       61: resource "databricks_permissions" "cluster_usage" {
14:12:03   Error    |       Error: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
14:12:03   Error    |       with module.workspace_team_data_analytics_platforms.databricks_permissions.workspace[0],
14:12:03   Error    |       on modules/workspace/databricks_notebooks.tf line 13, in resource "databricks_permissions" "workspace":
14:12:03   Error    |       13: resource "databricks_permissions" "workspace" {
14:12:03   Error    |       Error: cannot read permissions: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
14:12:03   Error    |       with module.workspace_ml_reportable_situations.module.workspace_publish.databricks_permissions.workspace[0],
14:12:03   Error    |       on modules/workspace/databricks_notebooks.tf line 13, in resource "databricks_permissions" "workspace":
14:12:03   Error    |       13: resource "databricks_permissions" "workspace" {
14:12:03   Error    |       Error: cannot read permissions: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
14:12:03   Error    |       with module.workspace_project_information_resource_centre.databricks_permissions.workspace[0],
14:12:03   Error    |       on modules/workspace/databricks_notebooks.tf line 13, in resource "databricks_permissions" "workspace":
14:12:03   Error    |       13: resource "databricks_permissions" "workspace" {
14:12:03   Error    |       Error: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
14:12:03   Error    |       with module.workspace_project_asic_property.databricks_permissions.userlogs[0],
14:12:03   Error    |       on modules/workspace/databricks_notebooks.tf line 38, in resource "databricks_permissions" "userlogs":
14:12:03   Error    |       38: resource "databricks_permissions" "userlogs" {

Steps to Reproduce

  1. terraform init
  2. terraform plan or terraform apply.

Terraform and provider versions

Debug Output

Nothing stands out in the debug logs. All the error messages were similar to the output above.

Important Factoids

It looks like the migration to the Databricks SDK has changed the way the SCIM API is used.

aihw-jimsolomos commented 1 year ago

I am facing a similar issue when provisioning access to cluster policies.

The issue occurs only on 1.10.0 and 1.10.1. I upgraded because I was seeing other issues that I wanted to reproduce on the latest version before raising an issue.

I am available 10am-5pm (UTC+10) if you would like to chat in real time.

Example code:

resource "databricks_cluster_policy" "cluster_policy" {
  name       = var.CLUSTER_POLICY_NAME
  definition = jsonencode(var.CLUSTER_POLICY_DEF)
}

resource "databricks_permissions" "can_use_cluster_policyinstance_profile" {
  cluster_policy_id = databricks_cluster_policy.cluster_policy.id
  access_control {
    group_name       = var.POLICY_USER_GROUP
    permission_level = "CAN_USE"
  }
  depends_on = [
    databricks_cluster_policy.cluster_policy
  ]
}

I am using the following providers

    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.45.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "1.10.1"
    }

2023-03-01T04:53:20.533Z [ERROR] provider.terraform-provider-databricks_v1.10.1: Response contains error diagnostic: diagnostic_detail= diagnostic_summary="inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}" tf_provider_addr=registry.terraform.io/databricks/databricks tf_rpc=PlanResourceChange @module=sdk.proto diagnostic_severity=ERROR tf_proto_version=5.3 tf_req_id=fd335b0e-9fe5-ab1d-d4b5-9e84314e3b72 tf_resource_type=databricks_permissions @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:55 timestamp=2023-03-01T04:53:20.532Z
2023-03-01T04:53:20.533Z [ERROR] vertex "module.cluster_compute_policy[\"high_memory_cluster\"].databricks_permissions.can_use_cluster_policyinstance_profile" error: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T04:53:20.534Z [ERROR] provider.terraform-provider-databricks_v1.10.1: Response contains error diagnostic: tf_provider_addr=registry.terraform.io/databricks/databricks @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:55 @module=sdk.proto tf_proto_version=5.3 tf_resource_type=databricks_permissions tf_rpc=PlanResourceChange diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}" tf_req_id=daf7aaff-9742-782e-30c4-8321d04ecb08 timestamp=2023-03-01T04:53:20.534Z
2023-03-01T04:53:20.534Z [ERROR] vertex "module.cluster_compute_policy[\"general_purpose_cluster\"].databricks_permissions.can_use_cluster_policyinstance_profile" error: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T04:53:20.536Z [ERROR] provider.terraform-provider-databricks_v1.10.1: Response contains error diagnostic: diagnostic_severity=ERROR tf_proto_version=5.3 tf_provider_addr=registry.terraform.io/databricks/databricks tf_resource_type=databricks_permissions @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:55 @module=sdk.proto diagnostic_detail= diagnostic_summary="inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}" tf_req_id=ba1bdcf6-4ca3-4db9-28e7-91038dd60c37 tf_rpc=PlanResourceChange timestamp=2023-03-01T04:53:20.536Z
2023-03-01T04:53:20.536Z [ERROR] vertex "module.cluster_compute_policy[\"high_compute_cluster\"].databricks_permissions.can_use_cluster_policyinstance_profile" error: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T04:53:20.536Z [ERROR] vertex "module.cluster_compute_policy.databricks_permissions.can_use_cluster_policyinstance_profile (expand)" error: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T04:53:20.537Z [ERROR] vertex "module.cluster_compute_policy.databricks_permissions.can_use_cluster_policyinstance_profile (expand)" error: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T04:53:20.537Z [ERROR] vertex "module.cluster_compute_policy.databricks_permissions.can_use_cluster_policyinstance_profile (expand)" error: inner token: token error: {"error":"invalid_request","error_description":"Identity not found"}
2023-03-01T04:53:20.538Z [INFO]  backend/local: plan operation completed

alexott commented 1 year ago

@nfx - problems with 1.10.x

nfx commented 1 year ago

Thank you for reporting, will investigate today

nfx commented 1 year ago

@thaiphv are you able to apply this patch and re-run the pipeline? I remember you committed here before :) Make sure to check out the ~/.terraformrc approach.

diff --git a/permissions/resource_permissions.go b/permissions/resource_permissions.go
index f71bceb5..9ae18a8e 100644
--- a/permissions/resource_permissions.go
+++ b/permissions/resource_permissions.go
@@ -144,7 +144,7 @@ func (a PermissionsAPI) ensureCurrentUserCanManageObject(objectID string, object
    }
    me, err := scim.NewUsersAPI(a.context, a.client).Me()
    if err != nil {
-       return objectACL, err
+       return objectACL, fmt.Errorf("ensure current user: %w", err)
    }
    objectACL.AccessControlList = append(objectACL.AccessControlList, AccessControlChange{
        UserName:        me.UserName,
@@ -187,7 +187,7 @@ func (a PermissionsAPI) Update(objectID string, objectACL AccessControlChangeLis
        if owners == 0 {
            me, err := scim.NewUsersAPI(a.context, a.client).Me()
            if err != nil {
-               return err
+               return fmt.Errorf("api wrapper: update: %w", err)
            }
            // add owner if it's missing, otherwise automated planning might be difficult
            objectACL.AccessControlList = append(objectACL.AccessControlList, AccessControlChange{
@@ -393,7 +393,7 @@ func ResourcePermissions() *schema.Resource {
            }
            me, err := w.CurrentUser.Me(ctx)
            if err != nil {
-               return err
+               return fmt.Errorf("customize diff: me: %w", err)
            }
            // Plan time validation for object permission levels
            for _, mapping := range permissionsResourceIDFields() {
@@ -426,7 +426,7 @@ func ResourcePermissions() *schema.Resource {
            }
            me, err := w.CurrentUser.Me(ctx)
            if err != nil {
-               return err
+               return fmt.Errorf("read: me: %w", err)
            }
            entity, err := objectACL.ToPermissionsEntity(d, me.UserName)
            if err != nil {

nfx commented 1 year ago

@aihw-jimsolomos it looks like you're using Azure MSI. Can you open a separate issue and specify which provider configuration attributes you are using? Is it user-assigned or system-assigned MSI? What environment variables are supplied?

thaiphv commented 1 year ago

@nfx thanks for looking into it but I can't really test this on the CI/CD system at work.

nfx commented 1 year ago

@thaiphv oh, okay. I'll then release ~1.10.2~ 1.11.0 with these minor helpers to figure out the source of the problem. Which timezone are you in? Could you contact me on my Databricks email?

nfx commented 1 year ago

@thaiphv please update to v1.11.0. I've added a bit more debugging info to errors and logs.

Are you able to reproduce it outside of your CI/CD system? Can you temporarily add debug logging to your CI/CD server? Could you send debug logs to my Databricks email?

TF_LOG=DEBUG terraform apply -no-color 2>&1 | grep -A5 databricks | tee tf-debug.log


I'm particularly interested in lines like these:

2023-03-01T22:16:45.609+0100 [INFO]   provider.terraform-provider-databricks_v1.10.1: Explicit and implicit attributes: account_id, auth_type, google_service_account, host ...
...
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring pat auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring basic auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring oauth-m2m auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring bricks-cli auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring azure-msi auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring azure-client-secret auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring azure-cli auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Ignoring google-credentials auth, because google-id is preferred: timestamp=2023-03-01T22:16:45.609+0100
2023-03-01T22:16:45.609+0100 [INFO]  provider.terraform-provider-databricks_v1.10.1: Using Google Default Application Credentials for Accounts API: timestamp=2023-03-01T22:16:45.609+0100

nfx commented 1 year ago

Stats-wise, I'm seeing hundreds of customers who have upgraded to v1.10+ and use permissions resources for notebooks, cluster policies, and directories, across all authentication methods and operating systems. Now the challenge is figuring out those "special" setups.

Error: Get "https://XXX.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF

got me thinking about some sort of proxy. I'd need to see a full error message to confirm or deny it.

aihw-jimsolomos commented 1 year ago

I have applied the 1.11.0 patch and still see the same issue; I will create a separate issue shortly. For context, I am running Databricks on Azure in Australia. The configuration pipeline runs on a vanilla Ubuntu Microsoft-hosted agent, also in Australia.

nfx commented 1 year ago

@alexott, do we have any ADO examples with MSI?

thaiphv commented 1 year ago

@nfx, the error messages aren't much different:

Error: customize diff: me: Get "https://asic-preprod.cloud.databricks.com/api/2.0/preview/scim/v2/Me": EOF
with module.workspace_team_assessment_intelligence.databricks_permissions.cluster_usage[0],
on modules/workspace/databricks_clusters.tf line 61, in resource "databricks_permissions" "cluster_usage":
61: resource "databricks_permissions" "cluster_usage" {

I thought the endpoint might have been rate-limited, but the errors still occurred when I set DATABRICKS_RATE_LIMIT to 1.

nfx commented 1 year ago

So it’s customize diff after all. Thanks for trying the update.

I’ll adjust the logic over there.

Somehow I cannot reproduce the conditions for this error to happen in my environments.

alexott commented 1 year ago

> @alexott, do we have any ADO examples with MSI?

We have an open issue for it: https://github.com/databricks/terraform-databricks-examples/issues/2

nfx commented 1 year ago

@thaiphv please update the provider to v1.11.1