databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] Error reading previously created git credentials for service principal #1502

Closed jschra closed 2 years ago

jschra commented 2 years ago

Hi there,

In my configuration, I first build a workspace and then enter it to deploy services within it. As part of this, I also create a service principal for which I generate an obo token that I want to store separately for automation services.

In the last step of my configuration, I use this obo token and the host URL of my workspace to enter it a second time, now as the service principal, in order to store Git credentials there. Any subsequent pipelines can then enter my workspace and start pulling Git repositories without having to worry about this.

Now when I initially do this, it works fine. I can store the git credentials, I can pull a repo and then call it a day. If I run an additional plan or apply right after, it also still works.

When I try to rerun my configurations the next day, however, it no longer seems to pick up the Terraform provider I configured. I get the following error when I try to run my configs locally:

(screenshot: error from the local Terraform run)

and the following error when my DevOps pipeline tries to run it:

(screenshot: error from the DevOps pipeline run)

Apparently, it no longer picks up the databricks.git provider I pass to the resource and instead tries to look up credentials in other places, where it then fails.

My question is: how? In my configuration I pass the host URL and obo token to a separately configured databricks provider, which I then explicitly use in the resource blocks for the Git credentials and the Git repos. More on that below.

Configuration

# Provider.tf -- I left out my back-end config, but it's set to S3

terraform {
  required_providers {
    databricks = {
      source                = "databricks/databricks"
      configuration_aliases = [databricks, databricks.mws, databricks.git]
      version               = "~> 1.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
    azuredevops = {
      source  = "microsoft/azuredevops"
      version = ">=0.2.0"
    }
  }
}

# Databricks providers

# -- Provider initialized for workspace creation
#     I use this module to create the workspace
provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = var.databricks_account_username
  password = var.databricks_account_password
}

# -- Provider initialized for workspace population
#      I use this module to add services to the workspace as admin
provider "databricks" {
  host     = module.pm_databricks_workspace.databricks_host
  username = var.databricks_account_username
  password = var.databricks_account_password
}

# -- Provider initialized for storing git credentials
#    Note that this uses the PAT of the service principal to basically login to the
#    workspace and store information for that 'user'. The obo token is generated by the databricks_services module I created
provider "databricks" {
  alias = "git"
  host  = module.pm_databricks_workspace.databricks_host
  token = module.pm_databricks_services.service_principal_token
}

# git_integration.tf

# Store DevOps Git credentials on Databricks
resource "databricks_git_credential" "this" {
  depends_on = [module.pm_databricks_services]

  provider              = databricks.git #provider initialized using obo_token
  git_provider          = "azureDevOpsServices"
  git_username          = var.git_username
  personal_access_token = var.devops_token
  force                 = true
}

# Pull repos into Databricks
resource "databricks_repo" "this" {
  depends_on = [databricks_git_credential.this]
  for_each   = toset(var.git_repos)

  provider = databricks.git
  url      = each.key
}

Expected Behavior

I expect that Terraform plan/apply runs without any issues.

Actual Behavior

Terraform plan/apply stops due to an error stating that it cannot retrieve the databricks_git_credential. It does work right after running a successful apply, but it does not if a day or so passes.

Steps to Reproduce

  1. Use an obo token and host URL to enter a workspace and set Git credentials
  2. Run apply
  3. Wait a day or so
  4. Try to run plan/apply

Terraform and provider versions

Terraform v1.2.3
on darwin_amd64
+ provider registry.terraform.io/databricks/databricks v1.0.2
+ provider registry.terraform.io/hashicorp/aws v4.22.0
+ provider registry.terraform.io/hashicorp/random v3.3.2
+ provider registry.terraform.io/hashicorp/time v0.7.2
+ provider registry.terraform.io/microsoft/azuredevops v0.2.2

Debug Output

https://gist.github.com/jschra/d9e958460193b1c20ea644bb036f9668

Important Factoids

None

nkvuong commented 2 years ago

@jschra Have you set an expiry date for the obo token (module.pm_databricks_services.service_principal_token)?

jschra commented 2 years ago

@nkvuong Yes, I do. Below you can find the snippet in my services module that creates the obo token:

# Create service principal for API access from DevOps
resource "databricks_service_principal" "this" {
  display_name = "DevOps automation service principal"

  # No access by default, only through groups
  allow_cluster_create       = false
  allow_instance_pool_create = false
  databricks_sql_access      = false
}

# Add to developers group
resource "databricks_group_member" "sp" {
  group_id  = databricks_group.developers.id
  member_id = databricks_service_principal.this.id
}

# Generate PAT
resource "databricks_obo_token" "this" {
  depends_on       = [databricks_group_member.sp]
  application_id   = databricks_service_principal.this.application_id
  comment          = "PAT on behalf of ${databricks_service_principal.this.display_name}"
  lifetime_seconds = 129600
}

alexott commented 2 years ago

it's not related to Git, but more to the authentication...

nkvuong commented 2 years ago

basically what happened is that the obo token expires after 1.5 days (129600 seconds = 36 hours), so your databricks.git provider fails to initialise, leading to the error message: cannot read git credential: cannot configure databricks-cli auth: /Users/jschra/.databrickscfg has no DEFAULT profile configured. Attributes used: host. Please check https://registry.terraform.io/providers/databricks/databricks/latest/docs#authentication for details

jschra commented 2 years ago

@nkvuong, ok lol, that was really stupid on my side. Apologies for that.

The question is, however, how to make such a configuration robust against obo tokens that expire. Right now it's (accidentally) set to 1.5 days; I'd probably set it to 90 days. But then again, I would have the same problem after 90 days.

Any ideas on how I can ensure that the token is recreated before it expires? Otherwise I will eventually always end up in this situation (given that I want to keep all these configurations in one Terraform apply run).

nkvuong commented 2 years ago

@jschra there is no easy way to do this in just a single Terraform apply; it is fundamental to how Terraform works.

The key issue here is that providers need to be instantiated for all operations, not just apply. In theory, terraform apply will succeed (because it will handle the dependency correctly, i.e. generate the obo token, supply it to the provider, then read the git credential), but terraform plan will fail (because there is no token for the databricks.git provider).

My suggestion would be to split this into two separate configurations, using the output of the first one as input for the second one (via a secrets manager, for example). Your apply script then needs to run two terraform apply commands sequentially, but hopefully that's not too much work.
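
For illustration, a minimal sketch of that handover, assuming AWS Secrets Manager and a hypothetical secret name (none of these resource names come from the original configuration):

# Configuration 1: create the obo token and publish it for the second configuration
resource "aws_secretsmanager_secret" "databricks_obo" {
  name = "databricks/obo-token" # hypothetical secret name
}

resource "aws_secretsmanager_secret_version" "databricks_obo" {
  secret_id     = aws_secretsmanager_secret.databricks_obo.id
  secret_string = databricks_obo_token.this.token_value # obo token created in this configuration
}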

jschra commented 2 years ago

Yeah, makes sense. I reckon this would only work if the obo token had a parameter that forces it to be recreated a set amount of time before it expires. Say, if you run with a token for 90 days, force it to be replaced after 85 days. With a pipeline that runs the TF configurations daily, that would solve the problem.

But that's more of a feature request anyway. Thanks a lot for looking along and pulling out this very clumsy mistake of mine!

nkvuong commented 2 years ago

@jschra you could try combining time_rotating with replace_triggered_by, so that the token is replaced after 85 days
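
A minimal sketch of that idea, assuming a 90-day token that is rotated five days before expiry (resource names are illustrative, not taken from the configuration above):

# Rotates every 85 days; each rotation replaces this resource
resource "time_rotating" "obo" {
  rotation_days = 85
}

resource "databricks_obo_token" "this" {
  application_id   = databricks_service_principal.this.application_id
  comment          = "PAT on behalf of ${databricks_service_principal.this.display_name}"
  lifetime_seconds = 90 * 24 * 60 * 60 # 90 days

  lifecycle {
    # Force a new token whenever time_rotating.obo is replaced
    replace_triggered_by = [time_rotating.obo]
  }
}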

jschra commented 2 years ago

That’s a great idea @nkvuong! Will give it a try tomorrow and keep you posted. Cheers!

jschra commented 2 years ago

@nkvuong I tried adding logic using the resources you mentioned, and it does allow me to replace the obo token before it expires. If I use said token in a provider to subsequently enter the workspace, however, it still bugs out. Apparently plan detects the lifecycle rule that replaces the token, resulting in an empty token at plan time, which makes the plan fail because it cannot log in to the workspace using the token.

This is what my config looks like right now:

# Create service principal for API access from DevOps
resource "databricks_service_principal" "this" {
  provider     = databricks.test
  display_name = "test PAT"

  # No access by default, only through groups
  allow_cluster_create       = false
  allow_instance_pool_create = false
  databricks_sql_access      = false
}

# Create rotation time object of a minute
resource "time_rotating" "example" {
  rotation_minutes = 1
}

resource "random_id" "test" {
  keepers = {
    time_rotating = time_rotating.example.id
  }
  byte_length = 8
}

# Generate PAT
resource "databricks_obo_token" "this" {
  provider = databricks.test
  application_id   = databricks_service_principal.this.application_id
  comment          = "PAT on behalf of ${databricks_service_principal.this.display_name}"
  lifetime_seconds = 60000

  lifecycle {
    replace_triggered_by = [
      random_id.test.hex
    ]
  }
}

Works perfectly fine if I do not have a provider based on the token, but if I do, then I again get the following error: (screenshot of the error)

I guess I'll run with your advice to split the configs in two, as that makes all of this significantly easier. Thanks again for taking the time to take a look and think along, even though the initial issue was invalid!

nkvuong commented 2 years ago

@jschra this would work if you split the configuration up and run terraform apply sequentially, as the new token will be available for the provider in the second configuration to pick up.
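
For completeness, a minimal sketch of what the second configuration could look like, assuming the token was written to AWS Secrets Manager as sketched earlier (names are illustrative):

# Configuration 2: read the token written by configuration 1 and use it for the git provider
data "aws_secretsmanager_secret_version" "databricks_obo" {
  secret_id = "databricks/obo-token" # same hypothetical secret name as in configuration 1
}

provider "databricks" {
  alias = "git"
  host  = var.databricks_host # workspace URL, also handed over from configuration 1
  token = data.aws_secretsmanager_secret_version.databricks_obo.secret_string
}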