databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] Terraform plan command actually creates the Databricks workspace when an error is encountered #54

Closed chriscrcodes closed 4 years ago

chriscrcodes commented 4 years ago

Hi there,

Terraform Version

Affected Resource(s)

Terraform Configuration Files

variable "client_id" {}
variable "client_secret" {}
variable "tenant_id" {}
variable "subscription_id" {}
variable "resource_group" {}
variable "managed_resource_group_name" {}

provider "azurerm" {
  version         = ">= 2.3.0"
  client_id       = var.client_id
  client_secret   = var.client_secret
  tenant_id       = var.tenant_id
  subscription_id = var.subscription_id
  features {}
}

resource "azurerm_databricks_workspace" "demo" {
  location                    = "westeurope"
  name                        = "databricks-demo-workspace"
  resource_group_name         = var.resource_group
  managed_resource_group_name = var.managed_resource_group_name
  sku                         = "standard"
}

resource "databricks_cluster" "demo" {
  autoscale {
    min_workers = 2
    max_workers = 8
  }
  cluster_name            = "databricks-demo-cluster"
  spark_version           = "6.4.x-scala2.11"
  node_type_id            = "Standard_DS3_v2"
  autotermination_minutes = 30
}

provider "databricks" {
  version = ">= 0.1"
  azure_auth = {
    managed_resource_group = azurerm_databricks_workspace.demo.managed_resource_group_name
    azure_region           = azurerm_databricks_workspace.demo.location
    workspace_name         = azurerm_databricks_workspace.demo.name
    resource_group         = azurerm_databricks_workspace.demo.resource_group_name
    client_id              = var.client_id
    client_secret          = var.client_secret
    tenant_id              = var.tenant_id
    subscription_id        = var.subscription_id
  }
}

Debug Output

https://gist.github.com/christophecremon/f839ac7b0f342d277f0ceaba2fbab165

Expected Behavior

Terraform generates an execution plan ("Terraform will perform the following actions"); the listed actions are only performed once the terraform apply command has been executed.

Actual Behavior

An error is generated: Error 404 The workspace with resource ID /subscriptions//resourceGroups/databricks-rg/providers/Microsoft.Databricks/workspaces/databricks-demo-workspace could not be found.

Terraform actually creates the Databricks workspace, with a Premium pricing tier, even though the terraform apply command has not been executed.

Steps to Reproduce

  1. terraform plan

sdebruyn commented 4 years ago

Hi @christophecremon, the resource azurerm_databricks_workspace is part of the AzureRM provider and not this one. I don't think there's anything we can do in this provider to fix this issue. I suggest opening an issue on the terraform-provider-azurerm repo. However, I noticed similar behaviour.

chriscrcodes commented 4 years ago

I think the issue comes from the Databricks provider: unable to find the Databricks workspace it needs as a prerequisite for creating the cluster, it then creates the workspace itself. A terraform plan command should not create any resource, so I believe the problem is in the Databricks provider. Can you please check the resource dependencies? Thanks!

linkinchow commented 4 years ago

@christophecremon Agree. I faced the same issue. It works only if the Azure Databricks workspace is provisioned in advance.

sdebruyn commented 4 years ago

This provider doesn't contain any code that can create workspaces. You can see in your logs that terraform-provider-databricks immediately fails because the expected workspace has not been created. It can't create a cluster because it can't create the token that it needs to authorise that request.

chriscrcodes commented 4 years ago

From my understanding, to create a cluster we must first create a Databricks workspace using the azurerm provider, and then invoke the Databricks provider to create the cluster, referencing the workspace created with the azurerm provider. What I don't understand is why the Databricks workspace (Premium pricing tier) is actually created by the terraform plan command, after the 404 error?

linkinchow commented 4 years ago

If we remove the databricks provider and databricks_cluster from the code, terraform plan does not create the workspace. So I'm guessing the databricks provider causes the issue?

stikkireddy commented 4 years ago

Hey, this is an issue caused by this provider, thank you for pointing it out. The root cause is the https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Databricks/workspaces/%s?api-version=2018-04-01 API call made as part of azure_auth: getWorkspaceID. Converting this call to simply generate the ID will resolve the workspace-creation issue.
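
For illustration, a minimal sketch (not the provider's actual code; the helper name and package layout are assumptions) of what generating the ID locally could look like, given the standard Azure resource ID layout:

package main

import "fmt"

// buildAdbWorkspaceResourceID assembles the workspace resource ID from values the
// provider already holds in its token payload, instead of asking the management API.
func buildAdbWorkspaceResourceID(subscriptionID, resourceGroup, workspaceName string) string {
    return fmt.Sprintf(
        "/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Databricks/workspaces/%s",
        subscriptionID, resourceGroup, workspaceName)
}

func main() {
    fmt.Println(buildAdbWorkspaceResourceID(
        "00000000-0000-0000-0000-000000000000", "databricks-rg", "databricks-demo-workspace"))
}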

@christophecremon

However it will still make your plans fail because the workspace does not exist.

So the proposed solution is:

Previous:

func (a *AzureAuth) getWorkspaceID(config *service.DBApiClientConfig) error {
    log.Println("[DEBUG] Getting Workspace ID via management token.")
    // Escape all the ids
    url := fmt.Sprintf("https://management.azure.com/subscriptions/%s/resourceGroups/%s"+
        "/providers/Microsoft.Databricks/workspaces/%s?api-version=2018-04-01",
        urlParse.PathEscape(a.TokenPayload.SubscriptionID),
        urlParse.PathEscape(a.TokenPayload.ResourceGroup),
        urlParse.PathEscape(a.TokenPayload.WorkspaceName))
    payload := &WorkspaceRequest{
        Properties: &WsProps{ManagedResourceGroupID: "/subscriptions/" + a.TokenPayload.SubscriptionID + "/resourceGroups/" + a.TokenPayload.ManagedResourceGroup},
        Name:       a.TokenPayload.WorkspaceName,
        Location:   a.TokenPayload.AzureRegion,
    }
    headers := map[string]string{
        "Content-Type":  "application/json",
        "cache-control": "no-cache",
        "Authorization": "Bearer " + a.ManagementToken,
    }

    var responseMap map[string]interface{}
    resp, err := service.PerformQuery(config, http.MethodPut, url, "2.0", headers, true, true, payload, nil)
    if err != nil {
        return err
    }

    log.Println("RESPONSE DEBUG WORKSPACE ID")
    log.Println(string(resp))
    err = json.Unmarshal(resp, &responseMap)
    if err != nil {
        return err
    }
    a.AdbWorkspaceResourceID = responseMap["id"].(string)
    return nil
}

Edited: Proposed change to GET request rather than PUT:

func (a *AzureAuth) getWorkspaceID(config *service.DBApiClientConfig) error {
    log.Println("[DEBUG] Getting Workspace ID via management token.")
    // Escape all the ids
    url := fmt.Sprintf("https://management.azure.com/subscriptions/%s/resourceGroups/%s"+
        "/providers/Microsoft.Databricks/workspaces/%s",
        urlParse.PathEscape(a.TokenPayload.SubscriptionID),
        urlParse.PathEscape(a.TokenPayload.ResourceGroup),
        urlParse.PathEscape(a.TokenPayload.WorkspaceName))
    headers := map[string]string{
        "Content-Type":  "application/json",
        "cache-control": "no-cache",
        "Authorization": "Bearer " + a.ManagementToken,
    }
    type apiVersion struct {
        ApiVersion string `url:"api-version"`
    }
    uriPayload := apiVersion{
        ApiVersion: "2018-04-01",
    }
    var responseMap map[string]interface{}
    resp, err := service.PerformQuery(config, http.MethodGet, url, "2.0", headers, false, true, uriPayload, nil)
    if err != nil {
        return err
    }
    err = json.Unmarshal(resp, &responseMap)
    if err != nil {
        return err
    }
    a.AdbWorkspaceResourceID = responseMap["id"].(string)
    return nil
}

This was an oversight on my part.

This should prevent the management API from creating the workspace; instead, the call will simply fail if the workspace does not exist.

chriscrcodes commented 4 years ago

@stikkireddy Thanks! What is your advice then on how to automate Databricks workspace creation and cluster creation in a DevOps pipeline? We should be able to add a dependency between the resources, right? Or move the Terraform files into two different directories (one for the workspace, without the Databricks provider, and one for the cluster, referencing the created workspace and using the Databricks provider)? Any idea?

stikkireddy commented 4 years ago

@christophecremon @stuartleeks I mulled over this a bit more in terms of what the problem is and how Terraform "configures" providers, and came up with a nifty solution using pointers and mutexes. Currently we pass a value of the client, generated by the configure function, to each resource. Rather than doing this, it may be better to pass along a pointer; this way we do not need to needlessly generate client objects per call.

This seems to work normally when testing. None of the client's attributes are altered in any shape or form except for the token. There will now be a new "custom" function field on DBApiClientConfig with the signature CustomAuthorizer func(*DBApiClientConfig) error.
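
Roughly, the config could carry that hook as a struct field like this (a sketch only; the package name and the other fields are illustrative assumptions, not the provider's actual layout):

package service

// DBApiClientConfig sketch: only CustomAuthorizer is taken from the proposal above;
// Host and Token are illustrative placeholders for the existing fields.
type DBApiClientConfig struct {
    Host             string
    Token            string
    CustomAuthorizer func(*DBApiClientConfig) error
}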

The provider's configure function will now return a pointer to the client so that all resources access the same client:

    var dbClient service.DBApiClient
    dbClient.SetConfig(&config)
    return &dbClient, nil

And an additional function:

func (c *DBApiClientConfig) Authenticate() error {
    if c.CustomAuthorizer != nil {
        clientAuthorizerMutex.Lock()
        defer clientAuthorizerMutex.Unlock()
        if reflect.ValueOf(c.Token).IsZero() {
            return c.CustomAuthorizer(c)
        }
        return nil
    }
    return nil
}

Calling Authenticate in the API:

// Groups returns an instance of GroupsAPI
func (c *DBApiClient) Groups() GroupsAPI {
    err := c.Config.Authenticate()
    if err != nil {
        panic(err)
    }
    return GroupsAPI{Client: c}
}

Authenticate can be called every time one of the "API objects" is requested. This allows plans to succeed even if the workspace does not exist yet, while still raising runtime errors at apply time, because the authorize function is only invoked when an API object is created for an actual API call. The mutex is needed because all resource objects refer to the single client pointer rather than individual copies, and we do not want any race conditions when they call Authenticate in parallel.
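
To illustrate the lazy, mutex-guarded authentication on a shared client pointer, here is a small self-contained sketch (not provider code; all names are made up):

package main

import (
    "fmt"
    "sync"
)

// clientConfig stands in for DBApiClientConfig; only the pattern matters here.
type clientConfig struct {
    Token            string
    CustomAuthorizer func(*clientConfig) error
}

var clientAuthorizerMutex sync.Mutex

// Authenticate runs the custom authorizer at most once, even when many resources
// call it concurrently, because they all share the same config pointer.
func (c *clientConfig) Authenticate() error {
    if c.CustomAuthorizer == nil {
        return nil
    }
    clientAuthorizerMutex.Lock()
    defer clientAuthorizerMutex.Unlock()
    if c.Token != "" {
        return nil // an earlier caller already authorized the client
    }
    return c.CustomAuthorizer(c)
}

func main() {
    calls := 0
    cfg := &clientConfig{
        CustomAuthorizer: func(c *clientConfig) error {
            calls++ // in the real provider this would create the workspace token
            c.Token = "dapi-example-token"
            return nil
        },
    }

    var wg sync.WaitGroup
    for i := 0; i < 8; i++ { // simulate Terraform creating resources in parallel
        wg.Add(1)
        go func() {
            defer wg.Done()
            _ = cfg.Authenticate()
        }()
    }
    wg.Wait()
    fmt.Println("authorizer invocations:", calls) // prints 1
}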

Did a dry run and everything seems to be working.

  1. Plan is not creating a workspace.
  2. Plan is not failing because the workspace does not exist.
  3. Apply works correctly with the pointers being dereferenced.
  4. No race conditions on authenticate when Terraform is creating resources in parallel.

chriscrcodes commented 4 years ago

@stikkireddy Thanks, can we test it?

stikkireddy commented 4 years ago

@christophecremon apologies that this has been a bit slow, I was on vacation. I should have a PR for this later tomorrow (EST).

stuartleeks commented 4 years ago

Hey @stikkireddy - we've just hit this issue in the work we're doing. If you have a branch with the PR work in it, we can give it a spin if that helps?

stuartleeks commented 4 years ago

#76 is an attempt to work around this. I gave it some testing over the weekend and have been using it this morning, and it seems to be working fine.