
[FEATURE] Catalog-Workspace Bindings #2312

Closed · ehab-eb closed this issue 1 year ago

ehab-eb commented 1 year ago

Use-cases

Attempted Solutions

N/A

Proposal

Support for the new Workspace Bindings API.

There are a few options for implementation, but I see these as the most reasonable ones:

  1. Add a new argument to the databricks_catalog resource (see the sketch after this list)

    • Advantages:
      • No new resources for users to keep track of
      • Ensures the API is always invoked in a Unity Catalog context
    • Disadvantages:
      • Tight coupling with the catalog API, which could lead to issues down the road if, for example, the catalog API is enabled at account level while the binding API isn't
      • Limits the binding API's reach: a catalog has to be created with Terraform before it can also be bound with Terraform
      • There is currently no data source that retrieves catalog details beyond IDs, so the GET method would have no consistent implementation for now
  2. Create a new resource - For example, databricks_catalog_binding

    • Advantages:
      • No coupling with other resources
      • Would be more consistent with a similar data source to retrieve catalog bindings. e.g. data "databricks_catalog_binding" and data "databricks_catalog_bindings"
    • Disadvantages:
      • Being a separate resource, it will require more arguments and checks than the first option, such as specifying the metastore and catalog IDs
  3. Support both of the above options in a non-exclusive manner. The AWS provider implements a similar strategy for policy attachments with aws_iam_role_policy_attachment and aws_iam_role

    • Advantages:
      • Best of both worlds: users get the flexibility to bind catalogs as needed at any point in the stack
    • Disadvantages:
      • More work to implement
      • More user options = More developer issues
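
To make option 1 concrete, here is a minimal sketch of what the catalog-level argument could look like, assuming a hypothetical workspace_ids argument on databricks_catalog (the attribute name is illustrative only, not an existing provider argument):

resource "databricks_catalog" "dev" {
  name = "example_catalog"

  # Hypothetical argument: the workspaces allowed to access this catalog.
  # Option 2's separate-resource shape is sketched in the comments below.
  workspace_ids = [
    databricks_mws_workspaces.dev.workspace_id,
  ]
}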

zcking commented 1 year ago

I like option 2 and think a usage like the following would make sense:

resource "databricks_catalog_binding" {
  catalog_name = databricks_catalog.dev.name
  workspace_id = databricks_mws_workspaces.dev.workspace_id
}

IMO this would be flexible because it allows decentralized assignments (use the example above in multiple places), as well as centralized assignment to multiple workspaces with for_each:

variable "workspace_ids" {
  type = list(string)
  default = []
}

resource "databricks_catalog_binding" {
  for_each = var.workspace_ids
  catalog_name = databricks_catalog.dev.name
  workspace_id = each.value
}

Likewise, it aligns with the REST API nicely for data queries too.

data "databricks_catalog_binding" {
  catalog_name = databricks_catalog.dev.name
}

output "dev_workspace_ids" {
  value = data.databricks_catalog_binding.workspace_ids
}
ehab-eb commented 1 year ago

@nkvuong I can give this a go once we agree on an approach. It would be my first time using Go though, so I may need some guidance along the way.

nkvuong commented 1 year ago

hey @ehab-eb, sorry, I've been quite busy and missed this one. Please feel free to take a stab at it if you're up for it; happy to discuss more details about what you need to do. A good example to follow is #2324.

In my opinion, option 1 is not ideal as it requires multiple API calls when creating a catalog, which is brittle and more prone to failure.

Option 2 is great, and I do like the idea of having one workspace ID / catalog name pair per resource, which translates to assign_workspaces for create, unassign_workspaces for delete, and a combination of both for update.

However, for option 2 to work, you also need to update the catalog's isolation mode, which has to be done by updating the catalog resource (see PR #2357).
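
For illustration, a minimal sketch of how the two pieces might fit together, assuming #2357 lands an isolation_mode argument on databricks_catalog and the binding resource is adopted as proposed (the binding resource name is still an assumption at this point, not shipped provider API):

resource "databricks_catalog" "dev" {
  name           = "example_catalog"
  isolation_mode = "ISOLATED" # opt out of the catalog being visible to every workspace
}

resource "databricks_catalog_binding" "dev" {
  catalog_name = databricks_catalog.dev.name
  workspace_id = databricks_mws_workspaces.dev.workspace_id # only this workspace sees the catalog
}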

ehab-eb commented 1 year ago

@nkvuong Awesome! I agree that option 2 makes more sense.

However, I wonder if this resource should do any Terraform updates at all. After playing around with the API, it seems that the PATCH endpoint doesn't patch the whole config: it just adds any workspaces in the assign_workspaces parameter and, similarly, removes any workspaces in the unassign_workspaces parameter.

It also doesn't raise any errors if you try to assign a workspace that's already assigned, or if you do the same thing with unassigning.
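
For concreteness, a sketch of the two request bodies this behaviour implies, expressed with HCL's jsonencode purely for readability; it assumes the PATCH endpoint at /api/2.1/unity-catalog/workspace-bindings/catalogs/{catalog_name}, and the local names are illustrative only, not provider code:

locals {
  # Create: PATCH with assign_workspaces adds the workspace to the binding list
  assign_body   = jsonencode({ assign_workspaces = [1234567890101112] })

  # Delete: PATCH with unassign_workspaces removes it again; per the
  # observation above, both calls are effectively idempotent
  unassign_body = jsonencode({ unassign_workspaces = [1234567890101112] })
}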

If the Terraform resource looks like this,

resource "databricks_catalog_binding" "this" {
    catalog_name = "example_catalog"
    workspace_id = "1234567890101112"
}

then I would imagine that changing either of the argument values should result in Terraform forcing a replacement of the resource?

nkvuong commented 1 year ago

@ehab-eb it does make sense to just force replacement of the resource rather than trying to update. One edge case we need to handle is a catalog being renamed (it won't happen often, but it may come up): the API call would fail in that case, so the delete needs to happen gracefully.

One drawback of this design is that we won't be able to support importing existing catalog bindings, but that is better than having Terraform override the whole set of bindings every time.

If you want to take a stab at drafting a PR, I'll happily review and help with getting through integration testing and approval.