huydinhle opened 2 years ago
There is no explicit dependency between `databricks_mws_credentials` and `aws_iam_role_policy`. If you print your Terraform graph, it will probably try to create the policy attachment in parallel with `databricks_mws_credentials`. The `databricks_mws_credentials` will then fail because the expected policy is not there yet (a race condition).

To resolve this, add the following to `databricks_mws_credentials.this`:

```hcl
depends_on = [
  aws_iam_role_policy.this
]
```

Can you confirm that this is not what is happening? If it is, then this is just Terraform behavior.
I tried it and the exact same issue still happened @stikkireddy

```
│ Error: cannot create mws credentials: MALFORMED_REQUEST: Failed credential validation checks: please use a valid cross account IAM role with permissions setup correctly
│
│   with databricks_mws_credentials.this,
│   on mws-workspace.tf line 49, in resource "databricks_mws_credentials" "this":
│   49: resource "databricks_mws_credentials" "this" {
```
@stikkireddy I also tried the inline policy setup and it still failed for the same reason: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role#example-of-exclusive-inline-policies
Can you confirm the policy and assume role on AWS are created correctly? With the right `ExternalId` clause in the assume role/trust relationship, and the right permissions in the attached policy from our docs: https://docs.databricks.com/administration-guide/account-api/iam-role.html#create-a-cross-account-role-and-an-access-policy

If this all looks right and it still does not work, can you log a support ticket with Databricks? They have access to logs that can help facilitate better debugging.
@stikkireddy I think the policy and assume role are both created correctly, because the second time I ran `terraform apply` (and I tried it a few times), it worked immediately: the `databricks_mws_credentials` were created without any errors.

It looks like a consistency issue, to be honest. My guess is that the policy or the role isn't 100% ready for `databricks_mws_credentials` to use on the first apply. By the second apply, the policy and assume role are ready, and `databricks_mws_credentials` is created without any problem.
@huydinhle have you already created a support ticket for this? If not, please do and copy-paste its title/number here. I'll make sure it's routed to the proper platform team.
I was experiencing the exact same problem when spinning up workspaces from Terraform, and I could not fix it with explicit dependencies through `depends_on`. Same as OP, my first apply would fail on the credentials, only to succeed on the next apply (probably because AWS had been given enough time to properly register the cross-account IAM role).

I ended up fixing it like this for the time being:
```hcl
resource "time_sleep" "wait_for_cross_account_role" {
  depends_on      = [aws_iam_role_policy.this, aws_iam_role.cross_account_role]
  create_duration = "20s"
}

# Generate credentials to create and thereafter enter the Databricks workspace
resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.databricks_account_id
  role_arn         = aws_iam_role.cross_account_role.arn
  credentials_name = "${var.workspace_name}-creds"
  depends_on       = [time_sleep.wait_for_cross_account_role]
}
```
We may be able to add retries in the Terraform provider for this, but we're still figuring out the best approach for what to retry on.
We're also experiencing a similar issue when creating a `databricks_storage_credential` at the same time as the corresponding IAM role that it depends on. We're currently also working with an explicit `time_sleep` like @jschra in order not to have to manually retrigger our pipelines. But a retry to allow for IAM propagation would also be highly appreciated for that case.
@neinkeinkaffee I assume you can already build the provider for local deployment. Could you experiment with `resource.RetryContext` in the Create method here? Probably a `strings.Contains` check on `valid cross account IAM role` would be good enough for now... meanwhile I'm trying to get a better error code to retry on from the platform side.
Sounds good, I'll experiment with that 👍
Like this maybe? https://github.com/databricks/terraform-provider-databricks/pull/1454
This is for the `databricks_storage_credential` case, but `databricks_mws_credentials` would be similar except for the error to be caught. In the `databricks_storage_credential` case it's actually a 403 that we're currently circumventing with our `time_sleep`, with the message "Failed to get credentials: AWS IAM role in the metastore Data Access Configuration is not configured correctly. Please contact your account admin to update the configuration."
@neinkeinkaffee yes, looks like something that could stop the bleeding.
What are the results of running that in your environment? How many retries do you usually get?
Just checked. A typical number of retries is four (so the fifth request is the successful one), with 1, 2, 4, and 8 seconds in between.
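For reference, that schedule adds up to 1 + 2 + 4 + 8 = 15 seconds of cumulative backoff before the fifth request, which lines up well with the 20-second `time_sleep` workaround mentioned earlier in the thread. A trivial sketch of the arithmetic (illustrative only, not provider code):

```go
package main

import "fmt"

func main() {
	// Doubling backoff starting at 1s: waits of 1, 2, 4, and 8 seconds
	// before attempts 2 through 5.
	total, delay := 0, 1
	for attempt := 2; attempt <= 5; attempt++ {
		total += delay
		delay *= 2
	}
	fmt.Println(total) // prints: 15 (cumulative seconds before the fifth attempt)
}
```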
I am having a similar issue. I used to get around this problem with a `depends_on` block for the IAM role, but with the latest changes I can't seem to use a `depends_on` block, and I get the error below:
```
--- FAIL: TestSetupDatabricksWorkspace (15.51s)
apply.go:15: Error Trace: apply.go:15 create_e2e_test.go:44
Error: Received unexpected error: FatalError{Underlying: error while running command: exit status 1; There are some problems with the configuration, described below.

The Terraform configuration must be valid before initialization so that
Terraform can determine which modules and providers need to be installed.
╷
│ Error: Module module.create_workspace contains provider configuration
│
│ Providers cannot be configured within modules using count, for_each or
│ depends_on.
```
As described at https://www.terraform.io/language/modules/develop/providers#legacy-shared-modules-with-provider-configurations. Is there any workaround?
I've just had this same issue too, using 1.5.0 of the Databricks provider. I'm currently using the workaround suggested earlier in the discussion by @jschra!

Looks like there's a few seconds of lag between the AWS API call for setting up the IAM role returning and the role being ready for use by the Databricks validation check :(

I'm hoping the PR isn't too far off from getting merged to fix this!
any updates on this?
Here is the error block I see at the end. It seems to be related to their SDK and a timing issue. The sleep found above helped me.
````
╷
│ Error: cannot create mws credentials: unexpected error handling request: invalid character 'M' looking for beginning of value. This is likely a bug in the Databricks SDK for Go or the underlying REST API. Please report this issue with the following debugging information to the SDK issue tracker at https://github.com/databricks/databricks-sdk-go/issues. Request log:
│ ```
│ POST /api/2.0/accounts/{ACCOUNT_ID_HERE}/credentials
│ > * Host:
│ > * Accept: application/json
│ > * Authorization: REDACTED
│ > * Content-Type: application/json
│ > * Traceparent: UUID_HERE
│ > * User-Agent: databricks-tf-provider/1.47.0 databricks-sdk-go/0.41.0 go/1.21.10 os/darwin terraform/1.8.5 resource/mws_credentials auth/databricks-cli
│ > {
│ >   "aws_credentials": {
│ >     "sts_role": {
│ >       "role_arn": "arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME"
│ >     }
│ >   },
│ >   "credentials_name": "CREDENTIALS_NAME"
│ > }
│ < HTTP/2.0 400 Bad Request
│ < * Content-Type: text/plain; charset=utf-8
│ < * Date: Wed, 12 Jun 2024 18:21:00 GMT
│ < * Server: databricks
│ < * Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
│ < * Vary: Accept-Encoding
│ < * X-Content-Type-Options: nosniff
│ < MALFORMED_REQUEST: Failed credential validation checks: please use a valid cross account IAM role with permissions setup correctly.
│ ```
````
Hi there,
Thank you for opening an issue. Please note that we try to keep the Databricks Provider issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.
### Configuration

### Expected Behavior

Terraform apply should have just worked.

### Actual Behavior

Terraform apply failed with

### Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

1. `terraform apply`

### Terraform and provider versions

### Important Factoids

You can work around it by running `terraform apply` a second time. The second apply will succeed and the `mws_credentials` will be created.