
Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] UnknownWorkerEnvironmentException when creating a cluster after creating a workspace #33

Closed sdebruyn closed 4 years ago

sdebruyn commented 4 years ago

Terraform Version

➜ terraform -v
Terraform v0.12.24
+ provider.azuread v0.8.0
+ provider.azurerm v2.6.0
+ provider.databricks v0.1.0
+ provider.http v1.2.0
+ provider.local v1.4.0
+ provider.null v2.1.2
+ provider.random v2.2.1

Affected Resource(s)

Terraform Configuration Files

Same one as in #21

Debug Output

2020-05-04T10:45:06.969+0200 [DEBUG] plugin.terraform-provider-databricks_v0.1.0: 2020/05/04 10:45:06 {"Method":"GET","URI":"https://eastus2.azuredatabricks.net/api/2.0/clusters/list-node-types?"}
2020/05/04 10:45:07 [ERROR] eval: *terraform.EvalConfigProvider, err: status 400: err Response from server {"error_code":"INVALID_PARAMETER_VALUE","message":"Delegate unexpected exception during listing node types: com.databricks.backend.manager.util.UnknownWorkerEnvironmentException: Unknown worker environment WorkerEnvId(workerenv-3375316063940170)"}
Error: status 400: err Response from server {"error_code":"INVALID_PARAMETER_VALUE","message":"Delegate unexpected exception during listing node types: com.databricks.backend.manager.util.UnknownWorkerEnvironmentException: Unknown worker environment WorkerEnvId(workerenv-1918878560143470)"}

  on databricks.tf line 10, in provider "databricks":
  10: provider "databricks" {

Expected Behavior

After creating the workspace, we should be able to create the cluster during the same apply run.

Actual Behavior

When you create a workspace and Terraform goes on to immediately create a cluster, you get the exception above. It works when you apply a second time a few seconds later.

Steps to Reproduce

  1. terraform apply

References

This third-party Databricks provider has the same issue

stikkireddy commented 4 years ago

@sdebruyn this comment https://github.com/databrickslabs/databricks-terraform/pull/27/files#r419517843 should contain information about how to temporarily address this. This ties back to issue #21.

lawrencegripper commented 4 years ago

We're also seeing this in recent tests. I can confirm that a retry like the one added in #27 should resolve the issue, as I did something similar in a hacked-together shell provider doing cluster creation.

stikkireddy commented 4 years ago

@lawrencegripper so does retrying the API fix the issue? And what should the timeout value be: something like 20-30 minutes, or an infinite retry? I was thinking that extending the retry logic to a longer timeout should work.

lawrencegripper commented 4 years ago

Yeah, retrying does fix the issue for me in all the testing I've done. Looks like the fix you put in on the PR should do the trick. I prefer your 30-minute timeout to my infinite loop: if things haven't worked after 30 minutes, they're unlikely to ever work!

Nice work on the workaround, I'll try and give this a test tomorrow.

sdebruyn commented 4 years ago

I just got this issue again; the timeout might not be long enough:

Error: status 400: err Response from server {"error_code":"BAD_REQUEST","message":"Current organization 6286703153333741 does not have any associated worker environments"}

  on databricks.tf line 23, in resource "databricks_cluster" "cluster":
  23: resource "databricks_cluster" "cluster" {

lawrencegripper commented 4 years ago

Yeah, I'm seeing this error today too; going to take a look. Looks like the cluster create call needs retry logic as well.

lawrencegripper commented 4 years ago

So my working theory here is that the workspace cluster API can go through two states, both related to this bug but each returning a different error message:

  1. com.databricks.backend.manager.util.UnknownWorkerEnvironmentException
  2. Current organization 6286703153333741 does not have any associated worker environments

At some point the workspace goes from nonexistent -> 1 -> 2 -> working.

The current retry logic handles the first of these but not the second.

We can either:

Just having more of a play now to try to repro this and prove the different error messages, but it's slow going as they're super intermittent for me. I think I'm going to write a script which spins up workspaces in parallel and polls the clusters endpoint straight after creation to log the results, stripping everything else out of the mix.

lawrencegripper commented 4 years ago

So here are the results of my experiments. I created this script, which creates a workspace and then polls it (well, it does 5 at the same time, just to compare).
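
The linked script isn't reproduced here, but a minimal sketch of the same polling approach might look like the following (in Go, since that's what the provider is written in; the DATABRICKS_HOST and DATABRICKS_TOKEN environment variable names are my own assumption, not from the script):

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed inputs: the workspace URL and a token able to call its REST API.
	host := os.Getenv("DATABRICKS_HOST") // e.g. https://eastus2.azuredatabricks.net
	token := os.Getenv("DATABRICKS_TOKEN")

	for {
		req, err := http.NewRequest("GET", host+"/api/2.0/clusters/list-node-types", nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("Authorization", "Bearer "+token)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Printf("%s transport error: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// Log timestamp, status, and body so intermittent failures show up over time.
			fmt.Printf("%s %d %s\n", time.Now().Format(time.RFC3339), resp.StatusCode, body)
		}
		time.Sleep(3 * time.Second) // roughly the 3-4s interval used in the real runs
	}
}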

Here is an example when polling list-node-types. The error goes away and then comes back in the middle, polling roughly every 3-4 seconds.

[screenshot: polling output showing the list-node-types error resolving and then reoccurring]

I don't see any errors of the form Current organization 6286703153333741 does not have any associated worker environments.

My working theory is that those are returned from a different call type and never come back from the list-node-types endpoint.

I'm going to rerun the script with the clusters/list endpoint to see if that repros.

lawrencegripper commented 4 years ago

No difference with the clusters/list endpoint added. Next on my list would be to attempt a clusters/create (and maybe stop the script once a cluster has been successfully created). cc @stuartleeks

lawrencegripper commented 4 years ago

So I've got an answer: shortly after creating a workspace, while it's listed as "provisioningState": "Succeeded" in Azure, the APIs will return errors for a period of time.

The period of time is not continuous. At points an error will occur, appear to resolve, then reoccur. My guess would be that some form of load balancer is sending the requests to various nodes in a web farm/cluster, some of which have an updated view of the new workspace and some of which don't.

On top of this, different endpoints independently error with different messages:

clusters/list-node-types errors with com.databricks.backend.manager.util.UnknownWorkerEnvironmentException

clusters/create errors with Current organization 6286703153333741 does not have any associated worker environments

At some points clusters/list-node-types does not error while a call to clusters/create will error.

I've updated the script to better track this; the example output tracks the time and the responses to the various calls.

As the current workaround checks clusters/list-node-types once and then the client continues as normal, it will fail to resolve the issue for two reasons:

  1. The check is a one-off, and an error may appear resolved but then reoccur, as shown in the example log.
  2. The clusters/create endpoint isn't checked, so it could still exhibit issues even though the list-node-types call is succeeding.

Proposed fix

It's not pretty, but we could keep the current validateAPI call in place, with the addition that once an error is observed, at least 5 consecutive calls to list-node-types must succeed before returning. This is a hack but will do the trick for now.
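
A rough sketch of that idea in Go; the waitForWorkspaceReady name and the listNodeTypes helper are hypothetical, and the threshold and poll interval are illustrative only:

package client

import (
	"context"
	"time"
)

// waitForWorkspaceReady polls until clusters/list-node-types succeeds 5 times
// in a row. The caller's context carries the overall timeout (e.g. 30 minutes).
func waitForWorkspaceReady(ctx context.Context, listNodeTypes func() error) error {
	const requiredSuccesses = 5
	const pollInterval = 5 * time.Second

	streak := 0
	for streak < requiredSuccesses {
		if err := ctx.Err(); err != nil {
			return err // overall timeout or cancellation
		}
		if err := listNodeTypes(); err != nil {
			// An error resets the streak, since failures can reoccur
			// after an apparently successful call.
			streak = 0
		} else {
			streak++
		}
		time.Sleep(pollInterval)
	}
	return nil
}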

Additionally, to protect against the errors on calls like clusters/create, wrap all calls from the provider to the API in a modest exponential retry/backoff. This could be done with something like HashiCorp's retryablehttp, a drop-in replacement for the standard HTTP client. This feels like a nice addition to make the provider more resilient to network issues or blips, while also helping in this case: roughly when clusters/list-node-types starts working, clusters/create isn't far behind.
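
For illustration, a minimal retryablehttp setup could look like this; the retry count and wait bounds are placeholder values, not anything agreed for the provider:

package client

import (
	"net/http"
	"time"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// newRetryingHTTPClient returns a client that retries failed requests with
// exponential backoff between attempts.
func newRetryingHTTPClient() *http.Client {
	c := retryablehttp.NewClient()
	c.RetryMax = 10
	c.RetryWaitMin = 1 * time.Second  // first backoff interval
	c.RetryWaitMax = 30 * time.Second // cap on the exponential backoff
	return c.StandardClient() // behaves like a plain *http.Client
}

StandardClient() is what would let this slot in wherever a plain *http.Client is used today.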

What do you think? Happy to pick these changes up and test them out. With the work we have in #51 to automate full environment creation (workspace and all), it's easier to reproduce.

sdebruyn commented 4 years ago

Since you suggested that it might be a load balancer redirecting to different instances, maybe #34 would help, since requests would then go straight to the right instance?

lawrencegripper commented 4 years ago

Interesting. I'll adapt the test script and re-run to see if using the workspace URL rather than the location-based one makes a difference (caveat: the load balancer theory is me guessing; I don't know how this is wired up behind the scenes).

Small world moment: the AzureRM issue about adding an attribute to let the Azure Terraform provider output the workspace URL is one I raised and was hoping to go fix, https://github.com/terraform-providers/terraform-provider-azurerm/issues/6732. It's currently blocked as it requires an SDK change to come in first, but after that it's hopefully good to go; not sure on the timelines though.

lawrencegripper commented 4 years ago

@sdebruyn sadly, using the direct workspace URL doesn't make a difference. Below you can see the highlighted lines where calls to clusters/list-node-types succeeded, in between calls before and after that failed. Changes to the script are here if you want to validate yourself: https://github.com/stuartleeks/databricks-terraform/commit/def0ce3268c30bc8905a968a60c3840f481df56d

[screenshot: polling log with highlighted successful list-node-types calls between failing ones]

Given this what do you think of the proposed fix?

lawrencegripper commented 4 years ago

One additional test I ran quickly was to check what status code comes back with each error. In both instances it is a 400 Bad Request, which makes the retry logic harder; with a server-side fault you'd ideally expect a 500. It's not a world-ender, though, as you can use the same logic currently in the validateWorkspaceApis method and check the body for a string matching the known errors.

Here is the full response in both scenarios:

HTTP/1.1 400 Bad Request
server: databricks
date: Wed, 13 May 2020 06:46:20 GMT
content-length: 247
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-content-type-options: nosniff
content-type: application/json

{"error_code":"INVALID_PARAMETER_VALUE","message":"Delegate unexpected exception during listing node types: com.databricks.backend.manager.util.UnknownWorkerEnvironmentException: Unknown worker environment WorkerEnvId(workerenv-5365497200078827)"}

HTTP/1.1 400 Bad Request
server: databricks
date: Wed, 13 May 2020 06:46:13 GMT
content-length: 126
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-content-type-options: nosniff
content-type: application/json

{"error_code":"BAD_REQUEST","message":"Current organization 477765680641509 does not have any associated worker environments"}

Having thought on this more overnight, I think the best option is to move the logic currently in validateWorkspaceAPIs so that a similar check is performed on all requests to the Databricks API, retrying any that fail with errors like the above.

Sound like a good plan?
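
For example, a CheckRetry policy along these lines could treat the two known transient 400 bodies as retryable (a sketch assuming hashicorp/go-retryablehttp; the function and variable names are made up):

package client

import (
	"bytes"
	"context"
	"io"
	"net/http"
	"strings"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// Substrings identifying the two transient "workspace not ready" errors seen
// above; both come back with a 400 status rather than a 5xx.
var transientErrors = []string{
	"UnknownWorkerEnvironmentException",
	"does not have any associated worker environments",
}

func checkWorkspaceTransient(ctx context.Context, resp *http.Response, err error) (bool, error) {
	// Keep the default behaviour for network errors and 5xx responses.
	if retry, e := retryablehttp.DefaultRetryPolicy(ctx, resp, err); retry || e != nil {
		return retry, e
	}
	if resp == nil || resp.StatusCode != http.StatusBadRequest {
		return false, nil
	}
	body, readErr := io.ReadAll(resp.Body)
	if readErr != nil {
		return false, readErr
	}
	// Restore the body so the caller can still read it when we don't retry.
	resp.Body = io.NopCloser(bytes.NewReader(body))
	for _, msg := range transientErrors {
		if strings.Contains(string(body), msg) {
			return true, nil
		}
	}
	return false, nil
}

Set on a client like the one sketched earlier via c.CheckRetry = checkWorkspaceTransient, every request the provider makes would get the same protection.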

sdebruyn commented 4 years ago

I agree with you. Meanwhile, I'm hoping that someone from the Azure team picks this up and updates the provisioning state accordingly; that would avoid all these workarounds.

algattik commented 4 years ago

As a workaround in my Python API client library, I've implemented this:

Usage with a newly provisioned workspace:

If using this module as part of a provisioning job, you need to call client.ensure_available().

When the first user logs in to a new Databricks workspace, workspace provisioning is triggered, and the API is not available until that job has completed (it usually takes under a minute, but could take longer depending on the network configuration). In that case you would get an error such as the following when calling the API:

{"error_code":"INVALID_PARAMETER_VALUE","message":"Unknown worker environment WorkerEnvId(workerenv-4312344789891641)"}

The method client.ensure_available(url="instance-pools/list", retries=100, delay_seconds=6) prevents this error by attempting to connect to the provided URL, retrying as long as the workspace is in provisioning state or until the given number of retries has elapsed.

stuartleeks commented 4 years ago

@stikkireddy are you ok to re-open this issue?

stuartleeks commented 4 years ago

Hi @algattik - thanks for your comment. In the discussion above, @lawrencegripper has posted the results of his testing. Unfortunately, we're seeing that we can get a successful call to the API followed by a failing call :-(

stikkireddy commented 4 years ago

@stuartleeks hey, reopened this issue. @lawrencegripper it might be worth using the retryablehttp client in the provider's client, but this will be more complicated to retry as the API returns a standard 400 Bad Request instead of a server-side 5xx error.

stuartleeks commented 4 years ago

I think the retryablehttp will fit in quite well. Have been looking at that this afternoon. Need to do some tidying up and then try it out 😃

stikkireddy commented 4 years ago

Closing this, as @stuartleeks was kind enough to implement the retryablehttp client as the base client implementation for the Databricks Go SDK. It has a set of transient errors that it retries upon seeing, and any future occurrence of this issue with a new error message can be handled by adding that message to the list of errors we retry on.