databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[FEATURE] Better Error Handler/Response #2989

Closed oliveirafilipe closed 10 months ago

oliveirafilipe commented 11 months ago

Use-cases

While trying to create a new workspace with a networking configuration that is already in use, an error is generated, as expected. However, the reason is not shown to the user in terraform apply. Below is a comparison of the error responses from a few Databricks interfaces:

CURL:

curl -XPOST -H "Authorization: Bearer $TOKEN" -d '{
   "account_id": "foo_hex_account_id",
   "aws_region": "us-east-2",
   "credentials_id": "foo_hex_credentials_id",
   "is_no_public_ip_enabled": true,
   "network_id": "foo_hex_network_id",
   "storage_configuration_id": "foo_hex_storage_configuration_id",
   "workspace_name": "my-foo-workspace"
}' 'https://accounts.cloud.databricks.com/api/2.0/accounts/my-foo-account/workspaces'
# Response
{"message":"MALFORMED_REQUEST: Network foo_hex_network_id is used by another workspace <workspace_ID>."} 
curl -vvv as asked by @mgyucht

```
$ curl -XPOST -vvv -H "Authorization: Bearer $TOKEN" -d '{
   "account_id": "foo_hex_account_id",
   "aws_region": "us-east-2",
   "credentials_id": "foo_hex_credentials_id",
   "is_no_public_ip_enabled": true,
   "network_id": "foo_hex_network_id",
   "storage_configuration_id": "foo_hex_storage_configuration_id",
   "workspace_name": "my-foo-workspace"
}' 'https://accounts.cloud.databricks.com/api/2.0/accounts/my-foo-account/workspaces'
Note: Unnecessary use of -X or --request, POST is already inferred.
* Trying 44.225.144.140:443...
* Connected to accounts.cloud.databricks.com (44.225.144.140) port 443 (#0)
* ALPN: offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/cert.pem
* CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: C=US; ST=California; L=San Francisco; O=Databricks Inc.; CN=*.cloud.databricks.com
* start date: Jun 16 00:00:00 2023 GMT
* expire date: Jun 14 23:59:59 2024 GMT
* subjectAltName: host "accounts.cloud.databricks.com" matched cert's "*.cloud.databricks.com"
* issuer: C=US; O=DigiCert Inc; CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
* SSL certificate verify ok.
* using HTTP/2
* h2 [:method: POST]
* h2 [:scheme: https]
* h2 [:authority: accounts.cloud.databricks.com]
* h2 [:path: /api/2.0/accounts/my-foo-account/workspaces]
* h2 [user-agent: curl/8.1.2]
* h2 [accept: */*]
* h2 [authorization: Bearer $TOKEN]
* h2 [content-length: 358]
* h2 [content-type: application/x-www-form-urlencoded]
* Using Stream ID: 1 (easy handle 0x14f012e00)
> POST /api/2.0/accounts/my-foo-account/workspaces HTTP/2
> Host: accounts.cloud.databricks.com
> User-Agent: curl/8.1.2
> Accept: */*
> Authorization: Bearer $TOKEN
> Content-Length: 358
> Content-Type: application/x-www-form-urlencoded
>
* We are completely uploaded and fine
< HTTP/2 400
< date: Mon, 04 Dec 2023 17:51:17 GMT
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< content-type: application/json;charset=iso-8859-1
< content-length: 123
< strict-transport-security: max-age=31536000; includeSubDomains; preload
< x-content-type-options: nosniff
< vary: Accept-Encoding
< server: databricks
<
* Connection #0 to host accounts.cloud.databricks.com left intact
{"message":"MALFORMED_REQUEST: Network foo_hex_network_id is used by another workspace ."}
```

UI:

Screenshot 2023-12-04 at 10 46 39

Terraform:

Toggle for TF_LOG=debug content

```
databricks_mws_workspaces.this: Creating...
2023-12-01T18:20:03.692-0300 [INFO] Starting apply for databricks_mws_workspaces.this
2023-12-01T18:20:03.692-0300 [DEBUG] databricks_mws_workspaces.this: applying the planned Create change
2023-12-01T18:20:06.250-0300 [DEBUG] provider.terraform-provider-databricks: non-retriable error: Bad Request: tf_provider_addr=registry.terraform.io/databricks/databricks tf_req_id=2f378374-9d0b-4c66-591c-99ddedcd76df tf_resource_type=databricks_mws_workspaces tf_rpc=ApplyResourceChange @caller=/Users/foo/terraform-provider-databricks/logger/logger.go:33 @module=databricks timestamp=2023-12-01T18:20:06.250-0300
2023-12-01T18:20:06.250-0300 [DEBUG] provider.terraform-provider-databricks: POST /api/2.0/accounts//workspaces
> {
>   "account_id": "foo_hex_account_id",
>   "aws_region": "us-east-2",
>   "credentials_id": "foo_hex_credentials_id",
>   "is_no_public_ip_enabled": true,
>   "network_id": "foo_hex_network_id",
>   "storage_configuration_id": "foo_hex_storage_configuration_id",
>   "workspace_name": "my-foo-workspace"
> }
< HTTP/2.0 400 Bad Request
< [non-JSON document of 15 bytes]. : tf_provider_addr=registry.terraform.io/databricks/databricks @caller=/Users/foo/terraform-provider-databricks/logger/logger.go:33 tf_rpc=ApplyResourceChange @module=databricks tf_req_id=2f378374-9d0b-4c66-591c-99ddedcd76df tf_resource_type=databricks_mws_workspaces timestamp=2023-12-01T18:20:06.250-0300
2023-12-01T18:20:06.250-0300 [ERROR] provider.terraform-provider-databricks: Response contains error diagnostic: tf_proto_version=5.4 tf_provider_addr=registry.terraform.io/databricks/databricks tf_req_id=2f378374-9d0b-4c66-591c-99ddedcd76df @caller=/Users/foo/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:58 diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="cannot create mws workspaces: Bad Request" tf_rpc=ApplyResourceChange @module=sdk.proto tf_resource_type=databricks_mws_workspaces timestamp=2023-12-01T18:20:06.250-0300
2023-12-01T18:20:06.251-0300 [ERROR] vertex "databricks_mws_workspaces.this" error: cannot create mws workspaces: Bad Request
```
databricks_mws_workspaces.this: Creating...
│ Error: cannot create mws workspaces: Bad Request
│ 
│   with databricks_mws_workspaces.this,
│   on workspace.tf line 1, in resource "databricks_mws_workspaces" "this":
│    1: resource "databricks_mws_workspaces" "this" {

Attempted Solutions

While debugging the provider, I found that the reason for the poor error feedback was that the API was returning HTML content.

This is the stdout when I added fmt.Println(response.Header.Get("Content-Type")) after this line in the SDK

2023-12-01T18:20:06.249-0300 [WARN]  unexpected data: registry.terraform.io/databricks/databricks:stdout text/html;charset=iso-8859-1

The TF_LOG=DEBUG content is consistent with that:

< HTTP/2.0 400 Bad Request
< [non-JSON document of 15 bytes]. <io.ReadCloser>[...]

I couldn't reproduce the HTML error response using curl.

Proposal

Provide better feedback when an error occurs, probably by making sure that the API response is JSON rather than HTML.

mgyucht commented 11 months ago

@oliveirafilipe Thank you for filing this issue! This is likely an issue with the underlying Go SDK. To help us debug, could you include the output of the curl command with the -v flag?

Separately, we'll change the Go SDK to include more debugging output when receiving these unexpected response types.

oliveirafilipe commented 11 months ago

@mgyucht Thank you for the quick response. I updated the issue description with the output you requested.

Just to make it clear, an HTTP 400 Bad Request is expected in this scenario, but I assume that the Content-Type: text/html;charset=iso-8859-1 in the API response is not.

I'd like to add that in some other cases, such as a wrong account ID in the path, the API returns HTML:

Example of HTML response from API

```
$ curl -XPOST -vvv -H "Authorization: Bearer $TOKEN" -d '{
   "account_id": "foo_hex_account_id",
   "aws_region": "us-east-2",
   "credentials_id": "foo_hex_credentials_id",
   "is_no_public_ip_enabled": true,
   "network_id": "foo_hex_network_id",
   "storage_configuration_id": "foo_hex_storage_configuration_id",
   "workspace_name": "my-foo-workspace"
}' 'https://accounts.cloud.databricks.com/api/2.0/accounts/A-WRONG-ACCOUNT_ID/workspaces'
Note: Unnecessary use of -X or --request, POST is already inferred.
* Trying 44.241.99.218:443...
* Connected to accounts.cloud.databricks.com (44.241.99.218) port 443 (#0)
* ALPN: offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/cert.pem
* CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: C=US; ST=California; L=San Francisco; O=Databricks Inc.; CN=*.cloud.databricks.com
* start date: Jun 16 00:00:00 2023 GMT
* expire date: Jun 14 23:59:59 2024 GMT
* subjectAltName: host "accounts.cloud.databricks.com" matched cert's "*.cloud.databricks.com"
* issuer: C=US; O=DigiCert Inc; CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
* SSL certificate verify ok.
* using HTTP/2
* h2 [:method: POST]
* h2 [:scheme: https]
* h2 [:authority: accounts.cloud.databricks.com]
* h2 [:path: /api/2.0/accounts/A-WRONG-ACCOUNT_ID/workspaces]
* h2 [user-agent: curl/8.1.2]
* h2 [accept: */*]
* h2 [authorization: Bearer $TOKEN]
* h2 [content-length: 358]
* h2 [content-type: application/x-www-form-urlencoded]
* Using Stream ID: 1 (easy handle 0x14e010a00)
> POST /api/2.0/accounts/A-WRONG-ACCOUNT_ID/workspaces HTTP/2
> Host: accounts.cloud.databricks.com
> User-Agent: curl/8.1.2
> Accept: */*
> Authorization: Bearer $TOKEN
> Content-Length: 358
> Content-Type: application/x-www-form-urlencoded
>
* We are completely uploaded and fine
< HTTP/2 400
< cache-control: must-revalidate,no-cache,no-store
< x-databricks-reason-phrase: Unable to load OAuth Config
< content-type: text/html;charset=iso-8859-1
< content-length: 335
< vary: Accept-Encoding
< date: Mon, 04 Dec 2023 18:02:37 GMT
< server: databricks
<
Error 400 Unable to load OAuth Config

HTTP ERROR 400

Problem accessing /api/2.0/accounts/A-WRONG-ACCOUNT_ID/workspaces. Reason:

    Unable to load OAuth Config

* Connection #0 to host accounts.cloud.databricks.com left intact
```
oliveirafilipe commented 11 months ago

While debugging further today (the logs in the issue description are from Dec 1st, 2023), I ran it again and it now seems to show a better response.

This log is from v1.30.0

 HTTP/2.0 400 Bad Request
 [non-JSON document of 15 bytes]. <io.ReadCloser>: @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/logger/logger.go:33 @module=databricks tf_resource_type=databricks_mws_workspaces tf_rpc=ApplyResourceChange tf_provider_addr=registry.terraform.io/databricks/databricks tf_req_id=53b31d2f-0aad-8056-6c29-c59440307af2 timestamp=2023-12-04T15:27:49.851-0300
2023-12-04T15:27:49.852-0300 [ERROR] provider.terraform-provider-databricks_v1.30.0: Response contains error diagnostic: tf_proto_version=5.4 tf_provider_addr=registry.terraform.io/databricks/databricks tf_rpc=ApplyResourceChange @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:58 diagnostic_severity=ERROR diagnostic_summary="cannot create mws workspaces: MALFORMED_REQUEST: Network network_id is used by another workspace workspace_id" tf_req_id=53b31d2f-0aad-8056-6c29-c59440307af2 tf_resource_type=databricks_mws_workspaces @module=sdk.proto diagnostic_detail= timestamp=2023-12-04T15:27:49.851-0300
2023-12-04T15:27:49.853-0300 [ERROR] vertex "databricks_mws_workspaces.this" error: cannot create mws workspaces: MALFORMED_REQUEST: Network network_id is used by another workspace workspace_id
╷
│ Error: cannot create mws workspaces: MALFORMED_REQUEST: Network network_id is used by another workspace workspace_id
│ 
│   with databricks_mws_workspaces.this,
│   on workspace.tf line 1, in resource "databricks_mws_workspaces" "this":
│    1: resource "databricks_mws_workspaces" "this" {

2023-12-04T15:51:54.080-0300 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2023-12-04T15:51:54.082-0300 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/databricks/databricks/1.30.0/darwin_arm64/terraform-provider-databricks_v1.30.0 pid=42378
2023-12-04T15:51:54.082-0300 [DEBUG] provider: plugin exited

It seems to me that this is related to the information (or its format) returned by the API.

This is the full TF_LOG=DEBUG stdout, including the fmt.Println(response.Header.Get("Content-Type")) (2nd line) added after this line in the SDK in f9d4e38f9f1c618a18590e2550aa6195ce94fdf2

2023-12-04T15:55:14.664-0300 [DEBUG] databricks_mws_workspaces.this: applying the planned Create change
=== HERE ===> 2023-12-04T15:55:17.494-0300 [WARN]  unexpected data: registry.terraform.io/databricks/databricks:stdout="application/json;charset=iso-8859-1"
2023-12-04T15:55:17.494-0300 [DEBUG] provider.terraform-provider-databricks: non-retriable error: MALFORMED_REQUEST: Network 76e88676-cef7-4f65-a80f-662e8a32d7de is used by another workspace 536520102525586.: @module=databricks tf_provider_addr=registry.terraform.io/databricks/databricks tf_req_id=f7f6530b-50fc-56e0-5469-15f3c70333dd @caller=/Users/foo/terraform-provider-databricks/logger/logger.go:33 tf_resource_type=databricks_mws_workspaces tf_rpc=ApplyResourceChange timestamp=2023-12-04T15:55:17.493-0300
2023-12-04T15:55:17.494-0300 [DEBUG] provider.terraform-provider-databricks: POST /api/2.0/accounts/<account-id>/workspaces
> {
>   "account_id": "foo_hex_account_id",
>   "aws_region": "us-east-2",
>   "credentials_id": "foo_hex_credentials_id",
>   "is_no_public_ip_enabled": true,
>   "network_id": "foo_hex_network_id",
>   "storage_configuration_id": "foo_hex_storage_configuration_id",
>   "workspace_name": "my-foo-workspace"
> }
< HTTP/2.0 400 Bad Request
< {
<   "message": "MALFORMED_REQUEST: Network network_id is used by another workspace workspace_id... (13 more bytes)"
< }: tf_req_id=f7f6530b-50fc-56e0-5469-15f3c70333dd @caller=/Users/foo/terraform-provider-databricks/logger/logger.go:33 @module=databricks tf_provider_addr=registry.terraform.io/databricks/databricks tf_resource_type=databricks_mws_workspaces tf_rpc=ApplyResourceChange timestamp=2023-12-04T15:55:17.493-0300
2023-12-04T15:55:17.494-0300 [ERROR] provider.terraform-provider-databricks: Response contains error diagnostic: @module=sdk.proto diagnostic_summary="cannot create mws workspaces: MALFORMED_REQUEST: Network network_id is used by another workspace workspace_id." tf_proto_version=5.4 tf_provider_addr=registry.terraform.io/databricks/databricks @caller=/Users/foo/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:58 tf_req_id=f7f6530b-50fc-56e0-5469-15f3c70333dd tf_rpc=ApplyResourceChange diagnostic_detail= diagnostic_severity=ERROR tf_resource_type=databricks_mws_workspaces timestamp=2023-12-04T15:55:17.494-0300
2023-12-04T15:55:17.496-0300 [ERROR] vertex "databricks_mws_workspaces.this" error: cannot create mws workspaces: MALFORMED_REQUEST: Network 76e88676-cef7-4f65-a80f-662e8a32d7de is used by another workspace 536520102525586.
╷
│ Error: cannot create mws workspaces: MALFORMED_REQUEST: Network network_id is used by another workspace workspace_id.
│ 
│   with databricks_mws_workspaces.this,
│   on workspace.tf line 1, in resource "databricks_mws_workspaces" "this":
│    1: resource "databricks_mws_workspaces" "this" {
│ 
╵
2023-12-04T15:55:17.500-0300 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2023-12-04T15:55:17.502-0300 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/databricks/databricks/9.9.1/darwin_arm64/terraform-provider-databricks pid=43017
oliveirafilipe commented 11 months ago

Hi, new scenario today 😅 I moved forward and hit another issue, but this time I received the whole HTML page in the error (v1.30.0):

databricks_mws_workspaces.this: Creating...
╷
│ Error: cannot create mws workspaces: Response from server (400 Bad Request) <html>
│ <head>
│ <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
│ <title>Error 400 MALFORMED_REQUEST: Failed network validation checks for network foo-network; got error: ArrayBuffer(error_type: &quot;securityGroup&quot;
│ error_message: &quot;Security Group with ID sg-xyz does not belong to VPC with ID vpc-xyz.&quot;
│ )</title>
│ </head>
│ <body><h2>HTTP ERROR 400</h2>
│ <p>Problem accessing /api/2.0/accounts/my-account-id/workspaces. Reason:
│ <pre>    MALFORMED_REQUEST: Failed network validation checks for network foo-network; got error: ArrayBuffer(error_type: &quot;securityGroup&quot;
│ error_message: &quot;Security Group with ID sg-xyz does not belong to VPC with ID vpc-xyz.&quot;
│ )</pre></p>
│ </body>
│ </html>
│ : invalid character '<' looking for beginning of value
│ 
│   with databricks_mws_workspaces.this,
│   on workspace.tf line 1, in resource "databricks_mws_workspaces" "this":
│    1: resource "databricks_mws_workspaces" "this" {
│ 
╵
ERRO[0052] 1 error occurred:

I feel like I'm spamming here, folks, I'm sorry. I'm just trying to help and bring to light all the scenarios that I have been through.

I'll stop for now, I think that's enough. Let me know if I can help with something else.
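In the HTML error page above, the actual reason sits inside the `<pre>` element with HTML entities such as `&quot;`. A minimal Go sketch of pulling that text out is shown here; `reasonFromHTML` is a hypothetical helper, and matching HTML with a regular expression is a shortcut that assumes the page keeps this simple shape.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// preBlock matches the first <pre>...</pre> element; (?s) lets "." span newlines.
var preBlock = regexp.MustCompile(`(?s)<pre>(.*?)</pre>`)

// reasonFromHTML is a hypothetical helper. It extracts the text of the first
// <pre> element, decodes HTML entities, and collapses runs of whitespace.
// The boolean is false when no non-empty <pre> text is found.
func reasonFromHTML(page string) (string, bool) {
	m := preBlock.FindStringSubmatch(page)
	if m == nil {
		return "", false
	}
	reason := strings.Join(strings.Fields(html.UnescapeString(m[1])), " ")
	return reason, reason != ""
}

func main() {
	page := "<html><body><h2>HTTP ERROR 400</h2>\n" +
		"<pre>    MALFORMED_REQUEST: Failed network validation checks; " +
		"error_message: &quot;Security Group with ID sg-xyz does not belong to VPC with ID vpc-xyz.&quot;\n" +
		")</pre></body></html>"
	if reason, ok := reasonFromHTML(page); ok {
		fmt.Println(reason)
	}
}
```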

tanmay-db commented 11 months ago

Hi @oliveirafilipe, thank you for the information. It is not spamming at all. Due to limited bandwidth we are prioritising other urgent issues. Just wanted to check, is this a blocker?

mgyucht commented 11 months ago

@oliveirafilipe sorry for the lack of update on this issue. This is terrific and will help us debug further. Note that I've also made another change in the SDK (https://github.com/databricks/databricks-sdk-go/pull/744) that will always include the request & response in the error message going forward to make it easier to report these issues.

In general, the TF provider and the underlying SDK depend on the REST API returning JSON appropriately. When that doesn't happen, this usually points to a bug in the underlying REST API. The improved error handling here will help TF and SDK users report issues to the SDK repo, where we can triage and follow up with the responsible team to fix said issue.

mgyucht commented 11 months ago

@oliveirafilipe What was the change you made to cause the API to return an HTML page by the way?

oliveirafilipe commented 10 months ago

Hi folks! I'm sorry for not getting back to you sooner.

@tanmay-db Not a blocker, at all. I figured out the problem I was facing when using the UI, and that's how I realized that the provider could show a better error response like the UI.

@mgyucht I was trying to create a new workspace using the same subnet as a previously created workspace. The error is expected, since two workspaces cannot overlap their subnets.

mgyucht commented 10 months ago

@oliveirafilipe FYI: improved error messages have been merged into the Go SDK and will make their way into TF in the next release (likely next week).

alexott commented 10 months ago

Fixed in Go SDK 0.28.1, which was released as part of TF 1.32.0.