dynatrace-oss / terraform-provider-dynatrace

Apache License 2.0

Connection reset when attempting to create dynatrace_automation_workflow #354

Closed: mattBaumBeneva closed this issue 11 months ago

mattBaumBeneva commented 1 year ago

Describe the bug

We are attempting to create Automation Workflows via Terraform. We have added the required scopes to our OAuth client, but we obtain the following network error when applying:

12:23:04  │ Error: Post "https://xcz80425.apps.dynatrace.com/platform/automation/v1/workflows": read tcp 10.0.2.100:35668->23.22.184.182:443: read: connection reset by peer
12:23:04  │ 
12:23:04  │   with dynatrace_automation_workflow.easyTravel_guardian_Validation_terraform,
12:23:04  │   on testWorkflow.tf line 11, in resource "dynatrace_automation_workflow" "easyTravel_guardian_Validation_terraform":
12:23:04  │   11: resource "dynatrace_automation_workflow" "easyTravel_guardian_Validation_terraform" {
12:23:04  │ 

We have simplified the Workflow used for testing, as our real workflow is complex:

resource "dynatrace_automation_workflow" "easyTravel_guardian_Validation_terraform" {
  title = "easyTravel guardian Validation Terraform"

  tasks {
    task {
      name        = "run_validation_easy_travel"
      description = "Automation action to start a Site Reliability Guardian validation"
      action      = "dynatrace.site.reliability.guardian:validate-guardian-action"
      active      = false
      input = jsonencode({
        "executionId" : "{{ execution().id }}",
        "objectId" : "vu9U3hXa3q0AAAABADFhcHA6ZHluYXRyYWNlLnNpdGUucmVsaWFiaWxpdHkuZ3VhcmRpYW46Z3VhcmRpYW5zAAZ0ZW5hbnQABnRlbmFudAAkY2RmYzAzNDEtZjhjNC0zNDZkLTkxZmUtMDZmZmY4NTcxZThhvu9U3hXa3q0",
        "timeframeInputType" : "timeframeSelector",
        "timeframeSelector" : {
          "from" : "now-7d",
          "to" : "now"
        }
      })
      position {
        x = 2
        y = 1
      }
    }
  }
}

When applying the resulting JSON via the REST API with Postman:

{"isPrivate":true,"schemaVersion":3,"tasks":{"run_validation_easy_travel":{"action":"dynatrace.site.reliability.guardian:validate-guardian-action","active":false,"concurrency":null,"description":"Automation action to start a Site Reliability Guardian validation","input":{"executionId":"{{ execution().id }}","objectId":"vu9U3hXa3q0AAAABADFhcHA6ZHluYXRyYWNlLnNpdGUucmVsaWFiaWxpdHkuZ3VhcmRpYW46Z3VhcmRpYW5zAAZ0ZW5hbnQABnRlbmFudAAkY2RmYzAzNDEtZjhjNC0zNDZkLTkxZmUtMDZmZmY4NTcxZThhvu9U3hXa3q0","timeframeInputType":"timeframeSelector","timeframeSelector":{"from":"now-7d","to":"now"}},"name":"run_validation_easy_travel","position":{"x":2,"y":1},"timeout":900}},"title":"easyTravel guardian Validation Terraform"}

We obtain an HTTP 201, as expected:

{
    "id": "32629ffa-4927-46f2-bf8a-a93150b1c928",
    "title": "easyTravel guardian Validation Terraform",
    "taskDefaults": {},
    "usages": [],
    "lastExecution": null,
    "description": "",
    "version": 1,
    "actor": "bcb27145-f928-41e6-8e7c-90d6de5577e7",
    "owner": "bcb27145-f928-41e6-8e7c-90d6de5577e7",
    "ownerType": "USER",
    "isPrivate": true,
    "triggerType": "Manual",
    "schemaVersion": 3,
    "trigger": {},
    "modificationInfo": {
        "createdBy": "bcb27145-f928-41e6-8e7c-90d6de5577e7",
        "createdTime": "2023-11-06T17:33:02.069528Z",
        "lastModifiedBy": "bcb27145-f928-41e6-8e7c-90d6de5577e7",
        "lastModifiedTime": "2023-11-06T17:33:02.069543Z"
    },
    "tasks": {
        "run_validation_easy_travel": {
            "action": "dynatrace.site.reliability.guardian:validate-guardian-action",
            "active": false,
            "concurrency": null,
            "description": "Automation action to start a Site Reliability Guardian validation",
            "input": {
                "executionId": "{{ execution().id }}",
                "objectId": "vu9U3hXa3q0AAAABADFhcHA6ZHluYXRyYWNlLnNpdGUucmVsaWFiaWxpdHkuZ3VhcmRpYW46Z3VhcmRpYW5zAAZ0ZW5hbnQABnRlbmFudAAkY2RmYzAzNDEtZjhjNC0zNDZkLTkxZmUtMDZmZmY4NTcxZThhvu9U3hXa3q0",
                "timeframeInputType": "timeframeSelector",
                "timeframeSelector": {
                    "from": "now-7d",
                    "to": "now"
                }
            },
            "name": "run_validation_easy_travel",
            "position": {
                "x": 2,
                "y": 1
            },
            "timeout": 900
        }
    }
}
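
For anyone reproducing the Postman call from a script, here is a minimal Go sketch of the same POST. The endpoint path is the one from the error message above; the environment variable names `DT_APPS_URL` and `DT_BEARER_TOKEN` and the payload file argument are assumptions made for illustration, as is the use of a bearer token obtained separately.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

// createWorkflow posts a workflow definition to the automation API and
// returns the HTTP status code and the raw response body.
func createWorkflow(client *http.Client, baseURL, token string, payload []byte) (int, []byte, error) {
	endpoint := baseURL + "/platform/automation/v1/workflows"
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return 0, nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		// Network-level failures (such as a connection reset) surface here.
		return 0, nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, body, err
}

func main() {
	// Hypothetical configuration: base URL such as https://<tenantid>.apps.dynatrace.com
	baseURL, token := os.Getenv("DT_APPS_URL"), os.Getenv("DT_BEARER_TOKEN")
	if baseURL == "" || token == "" || len(os.Args) < 2 {
		fmt.Println("usage: DT_APPS_URL=... DT_BEARER_TOKEN=... <program> workflow.json")
		return
	}
	payload, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read payload:", err)
		return
	}
	status, body, err := createWorkflow(http.DefaultClient, baseURL, token, payload)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("HTTP %d\n%s\n", status, body)
}
```

Running this from the same host that executes Terraform helps separate provider problems from network problems, since both then take the same route to the API.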

Expected behavior

The provider should correctly create the Workflow. API-level errors should be caught and reported clearly.

Additional context

Provider v1.45.0.

mattBaumBeneva commented 11 months ago

This bug is still present in provider version 1.47.0. Tested today.

mattBaumBeneva commented 11 months ago

@Dynatrace-Reinhard-Pilz or @kishikawa12 , any thoughts on the cause?

Dynatrace-Reinhard-Pilz commented 11 months ago

I will look into that before the next release.

mattBaumBeneva commented 11 months ago

@Dynatrace-Reinhard-Pilz , very much appreciated.

Dynatrace-Reinhard-Pilz commented 11 months ago

Hello @mattBaumBeneva,

The upcoming release will contain the necessary routines that allow us to capture HTTP traffic for the resource dynatrace_automation_workflow. In order to capture that information you will have to set these environment variables:

With these environment variables you will get the complete HTTP traffic - including the negotiation with sso.dynatrace.com - dumped into a file named terraform-provider-dynatrace.http.log.

You mentioned initially that you're able to create the Workflow using Postman. Are you using Postman on the same host that executes Terraform for you? We've had issues in the past where Terraform was executed within a Jenkins pipeline and networking limitations for the Jenkins workers interfered with OAuth2, which requires reaching out to sso.dynatrace.com in addition to the hosts that serve the Dynatrace Environment.

mattBaumBeneva commented 11 months ago

The firewall issue to which you are referring was at our company, so it should be fine now (we apply other resources with the OAuth2 flow). I will perform tests once the new version is released and provide the logs.

mattBaumBeneva commented 11 months ago

@Dynatrace-Reinhard-Pilz , we cannot actually test the fix. We are blocked by: https://github.com/dynatrace-oss/terraform-provider-dynatrace/issues/366

mattBaumBeneva commented 11 months ago

@Dynatrace-Reinhard-Pilz , I was able to test. I no longer see a connection reset, but the debugging log indicates the plugin is crashing (panic) on a Segfault:

12:03:57  │ Error: Plugin did not respond
12:03:57  │ 
12:03:57  │   with dynatrace_automation_workflow.easyTravel_guardian_Validation_terraform,
12:03:57  │   on testWorkflow.tf line 11, in resource "dynatrace_automation_workflow" "easyTravel_guardian_Validation_terraform":
12:03:57  │   11: resource "dynatrace_automation_workflow" "easyTravel_guardian_Validation_terraform" {
12:03:57  │ 
12:03:57  │ The plugin encountered an error, and failed to respond to the
12:03:57  │ plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may
12:03:57  │ contain more details.
12:03:57  Stack trace from the terraform-provider-dynatrace_v1.47.2 plugin:
12:03:57  
12:03:57  panic: runtime error: invalid memory address or nil pointer dereference
12:03:57  [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x93b1a3]
12:03:57  
12:03:57  goroutine 191 [running]:
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/dynatrace/api/automation/workflows.(*MyRoundTripper).RoundTrip(0xc000d12cf0, 0xc000df0600)
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/dynatrace/api/automation/workflows/service.go:65 +0x3a3
12:03:57  golang.org/x/oauth2.(*Transport).RoundTrip(0xc000c90c40, 0xc001195200)
12:03:57    golang.org/x/oauth2@v0.11.0/transport.go:55 +0x3ea
12:03:57  net/http.send(0xc001195200, {0x167bfa0, 0xc000c90c40}, {0x1422100?, 0x1?, 0x0?})
12:03:57    net/http/client.go:251 +0x5f7
12:03:57  net/http.(*Client).send(0xc000d359e0, 0xc001195200, {0xc000d0cc00?, 0x1000000010b6200?, 0x0?})
12:03:57    net/http/client.go:175 +0x9b
12:03:57  net/http.(*Client).do(0xc000d359e0, 0xc001195200)
12:03:57    net/http/client.go:715 +0x8fc
12:03:57  net/http.(*Client).Do(...)
12:03:57    net/http/client.go:581
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest.executeRequest.func1()
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest/request.go:133 +0x85
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest.executeWithRateLimiter(0xc000c5c730)
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest/request.go:164 +0x66
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest.executeRequest(0xc000d359e0, 0xc001195200)
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest/request.go:132 +0x134
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest.Post(0x0?, {0xc000956140, 0x44}, {0xc0003c1500, 0x2c5, 0x300})
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/rest/request.go:80 +0x173
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/client/automation.Client.INSERT({{0xc00004cb70?, 0xc00020b880?}, 0xc000d359e0?, 0xc000286630?}, 0xd?, {0xc0003c1500, 0x2c5, 0x300})
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/monaco/pkg/client/automation/client.go:153 +0x12d
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/dynatrace/api/automation/workflows.(*service).Create(0x40a5b3?, 0xc00020b880)
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/dynatrace/api/automation/workflows/service.go:132 +0x7e
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/dynatrace/settings.(*GenericCRUDService[...]).CreateWithContext(0xc000d126d0, {0x16b2a10, 0xc000bdaf60}, {0x16a5ec0, 0xc00020b880?})
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/dynatrace/settings/generic_crud_service.go:60 +0xbd
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/resources.(*Generic).Create(0xc00071baa0, {0x16b2a10, 0xc000bdaf60}, 0xc000be6900, {0x109d960, 0xc0005028c0})
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/resources/generic.go:181 +0x270
12:03:57  github.com/dynatrace-oss/terraform-provider-dynatrace/provider/logging.Enable.func1({0x16b2a10, 0xc000bdaf60}, 0x0?, {0x109d960, 0xc0005028c0})
12:03:57    github.com/dynatrace-oss/terraform-provider-dynatrace/provider/logging/logging.go:103 +0x83
12:03:57  github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).create(0xc0007d7340, {0x16b2a48, 0xc000c77230}, 0xd?, {0x109d960, 0xc0005028c0})
12:03:57    github.com/hashicorp/terraform-plugin-sdk/v2@v2.25.0/helper/schema/resource.go:707 +0x12e
12:03:57  github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc0007d7340, {0x16b2a48, 0xc000c77230}, 0xc000d8a9c0, 0xc000199100, {0x109d960, 0xc0005028c0})
12:03:57    github.com/hashicorp/terraform-plugin-sdk/v2@v2.25.0/helper/schema/resource.go:837 +0xa85
12:03:57  github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*GRPCProviderServer).ApplyResourceChange(0xc000010210, {0x16b2a48?, 0xc000c77110?}, 0xc00012f900)
12:03:57    github.com/hashicorp/terraform-plugin-sdk/v2@v2.25.0/helper/schema/grpc_provider.go:1021 +0xe8d
12:03:57  github.com/hashicorp/terraform-plugin-go/tfprotov5/tf5server.(*server).ApplyResourceChange(0xc000000e60, {0x16b2a48?, 0xc000c768d0?}, 0xc0002e6d90)
12:03:57    github.com/hashicorp/terraform-plugin-go@v0.14.3/tfprotov5/tf5server/server.go:818 +0x574
12:03:57  github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5._Provider_ApplyResourceChange_Handler({0x13fd960?, 0xc000000e60}, {0x16b2a48, 0xc000c768d0}, 0xc0002e6d20, 0x0)
12:03:57    github.com/hashicorp/terraform-plugin-go@v0.14.3/tfprotov5/internal/tfplugin5/tfplugin5_grpc.pb.go:385 +0x170
12:03:57  google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004ea000, {0x16d6c80, 0xc0007216c0}, 0xc000c607e0, 0xc0007def00, 0x1fbd180, 0x0)
12:03:57    google.golang.org/grpc@v1.56.3/server.go:1335 +0xdf0
12:03:57  google.golang.org/grpc.(*Server).handleStream(0xc0004ea000, {0x16d6c80, 0xc0007216c0}, 0xc000c607e0, 0x0)
12:03:57    google.golang.org/grpc@v1.56.3/server.go:1712 +0xa2f
12:03:57  google.golang.org/grpc.(*Server).serveStreams.func1.1()
12:03:57    google.golang.org/grpc@v1.56.3/server.go:947 +0xca
12:03:57  created by google.golang.org/grpc.(*Server).serveStreams.func1
12:03:57    google.golang.org/grpc@v1.56.3/server.go:958 +0x15c
12:03:57  
12:03:57  Error: The terraform-provider-dynatrace_v1.47.2 plugin crashed!
12:03:57  
12:03:57  This is always indicative of a bug within the plugin. It would be immensely
12:03:57  helpful if you could report the crash with the plugin's maintainers so that it
12:03:57  can be fixed. The output above should help diagnose the issue.
12:03:57  
Dynatrace-Reinhard-Pilz commented 11 months ago

Ok, the good news about this error: it already tells me that the HTTP request was never made; an error was returned instead. The bad news: the debugging code wasn't prepared for a situation where no HTTP response is present, hence the plugin crash.

I will have to push out yet another code change in order to deal with that. Odds are, however, that the error message the provider is getting back here is nothing else than the original read tcp 10.0.2.100:35668->23.22.184.182:443: read: connection reset by peer.

If time allows, I will release v1.47.3 later today so you're able to run another test with the environment variables.
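
For readers following the stack trace, the failure mode can be sketched as a logging `http.RoundTripper` that must not touch the response when the transport returns an error. This is a minimal illustration and not the provider's actual `MyRoundTripper`; the type names `loggingRoundTripper` and `failingTransport` and the helper `demo` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// loggingRoundTripper wraps another RoundTripper and records one line per request.
type loggingRoundTripper struct {
	next http.RoundTripper
	log  func(string)
}

func (t *loggingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.next.RoundTrip(req)
	if err != nil {
		// On transport failures resp is nil, so reading resp.StatusCode or
		// resp.Body here would cause a nil pointer dereference.
		t.log(fmt.Sprintf("%s %s failed: %v", req.Method, req.URL, err))
		return nil, err
	}
	t.log(fmt.Sprintf("%s %s -> %d", req.Method, req.URL, resp.StatusCode))
	return resp, nil
}

// failingTransport simulates a connection that is reset before any response arrives.
type failingTransport struct{}

func (failingTransport) RoundTrip(*http.Request) (*http.Response, error) {
	return nil, errors.New("read: connection reset by peer")
}

// demo sends one request through the wrapper and returns what was logged.
func demo() ([]string, *http.Response, error) {
	var lines []string
	rt := &loggingRoundTripper{
		next: failingTransport{},
		log:  func(s string) { lines = append(lines, s) },
	}
	req, err := http.NewRequest(http.MethodPost, "https://example.invalid/platform/automation/v1/workflows", nil)
	if err != nil {
		return nil, nil, err
	}
	resp, err := rt.RoundTrip(req)
	return lines, resp, err
}

func main() {
	lines, resp, err := demo()
	fmt.Println("response is nil:", resp == nil)
	fmt.Println("error:", err)
	fmt.Println("log lines:", lines)
}
```

The error branch returns before any field of the response is read, which is the guard a debugging wrapper needs when the connection fails before a response exists.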

Dynatrace-Reinhard-Pilz commented 11 months ago

Hello @mattBaumBeneva, v1.47.3 has been released. Whenever you have time, please re-run with the environment variables. The provider shouldn't crash in this case anymore.

mattBaumBeneva commented 11 months ago

@Dynatrace-Reinhard-Pilz , I have emailed the entire log to your Dynatrace address (copied from our Jenkins output log).

Dynatrace-Reinhard-Pilz commented 11 months ago

Thanks a lot for the logs. I believe I found something that's worth investigating:

In our earlier ticket, where fetching the OAuth Bearer Token failed, the root cause was, if I'm not mistaken, that the Jenkins workers were not able to reach out to sso.dynatrace.com. Once that was opened up, things went back to normal.

With the resource dynatrace_automation_workflow, yet another host name comes into play. Unlike most other resources, which connect to https://<tenantid>.live.dynatrace.com, the REST API for Gen3 functionality requires Terraform to connect to https://<tenantid>.apps.dynatrace.com.

The logs are unfortunately cluttered with HTTP traffic from various other resources, so it's not easy to spot. But after some automated cross-matching of expected response content, it became clear that the only requests without response content in the logs are the ones addressing https://<tenantid>.apps.dynatrace.com (which in your case are just the workflows at the moment).

Can you check with your networking team about that theory?
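
To test that theory from the host that runs Terraform, a small Go sketch like the following attempts a TLS handshake against each host name mentioned in this thread. The function names and the command-line interface are hypothetical, and a successful handshake is only a coarse signal: a firewall could still interfere later in the connection.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"os"
	"time"
)

// requiredHosts lists the host names discussed in this thread for a given tenant:
// the SSO host for OAuth2, the classic environment host, and the apps host.
func requiredHosts(tenantID string) []string {
	return []string{
		"sso.dynatrace.com",
		tenantID + ".live.dynatrace.com",
		tenantID + ".apps.dynatrace.com",
	}
}

// checkTLS opens a TCP connection to host:443 and completes a TLS handshake.
func checkTLS(host string, timeout time.Duration) error {
	dialer := &net.Dialer{Timeout: timeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", net.JoinHostPort(host, "443"), nil)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: <program> <tenant-id>")
		return
	}
	for _, host := range requiredHosts(os.Args[1]) {
		if err := checkTLS(host, 5*time.Second); err != nil {
			fmt.Printf("FAIL %s: %v\n", host, err)
			continue
		}
		fmt.Printf("ok   %s\n", host)
	}
}
```

A TLS handshake is used rather than a bare TCP connect because the original error was a reset on read, which suggests the TCP connection itself may have been established before traffic was dropped.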

mattBaumBeneva commented 11 months ago

I was able to confirm that those flows don't pass our firewall. I have requested that they be allowed, and I will respond here once I have been able to test. If possible, I think the provider should catch such errors and report them to the user along with the failing URL (for future customers who may run into this issue).
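
As a rough illustration of that suggestion, the following Go sketch recognizes a connection reset in an error returned by `net/http` and turns it into a hint that names the failing host. It is an assumption about how such detection could look, not the provider's implementation; `explainReset` and `simulatedReset` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"os"
	"syscall"
)

// explainReset returns a user-facing hint if err wraps a connection reset.
// The second return value is false for any other error (or nil).
func explainReset(err error) (string, bool) {
	if err == nil || !errors.Is(err, syscall.ECONNRESET) {
		return "", false
	}
	target := "the requested host"
	// net/http wraps transport failures in *url.Error, which carries the URL.
	var urlErr *url.Error
	if errors.As(err, &urlErr) {
		if u, parseErr := url.Parse(urlErr.URL); parseErr == nil && u.Host != "" {
			target = u.Host
		}
	}
	return fmt.Sprintf("connection to %s was reset; check that firewalls and proxies allow HTTPS traffic to this host", target), true
}

// simulatedReset builds an error shaped like the one reported in this issue.
func simulatedReset() error {
	return &url.Error{
		Op:  "Post",
		URL: "https://xcz80425.apps.dynatrace.com/platform/automation/v1/workflows",
		Err: &net.OpError{
			Op:  "read",
			Net: "tcp",
			Err: os.NewSyscallError("read", syscall.ECONNRESET),
		},
	}
}

func main() {
	if hint, ok := explainReset(simulatedReset()); ok {
		fmt.Println(hint)
	}
}
```

Using `errors.Is` and `errors.As` rather than matching on the message text keeps the check independent of how the operating system words the error.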

mattBaumBeneva commented 11 months ago

@Dynatrace-Reinhard-Pilz , we have allowed these connections to pass through our firewall. This corrected the issue and we have successfully applied the test workflow above. I am closing this issue as resolved. Thanks!

Dynatrace-Reinhard-Pilz commented 11 months ago

That's good news. And yes, I'm already working on recognizing that situation automatically so it doesn't require debug logs to get turned on.