DataDog / datadog-cloudformation-resources

Apache License 2.0

AWS integration fails to create the integration and rolls back cloudformation stack with Internal failure. #236

Open dogfish182 opened 1 year ago

dogfish182 commented 1 year ago

Describe the bug: AWS integration fails with an obscure error.

To Reproduce: Run a template that looks like this:

Resources:
  DatadogAWSDatadogIntegrationAWS:
    Type: Datadog::Integrations::AWS
    Properties:
      AccountID: '123123123123'
      RoleName: shared-datadog-aws-integration
    Metadata:
      aws:cdk:path: mystack/DatadogAWSDatadogIntegrationAWS
  DatadogRoleF31A7099:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId:
                  Fn::Join:
                    - ''
                    - - '{{resolve:secretsmanager:arn:'
                      - Ref: AWS::Partition
                      - :secretsmanager:eu-west-1:123123123123:secret:DatadogIntegrationExternalID:SecretString:::}}
            Effect: Allow
            Principal:
              AWS: arn:aws:iam::464622532012:root
        Version: '2012-10-17'
      Description: Datadog integration for aws monitoring
      PermissionsBoundary:
        Fn::Join:
          - ''
          - - 'arn:aws:iam::'
            - Ref: AWS::AccountId
            - :policy/base-permissions-boundary
      RoleName: shared-datadog-aws-integration
      Tags:
        - Key: tag
          value: tag
    DependsOn:
      - DatadogAWSDatadogIntegrationAWS
    Metadata:
      aws:cdk:path: mystack/DatadogRole/Resource
  DatadogRolePolicy6CE03EE3:
    Type: AWS::IAM::Policy
    Properties:
      PolicyDocument:
        Statement:
          - Action:
              - alldatadogstuffasperdocs
            Effect: Allow
            Resource: '*'
        Version: '2012-10-17'
      PolicyName: shared-datadog-integration-policy
      Roles:
        - Ref: DatadogRoleF31A7099

Logs

1:36:58 PM | CREATE_FAILED        | Datadog::Integrations::AWS                  | DatadogAWSDatadogIntegrationAWS
Resource handler returned message: "" (RequestToken: 16b2f5a7-3d09-738e-76ae-33db3a6ad5b8, HandlerErrorCode: InternalFailure)

 ❌  mystack failed: Error: The stack named mystack failed to deploy: UPDATE_ROLLBACK_COMPLETE: Resource handler returned message: "" (RequestToken: 16b2f5a7-3d09-738e-76ae-33db3a6ad5b8, HandlerErrorCode: InternalFailure)
    at FullCloudFormationDeployment.monitorDeployment (/Users/me/code/place/project/node_modules/aws-cdk/lib/api/deploy-stack.ts:505:13)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at deployStack2 (/Users/me/code/place/project/node_modules/aws-cdk/lib/cdk-toolkit.ts:265:24)
    at /Users/me/code/place/project/node_modules/aws-cdk/lib/deploy.ts:39:11
    at run (/Users/me/code/place/project/node_modules/p-queue/dist/index.js:163:29)

 ❌ Deployment failed: Error: Stack Deployments Failed: Error: The stack named mystack failed to deploy: UPDATE_ROLLBACK_COMPLETE: Resource handler returned message: "" (RequestToken: 16b2f5a7-3d09-738e-76ae-33db3a6ad5b8, HandlerErrorCode: InternalFailure)
    at deployStacks (/Users/me/code/place/project/node_modules/aws-cdk/lib/deploy.ts:61:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at CdkToolkit.deploy (/Users/me/code/place/project/node_modules/aws-cdk/lib/cdk-toolkit.ts:339:7)
    at initCommandLine (/Users/me/code/place/project/node_modules/aws-cdk/lib/cli.ts:374:12)

Stack Deployments Failed: Error: The stack named mystack failed to deploy: UPDATE_ROLLBACK_COMPLETE: Resource handler returned message: "" (RequestToken: 16b2f5a7-3d09-738e-76ae-33db3a6ad5b8, HandlerErrorCode: InternalFailure)

Expected behavior: The CloudFormation stack should run to completion. I expect the account integration to enable the account in Datadog (this does occur). I expect the secret to be written to Secrets Manager (this does NOT occur). I expect my role, which pulls the secret from Secrets Manager, to be created (this does NOT occur).

Environment and Versions (please complete the following information): Datadog AWS Integration 2.2.1. I am generating the CloudFormation via CDK v2; however, I doubt this is relevant, as I've included the generated CloudFormation template above (which is what runs and fails).

Additional context: It essentially looks like the CloudFormation handler is swallowing the error, which makes this very hard to troubleshoot. I've also logged a ticket with Datadog support.

dogfish182 commented 1 year ago

To put a bit more context on this issue: I'm confused by the Datadog instructions on how to set up this integration (and have a support ticket open).

This page https://github.com/DataDog/cloudformation-template/tree/master/aws

^^ says it will set up Datadog for you; however, one of the first steps is to manually provision your accounts in Datadog and copy the external ID as a parameter before you manually run CloudFormation (not really doable at any kind of scale). At the end of the doc it says you can use THIS integration if you wish to manage the integration. That seems like circular logic: if I have already set it up manually, isn't it unmanaged now?

What I would like to achieve is to use this integration, which creates the Datadog-side resources, and then create the AWS-side resource myself, feeding the external ID into the role I'm creating by reading the Secrets Manager entry that this extension writes.

Has anyone been able to achieve this?

github-actions[bot] commented 1 year ago

Thanks for your contribution!

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

If you would like this issue to remain open:

  1. Verify that you can still reproduce the issue in the latest version of this project.

  2. Comment that the issue is still reproducible and include updated details requested in the issue template.

dogfish182 commented 1 year ago

I can still reproduce this issue as shown in the original post.

flavioelawi commented 1 year ago

We are facing the same error (although on Monitors and Dashboards). We have a support case open with AWS.

dogfish182 commented 1 year ago

> We are facing the same error (although on Monitors and Dashboards). We have a support case open with AWS.

I did the same, and they told us we need to contact Datadog, as the error is being swallowed by the custom CloudFormation resource handler.

flavioelawi commented 1 year ago

Thanks, we just did the same. Let's see what happens.

skarimo commented 1 year ago

Thanks for opening this issue. We are going to merge and release the change https://github.com/DataDog/datadog-cloudformation-resources/pull/258, which should catch any unhandled exceptions in the resources themselves.

However, this won't expose all errors, mainly because AWS obfuscates logs/events quite heavily on their end, so things such as a bad type configuration or a bad execution role would still fail in non-obvious ways. I suspect that is the reason for the failures you are seeing with dashboards and monitors, @flavioelawi.
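To illustrate what catching unhandled exceptions in a resource can look like, here is a minimal Python sketch. It is not the code from the linked PR: `report_unhandled`, `create_handler`, and the dictionary standing in for the handler's result are all hypothetical names and shapes chosen for illustration. The idea is only that an exception escaping the handler is turned into a failed result carrying a readable message, instead of surfacing with an empty message.

```python
import functools


def report_unhandled(handler):
    """Wrap a handler so unhandled exceptions become a FAILED result.

    The returned dict is a hypothetical stand-in for whatever progress
    object a real resource handler returns.
    """
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        except Exception as exc:
            # Keep the exception type and text so the caller sees
            # something more useful than an empty message.
            return {
                "status": "FAILED",
                "errorCode": "InternalFailure",
                "message": f"{type(exc).__name__}: {exc}",
            }
    return wrapper


@report_unhandled
def create_handler(properties):
    # Hypothetical handler body that fails unexpectedly.
    raise ValueError("example failure")


result = create_handler({})
print(result["message"])  # prints "ValueError: example failure"
```

As the comment above notes, a wrapper like this only helps with errors raised inside the handler; failures that occur before the handler runs (for example, a bad execution role) never reach it.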

flavioelawi commented 1 year ago

We have resolved our issue;

Our execution role already had the correct trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "resources.cloudformation.amazonaws.com",
                    "cloudformation.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

And a policy allowing access to the secret and its KMS key:

        {
            "Action": [
                "secretsmanager:GetSecretValue",
                "kms:Decrypt",
                "kms:DescribeKey"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },

We also added the CloudWatchLogsFullAccess managed policy so that the integration can push logs to CloudWatch Logs (its log group is still empty, though, which I guess is a separate issue).
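When troubleshooting an execution role like the one described above, one quick sanity check is to list which service principals its trust policy allows to assume the role. A minimal Python sketch, assuming the policy document is available as a parsed dictionary; `trusted_services` is a hypothetical helper, not part of any AWS or Datadog library:

```python
def trusted_services(trust_policy):
    """Return the service principals allowed to call sts:AssumeRole."""
    services = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also valid
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. the wildcard "*"
            continue
        service = principal.get("Service", [])
        if isinstance(service, str):
            service = [service]
        services.update(service)
    return services


# The trust policy quoted in this comment.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "resources.cloudformation.amazonaws.com",
                    "cloudformation.amazonaws.com",
                ]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

print("resources.cloudformation.amazonaws.com" in trusted_services(policy))  # prints True
```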

The issue in our case was a typo in the dynamic reference: we were missing the SecretString part before the JSON attribute selector.

@dogfish182, in your case you are missing external_id at the end of your dynamic reference; this is the key that is set up by the integration lambda/code.
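For reference, a Secrets Manager dynamic reference in CloudFormation has the form `{{resolve:secretsmanager:secret-id:SecretString:json-key:version-stage:version-id}}`. The Python sketch below builds such a string and extracts its `json-key` segment, which makes the difference between the reference in the original template (empty `json-key`) and the suggested one visible. The helper names are hypothetical, the `aws` partition is assumed, and the `external_id` key name is taken from the comment above rather than verified against the integration.

```python
PREFIX = "{{resolve:secretsmanager:"
SUFFIX = "}}"


def secretsmanager_ref(secret_id, json_key="", version_stage="", version_id=""):
    """Build a Secrets Manager dynamic reference string."""
    if not secret_id:
        raise ValueError("secret_id is required")
    if version_stage and version_id:
        raise ValueError("specify version_stage or version_id, not both")
    segments = [secret_id, "SecretString", json_key, version_stage, version_id]
    return PREFIX + ":".join(segments) + SUFFIX


def json_key_of(ref):
    """Return the json-key segment of a reference ('' if absent)."""
    if not (ref.startswith(PREFIX) and ref.endswith(SUFFIX)):
        raise ValueError("not a Secrets Manager dynamic reference")
    inner = ref[len(PREFIX):-len(SUFFIX)]
    # The secret ID may be an ARN containing colons, so split on the
    # literal SecretString segment rather than counting colons.
    _, _, tail = inner.partition(":SecretString:")
    return tail.split(":")[0]


arn = ("arn:aws:secretsmanager:eu-west-1:123123123123:"
       "secret:DatadogIntegrationExternalID")

original = secretsmanager_ref(arn)                           # as in the template
suggested = secretsmanager_ref(arn, json_key="external_id")  # as suggested above

print(repr(json_key_of(original)))   # prints ''
print(repr(json_key_of(suggested)))  # prints 'external_id'
```

With an empty `json-key`, the reference resolves to the entire secret string; with a key, it resolves to that key's value within a JSON secret.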

Also some feedback:

skarimo commented 1 year ago

We released version 2.4.0 of the AWS resource, which should capture and return any unhandled exception in the resource itself. However, as mentioned previously, errors swallowed by AWS will probably still not be captured by this change, as they happen outside of the resource handler.