aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

OpenSearch: Bug in Describe-Domain API is causing CFN GetAtt "Internal error occurred" #18239

Closed automartin5000 closed 2 years ago

automartin5000 commented 2 years ago

What is the problem?

When you create an OpenSearch Domain with a VPC and then attempt to reference that endpoint in the AWS CDK (thereby creating a GetAtt reference in CloudFormation), the Domain creates successfully, but then the CloudFormation resource (Fargate) that attempts to reference the endpoint returns an "Internal error occurred" (see attached screenshot). Additional findings from research detailed in "Other information" below.

Screen Shot 2022-01-02 at 21 45 28

Reproduction Steps

# Module-level import (assuming CDK v2, per the CLI version below):
from aws_cdk import aws_opensearchservice as opensearch

# Inside the stack's __init__:
self.opensearch_domain = opensearch.Domain(self, "OpenSearchIndices",
    **opensearch_params,
    version=opensearch.EngineVersion.OPENSEARCH_1_0,
    vpc=self.scope.network_stack.vpc,
    logging={
        "slow_search_log_enabled": True,
        "app_log_enabled": True,
        "slow_index_log_enabled": True
    },
    encryption_at_rest={
        "enabled": True
    },
    zone_awareness=opensearch.ZoneAwarenessConfig(
        enabled=True,
        availability_zone_count=zone_count
    ),
    removal_policy=self.data_resources_removal_policy
)
# Referencing the endpoint is what produces the Fn::GetAtt in CloudFormation.
self.opensearch_endpoint = self.opensearch_domain.domain_endpoint

What did you expect to happen?

All resources are created successfully.

What actually happened?

The CloudFormation stack rolled back due to a resource creation failure. (Screenshot from above re-attached here)

Screen Shot 2022-01-02 at 21 45 28

CDK CLI Version

2.3

Framework Version

No response

Node.js Version

16.13.1

OS

Mac OS 12.1

Language

Python

Language Version

3.10.1

Other information

I noticed that I didn't have this problem when creating a public OpenSearch Domain. So I thought it might have something to do with how the API is returning domain endpoints with Domains created in a VPC vs public Domains.

I created a public Domain and then ran aws opensearch describe-domain against both the Domain created with the CDK and the test public Domain. Here were the results:

# Public Domain
~ % aws opensearch describe-domain --domain-name test | jq '.DomainStatus.Endpoint'
"search-test-xxxxxxxxx.us-east-1.es.amazonaws.com"
# Domain in VPC
~ % aws opensearch describe-domain --domain-name dataindic-xxxxxxxxx | jq '.DomainStatus.Endpoint' 
null
~ % aws opensearch describe-domain --domain-name dataindic-xxxxx | jq '.DomainStatus.Endpoints'
{
  "vpc": "vpc-xxxxxxx-yyyyyyy-zzzzzzzz.us-east-1.es.amazonaws.com"
}

As you can see, the Endpoint value is null for Domains in a VPC. Instead, the value appears under what looks like a new key called "Endpoints". It appears that either CloudFormation wasn't updated to read the "Endpoints" key, or OpenSearch should also be populating "Endpoint" for Domains in a VPC.

I understand that this might be a CloudFormation or OpenSearch bug, but until those teams sort it out, it's obviously a bug in the AWS CDK. And it seems like this is something the CDK could maybe work around for the time being with a custom resource. Example:

import boto3

opensearch_client = boto3.client('opensearch')
opensearch_domain_details = opensearch_client.describe_domain(
    DomainName=aws_opensearch_domain_name
)['DomainStatus']
# Public Domains populate 'Endpoint'; Domains in a VPC populate 'Endpoints' instead.
opensearch_endpoint = opensearch_domain_details.get('Endpoint') or \
    opensearch_domain_details.get('Endpoints', {}).get('vpc')

peterwoodworth commented 2 years ago

Thanks for the very thorough detail @automartin5000.

I've reported this internally, tracking: V498467686

peterwoodworth commented 2 years ago

Hey @automartin5000, according to the teams internally this bug has been fixed. Please try to redeploy your stack. If you still run into this error, would you be able to share your stack ID so that the teams can troubleshoot?

github-actions[bot] commented 2 years ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

automartin5000 commented 2 years ago

Thanks so much @peterwoodworth for the fast action on this! I'll give this a test early next week and see if it works.

automartin5000 commented 2 years ago

Ok, so I was actually just able to test this and not only is it still failing, there's another new concerning error.

For some reason, CloudFormation felt that it needed to update the IAM policies of the task definitions. Upon attempting to do so, and pulling the OpenSearch ARN that I previously had to add to all the IAM policies so our containers could manually describe-domain and get the OpenSearch endpoint, CloudFormation threw a new Internal error: Unable to retrieve Arn attribute for AWS::OpenSearchService::Domain, with error message Internal error occurred. See screenshot.

Screen Shot 2022-01-09 at 01 11 31

And as I said, getting the domain endpoint is still failing, see screenshot.

Screen Shot 2022-01-09 at 01 36 09

peterwoodworth commented 2 years ago

I'm so sorry this fell off my radar. @automartin5000, I assume you're still running into this issue, yes? I'll follow up with the teams internally.

peterwoodworth commented 2 years ago

Oh, and if you're able to provide your stack ID here that would be a big help for the teams internally, but I understand if you aren't able to.

automartin5000 commented 2 years ago

Thanks for the follow-up on this @peterwoodworth. The last time I checked was when I posted above, and it was still broken. I don't think those stacks exist anymore, so I don't have a stack ID to provide. But if the engineering team is convinced the bug is fixed, I can try to test it again sometime in the next week.

peterwoodworth commented 2 years ago

That would be much appreciated @automartin5000, thank you 🙂

stowns commented 2 years ago

having the same issue here. deploying AWS::OpenSearchService::Domain to a VPC is resulting in Internal Failure. Cfn rolls back and the Domain eventually is provisioned successfully

stowns commented 2 years ago

FYI, I had this error several times yesterday and after making no changes this just worked today. It appears to be a transient Internal Failure that the Service team(s) need to take a look at

peterwoodworth commented 2 years ago

Thank you very much @stowns, I'll pass this information on

peterwoodworth commented 2 years ago

Hey @automartin5000 and @stowns, if either of you are able to share a stack id of even a deleted stack that would be very helpful in finding the root cause of the issue. If you don't feel comfortable doing that, is it possible to open a support case directly with aws and share the stack id there?

automartin5000 commented 2 years ago

Ok, I just created a stack and the OpenSearch Endpoint was populated correctly in the task definition. So seems like it's fixed now.

edgar-slalom commented 2 years ago

This is happening to me. The OpenSearch is not in a VPC but I'm getting that error. This is through SAM/CloudFormation.

image

What are we supposed to do?

jbrown commented 2 years ago

@peterwoodworth I'm also seeing this, specifically in CDK when I reference the domainEndpoint (ex. for the environment of a lambda function) it fails with Unable to retrieve DomainEndpoint attribute for AWS::OpenSearchService::Domain, with error message Internal error occurred. The search domain is not in a VPC, and the stack works if I don't try to reference the domain. Stack ID: arn:aws:cloudformation:us-east-1:261515715507:stack/dev-nmls-backend/dc6198a0-8d26-11ec-af7a-12770bae9a0b

peterwoodworth commented 2 years ago

Thank you very much for providing a stack arn @jbrown, there's a ticket open internally for the service team to investigate and fix this, this will be a big help in finding the root cause 🙂

jbrown commented 2 years ago

@peterwoodworth Upon further investigation this morning it seems (in my case) there's one specific way this produces an error and it may actually be in the serverless-stack code. I'll report back here after I hear from them.

lxhunter commented 2 years ago

I had the same error when I changed VolumeSize: 20 to VolumeSize: 5.

I do not know if that helps...

jwang1048 commented 2 years ago

Hi, I saw a similar error and opened an AWS support case (but didn't get my problem resolved there). I am able to reliably reproduce the error with a small snippet of (mostly) vanilla CDK code:

import * as os from "monocdk/aws-opensearchservice";
import * as lambda from "monocdk/aws-lambda";
import {SecurityGroup, SubnetType, Vpc} from "monocdk/aws-ec2";
import {App, RemovalPolicy, Stack, Tags} from "monocdk";

// DeploymentStack and TestStackProps are not defined in this snippet (company-specific code).
export class TestStack extends DeploymentStack {
    constructor(parent: App, id: string, props: TestStackProps) {
        super(parent, id, { /* company specific boilerplate */ });
        const vpc = new Vpc(this, 'TheVPC', {
            cidr: "10.1.0.0/16",
            maxAzs: 1
        });
        const devDomain = new os.Domain(this, 'Domain', {
            version: os.EngineVersion.ELASTICSEARCH_7_10,
            vpc: vpc,
            enforceHttps: true,
            removalPolicy: RemovalPolicy.DESTROY,
        });
        // Each function references the domain endpoint twice: once in its
        // environment and once in a tag. Both become Fn::GetAtt references.
        for (let i = 0; i < 10; i++) {
            const func = new lambda.Function(this, "Func" + i, {
                runtime: lambda.Runtime.PYTHON_3_8,
                handler: "test",
                environment: {
                    "ES_ENDPOINT": devDomain.domainEndpoint,
                },
                vpc: vpc,
                allowPublicSubnet: false,
                code: new lambda.InlineCode(`import os
def test(event, context):
  print(event)
  return os.environ`)
            });
            Tags.of(func).add("endpoint", devDomain.domainEndpoint);
            //Tags.of(func).add("arn", devDomain.domainArn);
        }
    }
}

Looking at the CloudTrail logs, it appears that there are throttling exceptions on the ListTags and DescribeDomain operations. Most likely the failure was caused by throttling of this DescribeDomain request (the last one). I was unable to find any more details about the requests.

{
    "eventVersion": "1.08",
    "userIdentity": {/*removed*/},
    "eventTime": "2022-02-24T13:52:31Z",
    "eventSource": "es.amazonaws.com",
    "eventName": "DescribeDomain",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "ThrottlingException",
    "errorMessage": "Rate exceeded",
    "requestParameters": null,
    "responseElements": null,
    "requestID": "c70a9982-c6a1-4cfe-861a-53667e144e50",
    "eventID": "ee0d4410-a20c-4903-87f4-d5cda62c000c",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "970187786794",
    "eventCategory": "Management"
}
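
To check a stack for the same pattern, a minimal Python sketch like the one below can tally throttled calls per API. It assumes the CloudTrail records have already been exported as a list of dicts shaped like the record above; the function name `count_throttled_calls` and the sample records are hypothetical and not part of any AWS SDK.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_throttled_calls(events: Iterable[Mapping]) -> Counter:
    """Count records whose errorCode is ThrottlingException, keyed by eventName."""
    counts: Counter = Counter()
    for event in events:
        if event.get("errorCode") == "ThrottlingException":
            counts[event.get("eventName", "unknown")] += 1
    return counts


# Hypothetical sample records, trimmed to the two fields the function reads.
sample_events = [
    {"eventName": "DescribeDomain", "errorCode": "ThrottlingException"},
    {"eventName": "ListTags", "errorCode": "ThrottlingException"},
    {"eventName": "DescribeDomain", "errorCode": "ThrottlingException"},
    {"eventName": "DescribeDomain"},  # a successful call carries no errorCode
]

throttled = count_throttled_calls(sample_events)
print(throttled.most_common())
```

A count that grows with the number of resources referencing the domain would be consistent with the throttling explanation.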

Additionally, it does appear to be a throttling issue on API calls, as reducing the number of Lambda functions from 10 to 1 allows the stack to create successfully. I am not entirely sure how CloudFormation does the tagging, but it appears to make an API call for each tag, which could be quite a lot for a Lambda function in a VPC (associated with a security group and IAM role at least).

Stack ID: arn:aws:cloudformation:us-west-2:970187786794:stack/TestStack-beta-us-west-2/cdc86410-94e0-11ec-a667-023be3ac2b21

(Issue is reproducible in us-east-1 as well).

The workaround I'm currently trying with success is reducing the number of Fn::GetAtt calls to the domain resource by eliminating excess tags. Maybe you could also try to "spread out" the API requests by interleaving them with resources that take longer to create using CDK's dependency mechanism.

peterwoodworth commented 2 years ago

Yes, the service team has gotten back to me and confirmed for the stack arns available, that all of them were due to the throttling limits.

I'm working with them on getting the error message improved, to make it clear to the user what the failure is being caused by

jwang1048 commented 2 years ago

@peterwoodworth I think this is more than a non-informative error message (and user education) - it does not occur when using the AWS::Elasticsearch::Domain resource where I am able to create at least 15 Lambda functions with tags. There are also throttling errors in CloudTrail there, but the difference is that the service doesn't "give up" creating the resource on the first throttle. There needs to be a retry policy with spaced out requests, as the issue fails some deployments 100% of the time.

Example of throttling on the DescribeElasticsearchDomain API call (using the old AWS::Elasticsearch::Domain):

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
         // Additional fields removed
        "invokedBy": "cloudformation.amazonaws.com"
    },
    "eventTime": "2022-02-24T15:19:32Z",
    "eventSource": "es.amazonaws.com",
    "eventName": "DescribeElasticsearchDomain",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "ThrottlingException",
    "errorMessage": "Rate exceeded",
    "requestParameters": null,
    "responseElements": null,
    "requestID": "a673e667-0d7b-4fc7-936a-7268d2189eae",
    "eventID": "bff22dd7-709c-4e58-8cef-d86fd3e5e726",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "970187786794",
    "eventCategory": "Management"
}

TinoSM commented 2 years ago

In my case it was caused because I was passing the endpoint URL to a lambda (CustomResource) on each call (which I guess caused resolution of the attribute multiple times + throttling).

I changed it to an environment variable in the lambda and it's solved now.

PS: However I think this should be fixed in AWS/CDK-side by implementing retries with backoff (or whatever the mechanism), I can't apply this solution for all our use-cases and also this essentially limits the amount of resources you can deploy within your CDK...
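
For reference, "retries with backoff" could look roughly like the Python sketch below. This is only an illustration of the general technique (exponential backoff with jitter), not what CloudFormation does internally; `ThrottlingError`, `call_with_backoff` and `flaky_describe_domain` are hypothetical names, and the simulated call stands in for a real DescribeDomain request.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottlingError(Exception):
    """Hypothetical stand-in for a 'Rate exceeded' response from a service API."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ThrottlingError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # The delay ceiling doubles on each attempt; the random jitter
            # keeps concurrent callers from retrying in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


attempts = {"n": 0}


def flaky_describe_domain() -> str:
    # Simulated call: throttled twice, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ThrottlingError("Rate exceeded")
    return "search-test-xxxxxxxxx.us-east-1.es.amazonaws.com"


# sleep is replaced with a no-op so the demonstration runs instantly.
endpoint = call_with_backoff(flaky_describe_domain, sleep=lambda _: None)
print(endpoint, "after", attempts["n"], "attempts")
```

The point of the sketch is that a single throttled read does not have to fail the whole operation, which is the behavior being asked for here.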

automartin5000 commented 2 years ago

> Yes, the service team has gotten back to me and confirmed for the stack arns available, that all of them were due to the throttling limits.
>
> I'm working with them on getting the error message improved, to make it clear to the user what the failure is being caused by

@peterwoodworth I could imagine this being our issue too, although we only have 5 task definitions using that value. This seems like something CloudFormation should just auto-retry?

spullara commented 2 years ago

Having this issue as well because of throttling as I have many lambdas.

An error occurred: VideobarcodeLambdaFunction - Unable to retrieve DomainEndpoint attribute for AWS::OpenSearchService::Domain, with error message Internal error occurred..

SamStephens commented 2 years ago

@peterwoodworth any news on an actual fix for this (retries with backoff or some such), as @TinoSM and @automartin5000 recommend?

I've just hit this in the middle of updating an Opensearch cluster from Elasticsearch 7.10 to Opensearch 1.2, and it's an AWFUL experience, as the rollback fails each time because upgrading from Elasticsearch 7.10 to Opensearch 1.2 cannot be reversed. I'm having to skip rolling back the cluster, and then try to roll forward again until, hopefully, I eventually manage a deploy without hitting throttling limits.

As the others say, this is certainly something that should be handled AWS side, we have no control over Cloudformation and the operations it performs, we should not be having to account for Cloudformation failing to deal reasonably with throttling. Honestly, I'm somewhat shocked that Cloudformation doesn't have global handling of the backoffs and retries needed for dealing with throttling limits.

zessx commented 2 years ago

To avoid hitting CloudFormation limits, is there a way in CDK to resolve the domain endpoint once into a variable, instead of resolving it once per task definition / lambda?

spullara commented 2 years ago

> To avoid hitting CloudFormation limits, is there a way in CDK to resolve the domain endpoint once into a variable, instead of resolving it once per task definition / lambda?

The API calls are being done by CloudFormation and not the CDK, so I think this has to be fixed upstream in that system.

jwang1048 commented 2 years ago

You can work around this issue by using CDK's dependency mechanism to slow down the API requests. If you have multiple resources (e.g. A, B, C, D) accessing OpenSearch Domain attributes, you can make them execute one after the other (rather than simultaneously) with D.node.addDependency(C); C.node.addDependency(B); B.node.addDependency(A); - see details in the docs.

You can also attempt more drastic solutions like using a Lambda custom resource or Systems Manager parameters, but I think the dependency mechanism is the simplest way to do it.
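
The dependency chain described above can be sketched in Python as follows. The helper name `chain_dependencies` is hypothetical; it assumes each item is a CDK construct exposing `node.add_dependency(...)`, the Python spelling of `node.addDependency`. The `_FakeNode` and `_FakeConstruct` classes are stand-ins so the sketch runs without the CDK installed; in a real stack you would pass your Lambda functions or task definitions instead.

```python
from typing import Sequence


def chain_dependencies(resources: Sequence) -> None:
    """Make each resource depend on the one before it, so they deploy in sequence.

    Assumes every item exposes `.node.add_dependency(...)` like a CDK construct.
    """
    for earlier, later in zip(resources, resources[1:]):
        later.node.add_dependency(earlier)


# Stand-in objects (hypothetical) that mimic the one method used above.
class _FakeNode:
    def __init__(self) -> None:
        self.dependencies = []

    def add_dependency(self, *deps) -> None:
        self.dependencies.extend(deps)


class _FakeConstruct:
    def __init__(self, name: str) -> None:
        self.name = name
        self.node = _FakeNode()


a, b, c, d = (_FakeConstruct(name) for name in "ABCD")
chain_dependencies([a, b, c, d])
print([dep.name for dep in d.node.dependencies])
```

The trade-off is deployment time: resources that would otherwise be created in parallel are now created one after another.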

Some time ago, @peterwoodworth asked for an update on the tracking ticket but there is no update to share at this time. I don't work for AWS (hence not involved in the prioritization of issues) but adding a +1 to this issue may help to get a faster resolution.

zessx commented 2 years ago

> You can work around this issue by using CDK's dependency mechanism to slow down the API requests.

Node dependencies are a great workaround, I've been able to use them and avoid hitting limits. It really slows down my deploy when I need to refresh my task definitions (as I've got dozens of them), but in the end I still gain time.

I was not aware of this feature, and it's really, REALLY interesting for some other use cases of mine, thanks!

peterwoodworth commented 2 years ago

I've been told that you will now receive a proper error message in the case of throttling. Can anyone here confirm this is the case?

SamStephens commented 2 years ago

@peterwoodworth said:

> I've been told that you will now receive a proper error message in the case of throttling. Can anyone here confirm this is the case?

Most of us currently following here have some form of workaround for this issue in place, and I don't think any of us will be removing that workaround until this issue is fixed properly. We will not be removing our workarounds, because we cannot expose our stacks to non-deterministic failures. I've already described the experience I had where the rollback failed because I hit this error during a non-reversible upgrade. A decent error message would not have helped me get out of the awful position this Cloudformation defect left me in.

A proper error message is better than nothing. However throttling is an implementation detail of the deployments Cloudformation does that we should not be exposed to as users at all. The abstraction is leaking.

peterwoodworth commented 2 years ago

I agree that I'd like to have the root of the issue fixed. This has been frustratingly difficult to communicate with the opensearch team, I'll reiterate these thoughts again with them

SamStephens commented 2 years ago

> I agree that I'd like to have the root of the issue fixed. This has been frustratingly difficult to communicate with the opensearch team, I'll reiterate these thoughts again with them

Thanks @peterwoodworth.

Do you think it's worth getting the Cloudformation team involved? It feels to me like Cloudformation should have unified handling of throttling across all services, rather than every service having to solve this issue in isolation.

Also this is an impact on Cloudformation customers and the Cloudformation experience. Even if responsibility does end up lying with the Opensearch team, the Cloudformation team will better understand how this hurts customers, and can then hold the Opensearch team accountable.

peterwoodworth commented 2 years ago

Good news! The proper fix to this should be rolling out in the next couple weeks

peterwoodworth commented 2 years ago

The fix to this is live - CloudFormation should be retrying before throwing an error due to throttling now (which will be hidden from the user). Please let me know if anyone is still running into this issue, thanks all for your patience here 🙂

zessx commented 2 years ago

I've been able to test the fix this morning with great success, many thanks @peterwoodworth !

I've got around 30 ECS task definitions in a stack and at least one Describe-Domain call per definition (there may be multiple calls, not sure how it's handled by CF). I changed an environment variable to cause all of them to be deployed at once. Everything went fine, I had no warning/error at all. This would have failed every single time before the fix.

Would be nice to get this fix tested on other stacks, but as for me I consider this issue solved 🎉

peterwoodworth commented 2 years ago

This is wonderful to hear! Glad this is working for you now 🙂

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.