aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.54k stars 3.86k forks

(OpenSearch): (Upgrade 1.3 > 2.3 fails at domain update) #23794

Closed anthonyhartwig closed 1 month ago

anthonyhartwig commented 1 year ago

Describe the bug

Upgraded an OpenSearch cluster via CDK from 1.3 to 2.3. The cluster itself seems to upgrade perfectly fine, but CloudFormation then fails when attempting to update the domain with:

Resource handler returned message: "Invalid request provided: DP Nodes are OOS, Tags operation is not allowed"

Expected Behavior

The domain resource updates successfully.

Current Behavior

After the cluster has fully upgraded to 2.3, CloudFormation fails to update the domain with this error:

Resource handler returned message: "Invalid request provided: DP Nodes are OOS, Tags operation is not allowed"

Reproduction Steps

Modify domain from:

Domain = new Domain(this, "Domain", new DomainProps { Version = EngineVersion.OPENSEARCH_1_3, ... });

to

Domain = new Domain(this, "Domain", new DomainProps { Version = EngineVersion.OPENSEARCH_2_3, ... });

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.61.0

Framework Version

No response

Node.js Version

16.13.1

OS

Mac

Language

.NET

Language Version

6

Other information

No response

pahud commented 1 year ago

Thanks. I'll try to reproduce this in my account.

mason-fish commented 1 year ago

I also began encountering this after switching from a private to public vpc subnet. That succeeded but all subsequent deploys now fail with the above error, even if I try to remove the whole stack in my code. My domain was already on version 2.3 and my cdk version is 2.47.0 (update: just upgraded cdk to 2.65.0 with same result).

epiphone commented 1 year ago

This SO answer suggests that CDK execution role trust relationships are the problem: https://stackoverflow.com/a/75239517/1763012. Didn't work for me though.

jessedobbelaere commented 1 year ago

I am also experiencing this. The cluster updated just fine to 2.3, but CloudFormation ends up in an UPDATE_ROLLBACK_FAILED state with

Resource handler returned message: "Invalid request provided: DP Nodes are OOS, Tags operation is not allowed (Service: OpenSearch, Status Code: 400

When I use the "Continue update rollback" dialog and skip the OpenSearch resource, then on the next deploy the stack fails with "Internal server error". Basically makes the stack unusable. I've reported the issue to AWS support.
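For anyone who wants to spot this specific failure in a deploy script, here is a minimal sketch. It assumes you have already fetched the stack events by some other means; the `StackEventLike` shape and the `findOosTagFailures` helper are hypothetical names for illustration, not part of the CDK or the AWS SDK. The marker string is copied from the error message quoted in this thread.

```typescript
// Hypothetical helper for illustration; not a CDK or AWS SDK API.
// The marker text is copied from the error reported in this issue.
const OOS_TAGS_MARKER = "DP Nodes are OOS, Tags operation is not allowed";

// Assumed, simplified shape of a CloudFormation stack event.
interface StackEventLike {
  logicalResourceId: string;
  resourceStatus: string;
  resourceStatusReason?: string;
}

// Returns the failed events whose status reason contains the marker text.
function findOosTagFailures(events: StackEventLike[]): StackEventLike[] {
  return events.filter(
    (event) =>
      event.resourceStatus.endsWith("_FAILED") &&
      (event.resourceStatusReason ?? "").includes(OOS_TAGS_MARKER),
  );
}

// Example input; the logical IDs here are made up.
const sampleEvents: StackEventLike[] = [
  { logicalResourceId: "Vpc", resourceStatus: "UPDATE_COMPLETE" },
  {
    logicalResourceId: "Domain",
    resourceStatus: "UPDATE_FAILED",
    resourceStatusReason:
      'Resource handler returned message: "Invalid request provided: ' +
      'DP Nodes are OOS, Tags operation is not allowed"',
  },
];

console.log(findOosTagFailures(sampleEvents).map((e) => e.logicalResourceId));
```

This only identifies the failure; it does not recover a stack that is already in UPDATE_ROLLBACK_FAILED.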

pahud commented 1 month ago

I am looking into this issue again.

Domain = new Domain(this, "Domain", new DomainProps { Version = EngineVersion.OPENSEARCH_1_3, ... });

Can you provide a minimal code snippet showing how you provision the 1.3 cluster and upgrade it to 2.3?

anthonyhartwig commented 1 month ago

Our git repo has undergone some changes in the past few months and I no longer have visibility to our code base when it was on version 1.3. The best I can provide is from my original comment on this issue in which we changed the engine version.

pahud commented 1 month ago

I just deployed this with version 1.3:

    const vpc = getDefaultVpc(this);
    const azCount = vpc.selectSubnets({subnetType: SubnetType.PRIVATE_WITH_EGRESS}).availabilityZones.length;
    new opensearch.Domain(this, "Domain", {
      vpc,
      vpcSubnets: [
        { subnetType: SubnetType.PRIVATE_WITH_EGRESS },
      ],
      version: opensearch.EngineVersion.OPENSEARCH_1_3, // later changed to OPENSEARCH_2_3 for the upgrade
      tlsSecurityPolicy: opensearch.TLSSecurityPolicy.TLS_1_2,
      enableVersionUpgrade: true,
      removalPolicy: RemovalPolicy.DESTROY,
      zoneAwareness: {
        enabled: true,
        availabilityZoneCount: azCount,
      },
      capacity: {
        dataNodeInstanceType: "r5.large.search",
        dataNodes: azCount,
        masterNodes: 3,
        multiAzWithStandbyEnabled: true,
        masterNodeInstanceType: "r5.large.search",
      },
      ebs: {
        volumeSize: 10,
        volumeType: ec2.EbsDeviceVolumeType.GP3,
      },
    });

After I updated this to 2.3 and ran npx cdk diff, I got this:

Resources
[~] AWS::OpenSearchService::Domain Domain Domain66AC69E0 
 └─ [~] EngineVersion
     ├─ [-] OpenSearch_1.3
     └─ [+] OpenSearch_2.3

And cdk deploy completed the domain upgrade successfully.
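As a side note, the EngineVersion values in that diff (OpenSearch_1.3, OpenSearch_2.3) are easy to compare programmatically, for example to confirm in a pre-deploy check that a change really is an upgrade. The following is a rough sketch only; `parseEngineVersion` and `isUpgrade` are hypothetical helpers, not CDK or CloudFormation APIs, and the `Name_major.minor` format is assumed from the diff output above.

```typescript
// Hypothetical helpers for illustration; not CDK or CloudFormation APIs.
interface ParsedEngineVersion {
  engine: string;
  major: number;
  minor: number;
}

// Parses strings in the form shown in the diff, e.g. "OpenSearch_1.3".
function parseEngineVersion(value: string): ParsedEngineVersion {
  const match = /^([A-Za-z]+)_(\d+)\.(\d+)$/.exec(value);
  if (match === null) {
    throw new Error(`Unrecognized engine version: ${value}`);
  }
  return {
    engine: match[1],
    major: Number(match[2]),
    minor: Number(match[3]),
  };
}

// True when `to` is a newer version of the same engine than `from`.
function isUpgrade(from: string, to: string): boolean {
  const a = parseEngineVersion(from);
  const b = parseEngineVersion(to);
  if (a.engine !== b.engine) {
    return false;
  }
  return b.major > a.major || (b.major === a.major && b.minor > a.minor);
}

console.log(isUpgrade("OpenSearch_1.3", "OpenSearch_2.3"));
```

Whether the service accepts a given upgrade path is a separate question that this check does not answer.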

I think this issue is not relevant now. I am resolving it. Feel free to create a new one if it's still relevant.

github-actions[bot] commented 1 month ago

Comments on closed issues and PRs are hard for our team to see. If you need help, please open a new issue that references this one.