(aws-elasticache): unable to make simple changes to redis cache without encountering failover error

nicksbrandon commented 3 years ago

We are unable to make simple changes, for example editing billing tags on Redis cache nodes, via CDK, even though the same change is possible via the AWS Console and CLI. To allow for easy horizontal scaling of our ElastiCache Redis clusters, i.e. by adding more shards, we have cluster mode enabled. As we tolerate cold caches and the performance of a single node per shard is sufficient, we do not deploy read replicas. When we attempt to change the billing tags via CDK we receive the error “Replication group must have at least one read replica to enable autofailover”.

Reproduction Steps

Here is the test code (I have obscured subnets etc. in the sample)

import { Stack, StackProps, Construct, Tags } from "@aws-cdk/core";
import { CfnReplicationGroup } from "@aws-cdk/aws-elasticache";

export class RedisTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Create Redis cache cluster
    const redisCluster = new CfnReplicationGroup(this, "TestCluster", {
      automaticFailoverEnabled: true,
      cacheNodeType: "cache.t2.micro",
      cacheParameterGroupName: "default.redis5.0.cluster.on",
      cacheSubnetGroupName: "xxxxxxxxx",
      engine: "redis",
      engineVersion: "5.0.6",
      numNodeGroups: 1,
      replicationGroupDescription: "Test Cluster",
      replicationGroupId: "TestCluster",
      replicasPerNodeGroup: 0,
      securityGroupIds: ["xxxxxxxx", "xxxxxxx"]
    });

    // Now add a tag
    Tags.of(redisCluster).add('Tag1', 'Value1')
  }
}

I can deploy the first part without issue. However, when I add the tagging code:

    // Now add a tag
    Tags.of(redisCluster).add('Tag1', 'Value1')

and attempt to redeploy, I get an error.

I can tag the redis cluster using the aws cli without any issue:

aws elasticache add-tags-to-resource --resource-name arn:aws:elasticache:eu-west-1:999999999999:cluster:testcluster-0001-001 --tags Key=tag2,Value=value2

I can also manually tag in the AWS console.

What did you expect to happen?

I expected the cache to be tagged without service interruption, as happens when using the CLI.

What actually happened?

I got the following error when I added the Tagging code and attempted to redeploy.

TestCluster Replication group must have at least one read replica to enable autofailover. (Service: AmazonElastiCache; Status Code: 400; Error Code: InvalidReplicationGroupState; Request ID: d74aa3d1-4ef5-472b-8eca-e5fdb508a7d4; Proxy: null)

Neither cdk synth nor cdk diff flags this issue; it only appears at deploy time.

Environment

I have tried with CDK 1.73.0 and 1.90.1 - Same result.

Here is the package.json file from the test

{
  "name": "redis_test",
  "version": "0.1.0",
  "bin": {
    "redis_test": "bin/redis_test.js"
  },
  "scripts": {
    "build": "tsc",
    "watch": "tsc -w",
    "test": "jest",
    "cdk": "cdk"
  },
  "devDependencies": {
    "@aws-cdk/assert": "1.73.0",
    "@types/jest": "^26.0.10",
    "@types/node": "10.17.27",
    "aws-cdk": "1.73.0",
    "jest": "^26.4.2",
    "ts-jest": "^26.2.0",
    "ts-node": "^8.1.0",
    "typescript": "~3.9.7"
  },
  "dependencies": {
    "@aws-cdk/aws-elasticache": "1.90.1",
    "@aws-cdk/core": "1.90.1",
    "source-map-support": "^0.5.16"
  }
}

Other

I would like to provision a Redis cache with cluster mode enabled so we can modify the number of shards if later required. That was straightforward to deploy initially as a single shard. However, I then modified the CDK code to add tags to the Redis cache. This change failed to apply, complaining that it was unable to fail over (there is only one node in the replication group). I could resolve this by adding read replicas to each shard, but the function of this cache does not require a failover node, so the additional cost is undesirable. Tagging the node in the same way outside of CDK does not require read replicas, nor does it require the node to be taken out of service.

I experimented with a Redis cache with cluster mode disabled. I can tag that cache with a subsequent modification to the CDK project. However, I cannot then add shards, so if I needed to scale I would have to destroy the cache and recreate it with cluster mode enabled. This is again not ideal.

I understand that a single shard without a read replica cannot support failover but it is not clear why CDK code demands failover for tagging the cache. Within the AWS console I can easily tag the nodes without any interruption to service.

If you could please advise what CDK configuration I am missing to make this possible that would be appreciated.

This is :bug: Bug Report

iliapolo commented 3 years ago

@nicksbrandon Thanks for reporting this. I can confirm the behavior you describe here.

I'm investigating this.

iliapolo commented 3 years ago

@nicksbrandon It seems your use case surfaced a missing validation on the creation of a replication group that has cluster mode enabled. The missing validation incorrectly allows the creation of a cluster without replicas but with auto-failover enabled. The service team is aware of the problem and is already working on a fix. Until that is resolved, please make sure to add replicas when creating an auto-failover enabled cluster.

I understand your intention was not to use replicas, but when auto-failover is enabled, this is actually a requirement.
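Until the service-side validation exists, a check like the following could be run against the props before they reach CfnReplicationGroup. This is only a sketch: the ReplicationGroupConfig interface and the validateFailoverConfig function are hypothetical names made up for illustration, not part of CDK or ElastiCache, and the only rule encoded is the one stated above (auto-failover needs at least one replica per node group).

```typescript
// Hypothetical subset of the CfnReplicationGroup props used in this issue.
interface ReplicationGroupConfig {
  automaticFailoverEnabled: boolean;
  numNodeGroups: number;
  replicasPerNodeGroup: number;
}

// Returns a list of problems; an empty list means the config passed this check.
// Hypothetical helper, not a CDK API.
function validateFailoverConfig(config: ReplicationGroupConfig): string[] {
  const problems: string[] = [];
  if (config.automaticFailoverEnabled && config.replicasPerNodeGroup < 1) {
    problems.push(
      "automaticFailoverEnabled is true but replicasPerNodeGroup is " +
        `${config.replicasPerNodeGroup}; at least one replica per node group is required`
    );
  }
  return problems;
}

// The configuration from the reproduction steps is flagged.
const fromIssue = validateFailoverConfig({
  automaticFailoverEnabled: true,
  numNodeGroups: 1,
  replicasPerNodeGroup: 0,
});

// The same configuration with one replica per shard is not.
const withReplica = validateFailoverConfig({
  automaticFailoverEnabled: true,
  numNodeGroups: 1,
  replicasPerNodeGroup: 1,
});
```

Returning a list rather than throwing lets the caller decide whether to fail the synth step or just print a warning.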

iliapolo commented 3 years ago

The use case described here actually represents an illegal state that should have never been deployed. Deployment succeeded because of a missing validation on the Elasticache service API. Keeping this issue here so we can update and resolve when a fix is available.

nicksbrandon commented 3 years ago

@iliapolo,

Many thanks for your response. It's great to hear this has uncovered a tangential issue in Elasticache.

To be clear on one point: we were unable to set automaticFailoverEnabled: false when Redis cluster mode is enabled, and it's important that we use Redis in cluster mode so that we can easily scale horizontally. This is a use case that is fully supported by Elasticache, so we were expecting the same from CDK. For reference, this issue (using automaticFailoverEnabled: false) is only reported at the cdk deploy stage.

The message is

[redis name] Redis with cluster mode enabled cannot be created with auto failover turned off

iliapolo commented 3 years ago

@nicksbrandon That's right, when cluster mode is enabled you must set automaticFailoverEnabled to true, which in turn means you must enable replicas as well.

I'm not really sure what you mean by:

"This is a use case that is fully supported by Elasticache so we were expecting the same from CDK"

It seems that your desired configuration is basically not supported by Elasticache, as evidenced by the acknowledgment of the missing validation. I think CDK (and CloudFormation) surface this problem more frequently because of their mode of operation, which invokes a full update request on every resource change. When using direct CLI invocation the issue may not happen, depending on exactly which parameters you pass.

But again this is all rooted in a faulty configuration that shouldn't have been allowed to be created in the first place.
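To illustrate the difference between a tag-only call and a full update request, here is a toy model. It is an assumption-laden sketch, not how ElastiCache or CloudFormation are actually implemented: CacheState, checkFailover, addTagsOnly and fullUpdate are all hypothetical names, and the only rule modeled is the one from the error message in this issue.

```typescript
// Toy model only: none of these names are real ElastiCache or CloudFormation APIs.
interface CacheState {
  automaticFailoverEnabled: boolean;
  replicasPerNodeGroup: number;
  tags: Record<string, string>;
}

// Stand-in for the service-side rule quoted in the error message.
function checkFailover(state: CacheState): void {
  if (state.automaticFailoverEnabled && state.replicasPerNodeGroup < 1) {
    throw new Error(
      "Replication group must have at least one read replica to enable autofailover"
    );
  }
}

// Models a tag-only call: it touches tags and never re-sends the failover setting,
// so the failover rule is never evaluated.
function addTagsOnly(current: CacheState, tags: Record<string, string>): CacheState {
  return { ...current, tags: { ...current.tags, ...tags } };
}

// Models a full update: every property is re-sent, so the failover rule is re-checked.
function fullUpdate(desired: CacheState): CacheState {
  checkFailover(desired);
  return desired;
}

// The state that was (incorrectly) allowed at creation time.
const deployed: CacheState = {
  automaticFailoverEnabled: true,
  replicasPerNodeGroup: 0,
  tags: {},
};

// Tag-only path succeeds.
const taggedDirectly = addTagsOnly(deployed, { Tag1: "Value1" });

// Full-update path with the same tag change is rejected.
let fullUpdateError = "";
try {
  fullUpdate({ ...deployed, tags: { Tag1: "Value1" } });
} catch (err) {
  fullUpdateError = (err as Error).message;
}
```

In this model the tag change itself is never the problem; the rejection comes from re-submitting a failover setting that the existing cluster cannot satisfy.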

nicksbrandon commented 3 years ago

@iliapolo

Thanks for your response.

To clarify: carrying out the same action manually (via the console) and via CDK leads to differing results on the same Redis cache.

Scenario

I want to add a tag to a Redis cache with a single node and no failover (automaticFailoverEnabled: false). There are 2 options:

1) Manually - The node is tagged with no interruption to service. Success.
2) Via CDK - It appears to attempt to take the node offline in order to carry out this action. Given there is no failover, this action fails.

I hope it is clear how these two approaches, to carry out the same action, yield different results. Many thanks.

MarcFletcher commented 3 years ago

@iliapolo

What we would like to do is deploy a Redis cluster (i.e. cluster mode enabled = true) with N nodes and disable failover (automaticFailoverEnabled: false), all via CDK. We are able to do this via the AWS console and can subsequently make any number of changes to such a cluster (e.g. editing billing tags) without taking the cluster offline, i.e. automatic failover is not required and replicas are not required.

It sounds like you might be saying that even this manual workflow should not be possible in the AWS console? If you could confirm/elaborate that would be appreciated.

Regards, Marc

iliapolo commented 3 years ago

@MarcFletcher Yes, it does sound like the console should not have allowed this configuration either. I'll let @NGL321 follow up.

github-actions[bot] commented 2 years ago

This issue has not received any attention in 1 year. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

MarcFletcher commented 2 years ago

Yea, this is still a known issue for us.
