cloudposse / terraform-aws-elasticache-redis

Terraform module to provision an ElastiCache Redis Cluster
https://cloudposse.com/accelerate
Apache License 2.0

v0.40.0 requires two runs to create & apply security group #121

Closed: syphernl closed this issue 2 years ago

syphernl commented 3 years ago

Describe the Bug

The v0.40.0 release requires two plan/apply runs before the configuration is fully applied.

In the first run, the security group is created:

  # module.redis.module.security_group.aws_security_group.default[0] will be created
  + resource "aws_security_group" "default" {
      + arn                    = (known after apply)
      + description            = "ElastiCache Security Group"
      + egress                 = (known after apply)
      + id                     = (known after apply)
      + ingress                = (known after apply)
      + name                   = "PROJECT-prod-redis"
      + name_prefix            = (known after apply)
      + owner_id               = (known after apply)
      + revoke_rules_on_delete = false
      + tags                   = {
          + "Name"      = "PROJECT-prod-redis"
          + "Namespace" = "PROJECT"
          + "Stage"     = "prod"
        }
      + tags_all               = {
          + "Name"      = "PROJECT-prod-redis"
          + "Namespace" = "PROJECT"
          + "Stage"     = "prod"
        }
      + vpc_id                 = "vpc-xxxxxxxxxxxxxxxxx"
    }

However, this security group is not attached to the replication group until the second run:

  # module.redis.aws_elasticache_replication_group.default[0] will be updated in-place
  ~ resource "aws_elasticache_replication_group" "default" {
        id                            = "PROJECT-prod-redis"
      ~ security_group_ids            = [
          - "sg-xxxxxxxxxxxx",
          + "sg-xxxxxxxxxxxx",
          - "sg-xxxxxxxxxxxx",
        ]
        tags                          = {
            "Name"      = "PROJECT-prod-redis"
            "Namespace" = "PROJECT"
            "Stage"     = "prod"
        }
        # (28 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Expected Behavior

Security Group to be created & assigned in the same run.

Configuration

module "redis" {
  source        = "git::https://github.com/cloudposse/terraform-aws-elasticache-redis.git?ref=0.40.0"
  namespace     = local.label.namespace
  stage         = local.label.stage
  name          = "redis"
  zone_id       = aws_route53_zone.private.zone_id
  dns_subdomain = "redis"

  security_group_rules = [
    {
      type                     = "egress"
      from_port                = 0
      to_port                  = 65535
      protocol                 = "-1"
      cidr_blocks              = ["0.0.0.0/0"]
      source_security_group_id = null
      description              = "Allow all outbound traffic"
    },
    {
      type                     = "ingress"
      from_port                = 6379
      to_port                  = 6379
      protocol                 = "tcp"
      cidr_blocks              = []
      source_security_group_id = module.api.ecs_service_security_group_id
      description              = "Allow inbound Redis traffic from ECS"
    },
    {
      type                     = "ingress"
      from_port                = 6379
      to_port                  = 6379
      protocol                 = "tcp"
      cidr_blocks              = []
      source_security_group_id = join(",", module.bastion.security_group_ids)
      description              = "Allow inbound Redis traffic from Bastion"
    },
  ]

  auth_token                       = random_string.redis_auth_token.result
  vpc_id                           = module.vpc.vpc_id
  subnets                          = module.dynamic_subnets.private_subnet_ids
  cluster_size                     = "1"
  instance_type                    = "cache.t3.micro"
  engine_version                   = "5.0.6"
  family                           = "redis5.0"
  apply_immediately                = true
  availability_zones               = local.availability_zones
  automatic_failover_enabled       = false
  cloudwatch_metric_alarms_enabled = true
}


marcuz commented 3 years ago

This is not just a "requires two applies" issue: upgrading from 0.39.0 to 0.40.0 causes downtime for the reason described above.

nitrocode commented 3 years ago

Did both of you run terraform state mv for the existing ElastiCache groups, or were the ElastiCache group modules upgraded without state moves?

Or was the above applied on a brand new elasticache ?
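For anyone checking this, a state move for the relocated security group would look roughly like the following. The old address is an assumption; the new address is taken from the plan output above. Verify both with `terraform state list` before moving anything:

```shell
# Confirm where the security group currently lives in state
terraform state list | grep aws_security_group

# Move it to the new submodule path introduced in v0.40.0
# (addresses are illustrative; the target matches the plan output above)
terraform state mv \
  'module.redis.aws_security_group.default[0]' \
  'module.redis.module.security_group.aws_security_group.default[0]'
```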

marcuz commented 3 years ago

Did both of you run terraform state mv for the existing ElastiCache groups, or were the ElastiCache group modules upgraded without state moves?

I did not; I just ran terraform apply. Tomorrow I can test a production-like upgrade procedure and check whether it can be done without downtime, if that helps.

syphernl commented 3 years ago

@nitrocode I did not run terraform state mv for the ElastiCache groups either.

Nuru commented 3 years ago

@syphernl Can you give us more details about why the upgrade to v0.40.0 causes 2 runs? I am curious if it is because of a propagation delay or a consequence of apply_immediately = false or something else.

What version of Terraform were you using?

As mentioned elsewhere, we have marked v0.40.0 as a pre-release and recommend using v0.39.0 for now. However, this bug could persist if it is caused by something we overlooked, which appears to be the case, so we appreciate your help in understanding it better.

syphernl commented 3 years ago

@Nuru At the time we were using Terraform 1.0.0.

In order to get this change to apply, I had to log in to the console, go to the Redis cluster, and change its security groups by hand to something else. Otherwise Terraform could not remove the old security group and would eventually fail. This manual change was made before running the apply shown above.

I don't know why it didn't apply the new security group (and remove the ones that shouldn't be there) in the first run.

Nuru commented 3 years ago

@syphernl Terraform is not aware that to remove a security group you have to remove all its associations first, so I would expect its plan to be:

  1. delete the old security group
  2. create the new security group
  3. update the Elasticache replication group

I would expect step 1 to fail because the security group was still associated, step 2 to succeed, and step 3 to fail because step 1 failed. Then you manually disassociated the old SG, so on the next run Terraform could succeed at step 1, skip step 2 because that had already succeeded, and move on to step 3. Do I have that right?
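The usual Terraform remedy for this ordering problem is a `create_before_destroy` lifecycle block on the security group, which swaps steps 1 and 2 so the replacement exists (and the replication group is re-pointed at it) before the old group is deleted. This is only a sketch of the general pattern, not the module's actual code, and as noted later in this thread no working workaround was found here because of a provider bug:

```hcl
resource "aws_security_group" "default" {
  # name_prefix instead of name, so the replacement can coexist with the
  # old group during the create-before-destroy window without a name clash
  name_prefix = "PROJECT-prod-redis-"
  description = "ElastiCache Security Group"
  vpc_id      = var.vpc_id

  lifecycle {
    # Create the new security group before destroying the old one, so the
    # delete no longer fails on a security group that is still in use.
    create_before_destroy = true
  }
}
```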

syphernl commented 3 years ago

@Nuru Sounds logical. But the whole process only failed for the first environment we updated. For all subsequent environments, we first manually disassociated the old (to-be-removed) security group from the ElastiCache cluster and temporarily assigned the VPC security group to prevent this behavior.
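For the record, that manual disassociation step can also be scripted with the AWS CLI by overwriting the replication group's security group list. This is a sketch; the replication group ID mirrors the config above and the security group ID is a placeholder:

```shell
# Re-point the replication group at a temporary security group so the old
# module-managed SG is no longer associated and can be deleted by the next
# terraform apply. The sg- ID is a placeholder for the temporary VPC SG.
aws elasticache modify-replication-group \
  --replication-group-id PROJECT-prod-redis \
  --security-group-ids sg-0123456789abcdef0 \
  --apply-immediately
```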

Nuru commented 3 years ago

I have looked into this extensively and believe it is a bug in the AWS Terraform provider. I have not been able to find a workaround. Please upvote https://github.com/hashicorp/terraform-provider-aws/issues/20139 and if you have any additional information, please share it there.

Nuru commented 2 years ago

Opened Terraform issue https://github.com/hashicorp/terraform/issues/29799 to track the problem. Will publish changes that work around the issue.