aws-cloudformation / cloudformation-coverage-roadmap

The AWS CloudFormation Public Coverage Roadmap
https://aws.amazon.com/cloudformation/
Creative Commons Attribution Share Alike 4.0 International

AWS::AutoScaling::AutoScalingGroup - Enhancement - Support Instance Refresh #2119

Open commiterate opened 2 months ago

commiterate commented 2 months ago

Feature Request

Natively support triggering and waiting for an Auto Scaling group (ASG) instance refresh when an ASG's associated launch template is updated.

This should allow configuring ASG instance refresh parameters such as checkpoints and a CloudWatch auto-rollback alarm.
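To illustrate the "waiting" half of this request, here is a rough Python sketch of the polling that users currently have to do themselves. It assumes a client shaped like boto3's Auto Scaling client (describe_instance_refreshes); the FakeAutoScalingClient class, the ASG name, and the refresh ID are hypothetical and only exist so the sketch runs without AWS access.

```python
import time

# Statuses after which an instance refresh no longer changes state.
TERMINAL_STATUSES = {
    "Successful",
    "Failed",
    "Cancelled",
    "RollbackFailed",
    "RollbackSuccessful",
}


def wait_for_instance_refresh(client, asg_name, refresh_id,
                              poll_seconds=30.0, sleep=time.sleep):
    """Poll DescribeInstanceRefreshes until a terminal status is reached.

    `client` is assumed to behave like boto3.client("autoscaling").
    """
    while True:
        response = client.describe_instance_refreshes(
            AutoScalingGroupName=asg_name,
            InstanceRefreshIds=[refresh_id],
        )
        status = response["InstanceRefreshes"][0]["Status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


class FakeAutoScalingClient:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_instance_refreshes(self, AutoScalingGroupName, InstanceRefreshIds):
        return {
            "InstanceRefreshes": [
                {
                    "InstanceRefreshId": InstanceRefreshIds[0],
                    "Status": next(self._statuses),
                }
            ]
        }


fake = FakeAutoScalingClient(["Pending", "InProgress", "Successful"])
final_status = wait_for_instance_refresh(
    fake, "my-asg", "refresh-1", sleep=lambda _seconds: None
)
print(final_status)
```

Native support would mean CloudFormation runs an equivalent loop as part of the stack update and fails (or rolls back) the update on a non-successful terminal status.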

Additional Context

Differences from CloudFormation Rolling Update

CloudFormation currently supports the UpdatePolicy attribute on the AWS::AutoScaling::AutoScalingGroup resource. This lets users specify the UpdatePolicy.AutoScalingRollingUpdate attribute which will trigger a CloudFormation rolling update.
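For reference, a minimal sketch of what the existing rolling update configuration looks like, written as a Python dict that serializes to a JSON template fragment. The logical IDs (AppAsg, AppLaunchTemplate), the subnet placeholder, and the chosen values are assumptions for illustration only.

```python
import json

# Template fragment using the existing UpdatePolicy.AutoScalingRollingUpdate attribute.
rolling_update_asg = {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
        "MinSize": "2",
        "MaxSize": "4",
        "VPCZoneIdentifier": ["subnet-EXAMPLE"],  # placeholder subnet ID
        "LaunchTemplate": {
            "LaunchTemplateId": {"Ref": "AppLaunchTemplate"},
            "Version": {"Fn::GetAtt": ["AppLaunchTemplate", "LatestVersionNumber"]},
        },
    },
    "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
            "MinInstancesInService": 1,     # keep at least one instance serving
            "MaxBatchSize": 1,              # replace one instance at a time
            "PauseTime": "PT5M",            # ISO 8601 duration between batches
            "WaitOnResourceSignals": True,  # wait for cfn-signal from new instances
        }
    },
}

template = {"Resources": {"AppAsg": rolling_update_asg}}
print(json.dumps(template, indent=2))
```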

This is different from ASG instance refresh in a few ways:

Essentially, CloudFormation rolling update is less flexible and less convenient than the native ASG instance refresh feature.

Differences from CodeDeploy

CodeDeploy supports 2 deployment types for EC2: in-place and blue/green.

Neither can be triggered by CloudFormation for EC2 CodeDeploy applications with AWS-provided CloudFormation resources or macros today.

Both require the CodeDeploy agent to be installed on EC2 instances. This agent is used to download applications onto the instance. That is, the EC2 environment is mutable since applications aren't baked into the AMI.

ASG instance refresh, on the other hand, doesn't require any system agent and supports the immutable AMI pattern.

Like ASG instance refresh, CodeDeploy supports configurable deployment strategies and CloudWatch rollback alarms.

In-Place Deployments

In-place EC2 deployments keep the same EC2 instances but replace the applications running on them via the CodeDeploy agent.

Blue/Green Deployments

Blue/green EC2 deployments create a new ASG and mutate the Elastic Load Balancer (ELB) to shift traffic from the old ASG to the new.

This can be undesirable when used with CloudFormation because CodeDeploy is managing ASGs and ELBs out of band. The ASGs shouldn't be in the CloudFormation stack. The ELB should be in the CloudFormation stack, but mutations by CodeDeploy may cause stack drift.

This also has a lot of moving parts which can make it brittle, particularly during AWS Large Scale Events (LSEs, i.e. AWS outages).

Instance refresh, on the other hand, uses an existing ASG (which can be managed with CloudFormation) much like in-place deployments.

CloudFormation currently provides the AWS::CodeDeployBlueGreen macro which allows triggering blue/green deployments for ECS CodeDeploy applications.

This currently doesn't support EC2 CodeDeploy applications, but that is out of scope for this feature request.

Use Cases

Immutable Infrastructure for Increased Availability

Mutable EC2 environments which are updated in-place with applications or OS patches are a risk to availability. Application deployment and OS patches typically aren't coordinated in these setups, with many developers only having CI/CD pipeline tracking for the former.

For example, a silent OS patch by a separate system can bring down a service even if its application or surrounding infrastructure haven't been changed.

This exact scenario has been the source of a large Amazon outage internally within the past few years. I don't have the COE ID at hand, but this outage and moving to immutable infrastructure has been discussed several times in the internal AWS-wide ops meeting (the ops meetings have been talked about publicly at re:Invent).

There are already AWS teams that practice AMI baking (e.g. Lambda. cc: @iph for Nova) which generally were not affected by this incident. Adding CloudFormation support for ASG instance refresh can help accelerate adoption of this pattern.

AWS Region Build Acceleration

During AWS region build, AWS services are brought up in dependency order.

Since CodeDeploy comes up late in region build, this:

  1. Prevents core AWS services from depending on it (for at least their core path).
  2. Slows down region build, as CodeDeploy must come up before the AWS services that depend on it can be brought up.

By letting AWS services use ASG instance refresh instead of CodeDeploy in their CloudFormation-managed applications, many AWS services will be able to build sooner.

CodeDeploy EC2 deployments have also been problematic for CI/CD pipeline services with auto-rollback support (including the Amazon-internal Pipelines service) because the CloudFormation stack update and the CodeDeploy EC2 deployment are modelled as 2 separate deployment targets. The problem comes from deployment target ordering: targets are updated in the same order during both rollforward and rollback.

For example, rollforward and rollback may do the following:

  1. Issue + wait for CloudFormation stack update.
  2. Issue + wait for CodeDeploy EC2 deployment.

During rollback, the CodeDeploy deployment may reference resources that no longer exist since the CloudFormation stack was rolled back first.
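The ordering problem can be shown with a toy model. This is only a sketch under a simplifying assumption that is mine, not from any real pipeline service: application version N needs the resources of stack version N or later. The function and target names are hypothetical.

```python
def find_violations(steps, stack_version, app_version):
    """Apply (target, version) steps in order and record broken states.

    A state is broken when the running application version is newer than
    the deployed stack version, i.e. the application may reference
    resources that no longer exist.
    """
    violations = []
    for target, version in steps:
        if target == "stack":
            stack_version = version
        else:
            app_version = version
        if app_version > stack_version:
            violations.append(
                f"after {target} -> v{version}: app v{app_version} "
                f"runs against stack v{stack_version}"
            )
    return violations


# Rollforward from v1 to v2: stack first, then application.
rollforward = find_violations([("stack", 2), ("app", 2)], stack_version=1, app_version=1)

# Rollback from v2 to v1 using the SAME target order: stack first.
rollback_same_order = find_violations([("stack", 1), ("app", 1)], stack_version=2, app_version=2)

# Rollback with the order reversed: application first.
rollback_reversed = find_violations([("app", 1), ("stack", 1)], stack_version=2, app_version=2)

print(rollforward)
print(rollback_same_order)
print(rollback_reversed)
```

In this model only the same-order rollback passes through a broken state. Folding the instance refresh into the stack update would leave a single deployment target, so there is no ordering between targets to get wrong.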

CloudFormation support for ASG instance refresh would benefit both region build and deployment rollback.

Proposed Solutions

Before talking about exact CloudFormation template interfaces, the interplay between ASG instance refresh and CloudFormation needs to be considered.

The StartInstanceRefresh API request contains a DesiredConfiguration property which lets users specify the target EC2 launch template. If specified, ASG instance refresh will mutate the ASG's launch template.

In order to prevent stack drift and correctly support ASG instance refresh auto-rollback, CloudFormation should:

  1. Not update the ASG's launch template when doing a stack update.
  2. Specify the updated launch template ID/name + version in the StartInstanceRefresh call.

Note that certain launch template settings do not work with ASG instance refresh auto-rollback (docs).
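As a rough sketch of step 2, this is the kind of StartInstanceRefresh request CloudFormation could issue. The parameter names follow the StartInstanceRefresh API as exposed by boto3; the helper function, ASG name, launch template ID, alarm name, and the specific percentages and delays are assumptions chosen for illustration.

```python
def build_start_instance_refresh_request(asg_name, launch_template_id,
                                         version, rollback_alarms):
    """Build keyword arguments for a StartInstanceRefresh call."""
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        # The refresh itself moves the ASG to the new launch template version,
        # so the stack update would not modify the ASG's launch template directly.
        "DesiredConfiguration": {
            "LaunchTemplate": {
                "LaunchTemplateId": launch_template_id,
                "Version": version,
            }
        },
        "Preferences": {
            "MinHealthyPercentage": 90,
            "CheckpointPercentages": [25, 100],  # pause at 25%, then finish
            "CheckpointDelay": 600,              # seconds to wait at a checkpoint
            "AutoRollback": True,
            "AlarmSpecification": {"Alarms": list(rollback_alarms)},
        },
    }


request = build_start_instance_refresh_request(
    asg_name="my-asg",                         # hypothetical ASG name
    launch_template_id="lt-0123456789abcdef0",  # hypothetical launch template ID
    version="7",
    rollback_alarms=["my-asg-error-rate"],     # hypothetical CloudWatch alarm
)
print(request)

# With AWS credentials available, this could be sent with boto3:
#   import boto3
#   boto3.client("autoscaling").start_instance_refresh(**request)
```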

1. UpdatePolicy.AutoScalingInstanceRefresh Attribute

To keep in line with the existing UpdatePolicy attributes, we can add an AutoScalingInstanceRefresh attribute which configures the ASG instance refresh behavior.
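A strawman of what this could look like, written as a Python dict that serializes to a JSON template fragment. To be clear, AutoScalingInstanceRefresh and every key under it are hypothetical: CloudFormation does not support this attribute today, and the property names are simply borrowed from the StartInstanceRefresh preferences. The logical IDs are also made up.

```python
import json

# HYPOTHETICAL attribute: nothing under "AutoScalingInstanceRefresh" is
# supported by CloudFormation today. Names mirror StartInstanceRefresh.
proposed_update_policy = {
    "AutoScalingInstanceRefresh": {
        "Strategy": "Rolling",
        "Preferences": {
            "MinHealthyPercentage": 90,
            "CheckpointPercentages": [50, 100],
            "CheckpointDelay": 300,
            "AutoRollback": True,
            "AlarmSpecification": {"Alarms": [{"Ref": "RollbackAlarm"}]},
        },
    }
}

asg_resource = {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
        "MinSize": "2",
        "MaxSize": "4",
        "LaunchTemplate": {
            "LaunchTemplateId": {"Ref": "AppLaunchTemplate"},
            "Version": {"Fn::GetAtt": ["AppLaunchTemplate", "LatestVersionNumber"]},
        },
    },
    "UpdatePolicy": proposed_update_policy,
}

print(json.dumps(asg_resource, indent=2))
```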

It's unclear if this approach is still endorsed by the CloudFormation team with the migration to Uluru (codename for CloudFormation Registry which has been leaked in several issues + PRs across various repositories in the aws-cloudformation group).

2. AWS::AutoScalingGroupInstanceRefresh Macro

This is in line with the relatively new AWS::CodeDeployBlueGreen macro used to support CodeDeploy blue/green deployments for ECS applications.

It's not clear if this macro approach is what the CloudFormation team endorses now with Uluru. Both CloudFormation rolling update and Lambda CodeDeploy blue/green use UpdatePolicy attributes, but these are both quite old and seem to predate Uluru.

It's also not clear what the region build implications of this are. Custom CloudFormation macros require a Lambda function but it's not clear what's needed for AWS-provided CloudFormation macros.
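For context on the Lambda dependency, a custom macro is backed by a function roughly like the sketch below. The event and response keys (requestId, fragment, status) follow the documented custom macro contract; the pass-through behavior and the sample event are only illustrative. How AWS-provided macros are implemented is not something I know, so this says nothing about them.

```python
def handler(event, context):
    """Minimal custom macro handler: return the template fragment unchanged.

    A real macro would rewrite `fragment` (for example, expand a shorthand
    into full resources) before returning it.
    """
    fragment = event["fragment"]
    return {
        "requestId": event["requestId"],  # must echo the incoming request ID
        "status": "success",
        "fragment": fragment,
    }


# Hypothetical invocation payload, trimmed to the keys the handler reads.
sample_event = {
    "requestId": "example-request-id",
    "fragment": {"Resources": {}},
}
print(handler(sample_event, None))
```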

daTobiGit commented 2 months ago

There's a separate feature request to make this macro support EC2 CodeDeploy applications (https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/2071). That is out of scope for this feature request.

No, I don't need the macro to set up EC2 blue/green in CloudFormation. I just want to set the option for CodeDeploy via CloudFormation. It can be done via the CLI or in the web console already :)

commiterate commented 2 months ago

No, I don't need the macro to set up EC2 blue/green in CloudFormation. I just want to set the option for CodeDeploy via CloudFormation. It can be done via the CLI or in the web console already :)

Apparently I didn't read carefully enough.

Removed mention of this issue from the feature request.

commiterate commented 2 months ago

Spoke with an AWS PM for ASG.

This is planned for 2025, though the release date target hasn't been decided yet.