aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

ECS CodeDeploy canary deployments #229

Closed clareliguori closed 4 years ago

clareliguori commented 5 years ago

Similar to ECS blue-green deployments with AWS CodeDeploy, but shift a percentage of production traffic to the green fleet and monitor rollback alarms, before shifting 100% of traffic.

jespersoderlund commented 5 years ago

There also needs to be a mode where you programmatically choose to promote the canary, rather than relying only on alarms. We've built this kind of orchestration on top of the existing ECS functionality, but there's a lot of complexity there that would be good to have provided by the service.

clareliguori commented 5 years ago

@jespersoderlund what gates your programmatic promotions? Integration tests, manual testing, other metric sources, etc?

CodeDeploy Hooks allow for programmatic promotion between each step in the deployment lifecycle. It sounds like you need the ability to invoke a hook when a percentage of production traffic is shifted. https://docs.aws.amazon.com/codedeploy/latest/userguide/reference-appspec-file-structure-hooks.html#appspec-hooks-ecs

jespersoderlund commented 5 years ago

We have two types of gates that we implement today, in addition to a basic "promote-if-healthy".

The problem with the hooks is that each hook is called only once. In the manual canary promotion case, we want the deployment to stop and wait for input, since it will be a completely async process with an unknown time between "stop" and "promote/rollback".

For the metrics gate, the time can also be unpredictable, since some services need longer to gather enough data to decide whether to proceed or not.

In both cases there must be a timeout with rollback.

clareliguori commented 5 years ago

The CodeDeploy hook does stop and wait for input. It does not continue the deployment based on the function's success -- the function can actually go off and trigger some other async workflow, or notify someone that a manual approval is needed. The hook waits for something (a function, an async workflow, a person) to call the PutLifecycleEventHookExecutionStatus API. The hook timeout is configurable up to an hour, default is 30 minutes, and can trigger rollbacks.
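For reference, a minimal sketch of such a hook as a Lambda function (the check_canary_health helper is hypothetical; the event fields and the PutLifecycleEventHookExecutionStatus call are the documented CodeDeploy hook interface):

```python
import boto3

codedeploy = boto3.client("codedeploy")

def check_canary_health():
    # Placeholder gate: query metrics, run integration tests, or kick off
    # a manual-approval workflow and report the result later instead.
    return True

def handler(event, context):
    # CodeDeploy invokes the hook with a deployment ID and a hook execution ID,
    # then pauses the deployment until a status is reported or the hook times out.
    deployment_id = event["DeploymentId"]
    hook_execution_id = event["LifecycleEventHookExecutionId"]

    status = "Succeeded" if check_canary_health() else "Failed"

    # This call is what resumes (or rolls back) the deployment; it can just as
    # well be made later by a separate async workflow, a person, or a tool.
    codedeploy.put_lifecycle_event_hook_execution_status(
        deploymentId=deployment_id,
        lifecycleEventHookExecutionId=hook_execution_id,
        status=status,
    )
```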

jespersoderlund commented 5 years ago

Right, that would work then! % of traffic + hooks to allow the other types of canary-promotion triggers.

dsouzajude commented 5 years ago

It would also be great to support canary deploys for services that don't need to be behind an ALB or associated with any target group. Currently Blue-Green Deployments with ECS only support services that have an associated target group and are behind an ALB.

In some cases, like ours, we have an API service through which all traffic gets routed down to downstream (backend) services, but these reside in private subnets and have no need of an ALB (traffic gets routed to them using HAProxy). We'd like to have canary deployment support for these as well, and these deploys could be monitored by custom metrics that we have in place (such as the ones we get from logs, i.e. errors, or HAProxy metrics).

We'd also like CloudFormation support for this.

Just some feedback from my side.

clareliguori commented 5 years ago

@dsouzajude (and others!) For non-load-balanced services, how would you expect the shifting behavior to look?

For example, a blue-green-ish deployment with initial canary:

  • Prior: old version is at 100% of desired count
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set new version to 100% of desired count (200% total of desired count, new version takes 50% of traffic)
  • Step 3: set old version to 0% of desired count (100% total of desired count, new version takes 100% of traffic)

Or perhaps a more linear progression to limit overprovisioning:

  • Prior: old version is at 100%
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set old version to 90% of desired count (100% total of desired count, new version takes ~10% of traffic)
  • Step 3: set new version to 20% of desired count (110% total of desired count, new version takes ~18% of traffic)
  • Step 4: set old version to 80% of desired count (100% total of desired count, new version takes ~20% of traffic)
  • And so on, adding 10% more to new version and removing 10% from old version, until new version is at 100% and old version is at 0%
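A toy calculation of those approximate traffic percentages, assuming traffic is spread evenly across all running tasks of both versions:

```python
def new_traffic_share(new_pct_of_desired, old_pct_of_desired):
    # Share of traffic the new version receives if every running task,
    # old or new, gets an equal slice of traffic.
    return new_pct_of_desired / (new_pct_of_desired + old_pct_of_desired)

print(new_traffic_share(10, 100))   # ~0.09 -> the "~9%" in Step 1
print(new_traffic_share(100, 100))  # 0.50  -> the "50%" in Step 2 of the blue-green-ish flow
print(new_traffic_share(20, 90))    # ~0.18 -> the "~18%" in Step 3 of the linear flow
```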

mridehalgh commented 5 years ago

@clareliguori one thing that would be great would be support for a different baseline. Instead of comparing against the long-running old version, the baseline would run the same version as the old one but receive the same share of traffic as the new canary, for example comparing a 5% baseline against 5% of the new version. The goal is to reduce the likelihood of anything interfering with the analysis.

For example, does a newly provisioned service run more slowly or more quickly than a warm service?

Spinnaker probably explains this better than I can: https://www.spinnaker.io/guides/user/canary/best-practices/#compare-canary-against-baseline-not-against-production

clareliguori commented 5 years ago

@mridehalgh Tell me more about how you would use the baseline and canary metrics, and where your metrics are stored (CloudWatch, other?). Spinnaker's Kayenta system compares baseline vs canary metrics using a threshold for how far apart the metric values can be to promote the deployment, while CodeDeploy uses absolute thresholds specified in CloudWatch alarms.
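To make the distinction concrete, a minimal sketch of the two evaluation styles (plain Python with made-up latency numbers, not any particular service's API):

```python
# Kayenta-style relative check: the canary is judged against a same-sized
# baseline fleet, with a bound on how far apart the two metrics may drift.
def relative_check(baseline_p99_ms, canary_p99_ms, max_ratio=1.2):
    return canary_p99_ms <= baseline_p99_ms * max_ratio

# CodeDeploy-style absolute check: a CloudWatch alarm threshold is fixed
# up front, and crossing it during the deployment triggers a rollback.
def absolute_check(canary_p99_ms, threshold_ms=500):
    return canary_p99_ms <= threshold_ms

print(relative_check(baseline_p99_ms=410, canary_p99_ms=460))  # True: within 20% of the baseline
print(absolute_check(canary_p99_ms=460))                       # True: under the fixed 500 ms bar
```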

Btw, Spinnaker does have some support for ECS, see details in issue https://github.com/aws/containers-roadmap/issues/234

dsouzajude commented 5 years ago

@dsouzajude (and others!) For non-load-balanced services, how would you expect the shifting behavior to look?

For example, a blue-green-ish deployment with initial canary:

  • Prior: old version is at 100% of desired count
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set new version to 100% of desired count (200% total of desired count, new version takes 50% of traffic)
  • Step 3: set old version to 0% of desired count (100% total of desired count, new version takes 100% of traffic)

Or perhaps a more linear progression to limit overprovisioning:

  • Prior: old version is at 100%
  • Step 1: set new version to 10% of desired count (110% total of desired count, new version takes ~9% of traffic)
  • Step 2: set old version to 90% of desired count (100% total of desired count, new version takes ~10% of traffic)
  • Step 3: set new version to 20% of desired count (110% total of desired count, new version takes ~18% of traffic)
  • Step 4: set old version to 80% of desired count (100% total of desired count, new version takes ~20% of traffic)
  • And so on, adding 10% more to new version and removing 10% from old version, until new version is at 100% and old version is at 0%

@clareliguori I would prefer a blue-green-ish deploy over a linear progression.

Another option, which we currently use in our non-ECS environment, is to specify how many instances of the new canary should be allowed to run (i.e. the desired count itself; you could also express this as a percentage of the desired count) and to mark the deployment as a "canary" deployment. During this canary deployment we observe how it performs with respect to performance and functionality (i.e. errors, expected behaviour and other custom metrics), and we let it run for X days (sometimes over the weekend or overnight) to gain more confidence in the canary deploy. Only then do we manually complete the canary deployment by switching traffic over completely to the new canary.

Since we already know that service was deployed as a canary, on the next deploy we could have the option to:

  1. Complete the canary (i.e. finish the blue-green-ish deployment of the canary as you mentioned above), or
  2. Manually increase the desired count again and wait some more to test the canary further with more instances (and more traffic to it), or
  3. Roll back the canary and shift traffic back to the old version if we are not satisfied with the results.

Hope that makes sense. I could explain more if you require more details about my use-case.

deleugpn commented 5 years ago

For me, if we at least had CloudFormation support for Blue/Green, that would be fantastic. Where I work, AWS only exists to the extent of its CloudFormation support.

clareliguori commented 5 years ago

@deleugpn yep, we're tracking that in issue https://github.com/aws/containers-roadmap/issues/130

dsouzajude commented 5 years ago

Just wanted to confirm what desiredCount would be in this case. Would it be the desiredCount set when the service was originally configured, or the desiredCount at runtime (i.e. the current desiredCount), which may have been adjusted automatically by service autoscaling?

Just wanted to add that, on promoting the canary (or during the canary), the runtime desiredCount should be used, not the one set originally. I ask because we've hit this issue before with the boto3 API: when updating the service we needed to provide a desiredCount, and it has to be the current desiredCount (which may have been changed by autoscaling), but from what I understand ECS doesn't take this into account automatically.

Thanks!

nathanpeck commented 5 years ago

@dsouzajude CodeDeploy deployments use task sets under the hood, which have a scale attribute that is a percentage of the service's desired count. So if the service's desiredCount is 10 and the service has two task sets at scale = 100%, each task set will have 10 tasks. If autoscaling occurs and increases the service's desiredCount to 11, each task set will launch an additional task so that each task set has 11 tasks.
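A minimal sketch of inspecting this with boto3 (the cluster and service names are hypothetical; treat the field names as a best-effort reading of the DescribeServices/DescribeTaskSets responses):

```python
import boto3

ecs = boto3.client("ecs")
cluster, service = "my-cluster", "my-service"  # hypothetical names

# The service-level desiredCount is the value that autoscaling adjusts.
svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
print("service desiredCount:", svc["desiredCount"])

# Each task set scales as a percentage of that desiredCount, so two task sets
# at scale 100% during a blue/green deployment each run desiredCount tasks.
for ts in ecs.describe_task_sets(cluster=cluster, service=service)["taskSets"]:
    print(
        ts["id"],
        ts["scale"],                 # e.g. {'value': 100.0, 'unit': 'PERCENT'}
        ts["computedDesiredCount"],  # the scale applied to the service's desiredCount
        ts["runningCount"],
    )
```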

ghost commented 4 years ago

@coultn do you know if this will be supported by CloudFormation when it's released?

KiamarzFallahi commented 4 years ago

We are pleased to announce that your containers hosted on Amazon Elastic Container Service (Amazon ECS) can now be updated using canary or linear deployment strategies by using AWS CodeDeploy.

For more information see our announcement, visit our new blog and see the technical documentation.
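As a hedged sketch of wiring this up with boto3 (the config name and numbers below are made up; CodeDeploy also provides predefined ECS canary and linear deployment configurations):

```python
import boto3

codedeploy = boto3.client("codedeploy")

# Custom canary configuration for ECS: shift 10% of traffic first, leave it
# there for 15 minutes while alarms and hooks evaluate the new task set,
# then shift the remaining 90%.
codedeploy.create_deployment_config(
    deploymentConfigName="Custom.ECSCanary10Percent15Minutes",  # hypothetical name
    computePlatform="ECS",
    trafficRoutingConfig={
        "type": "TimeBasedCanary",
        "timeBasedCanary": {
            "canaryPercentage": 10,  # percent of traffic shifted in the first increment
            "canaryInterval": 15,    # minutes before the rest of the traffic shifts
        },
    },
)
```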