Closed kohidave closed 4 years ago
Gosh, I am loving the sound of this. Thank you to the entire team for taking on this project. Been dreaming of a solution like this for ages. 💜
It seems scaling would not be part of the tool? (and live in manifest instead?) https://github.com/aws/copilot-cli/issues/810
Our plan is to support autoscaling with a field in the Manifest (not through a command). We'll update the issue once we have a concrete design :)
Howdy y'all! I've updated the top comment with a design proposal. I would love your feedback, so please let us know what you think.
I'm especially interested in what you think of our proposal of how we should represent scaling in the manifest and how we should help folks think about fargate spot!
Thank you.
Beautiful! Kudos for an excellent write-up :clap:
Most of our use cases are covered by some combination of CPU/memory/requests target tracking.
In addition to that, we also have quite a few cases where it would be sweet to trigger a scale-up based on SQS depth. In our setup, those SQS queues are currently managed outside of Copilot, which I assume means that they would fall into the more complicated custom CloudWatch metric category?
Also :bow: for taking Spot into consideration. The first option with separate ranges would definitely be good enough for us.
Great, but why can't I find anything about this feature in the documentation?
Hi @niros1 ! Apologies that it wasn't easy to find in the docs, here is the link: https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#count does that help?
Thanks a lot, I missed that.
Is it possible to set up auto scaling based on custom CloudWatch metrics via Copilot?
Hi @Fodoj !
We haven't yet implemented the "Advanced Target Tracking" or "Step Scaling" sections of the design above. Are you looking into using target tracking with a custom CloudWatch metric, or step scaling?
Hi @efekarakus, is there an estimated timeline for when this step scaling section will be implemented? I have all the alarms created in CloudWatch. Is there an alternative to attaching step scaling to each service through the UI?
Hi there, we are also looking for step scaling so we can control the scaling more effectively. At the time of writing, it seems like only target tracking scaling is supported in manifest.yml.
@rushim1 @rickychew77 thanks for sharing your interest in step scaling! I just created a dedicated issue (#5241) to track that feature request. If you could comment/:+1: over there, that would be great to help us prioritize.
Copilot Auto Scaling
This doc talks about introducing auto-scaling to Copilot services. Auto scaling is a feature which allows customers to automatically change the number of copies of their service (count) based on some metric, time interval, or alarm. In this design we’ll look at the types of scaling policies we want to help our customers with, how we can represent those policies in our manifest, and how we can technically implement these policies.
For sake of simplicity, we’ll assume our services are ECS/Fargate services. Unfortunately, at the time of writing this doc, we can’t take advantage of Fargate Spot (it isn’t available in CF yet) - but we’ll talk about how in the future we can allow customers to “burst” using spot capacity. EC2 services have a slew of complexities that we won’t tackle here.
Goal
The goal of this design is to gather your (our customers'!) feedback on:
As usual with any Manifest design, a meta goal is to provide a simple way for customers to tell us what they want to do while still enabling more complex configurations through overrides or more complex types.
Types of Scaling Policies
There are three different main types of scaling policies that our customers use, and we’ll talk a little about when and why folks would use each type.
Target Tracking
Target Tracking is the latest and greatest type of auto scaling policy. The way it works is a customer specifies a desired target they’d like a particular metric to stay at or below. An example would be that you want to keep the average CPU utilization at or below 70%. When the average CPU utilization rises above 70% for N datapoints, Autoscaling will start increasing the number of tasks until the average CPU utilization falls to or below 70% again.
An interesting note is that the speed at which Autoscaling creates new tasks is proportional to how far over your target threshold you are. For example, if your CPU rises to 80%, Autoscaling will provision (desired count) * 80/70 tasks. It will then pause for a period (the scale-out cooldown period), and then try again. Since autoscaling is powered by alarms under the hood, and alarms typically have a resolution of 1 minute, this limits how quickly Autoscaling can receive new data about your service and scale up.
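The proportional behaviour described above can be sketched in a few lines of Python. This is only an illustration of the (desired count) * 80/70 arithmetic, not how Application Auto Scaling is implemented; the function name and the choice to round up to a whole task are assumptions.

```python
import math


def target_tracking_desired_count(current_count: int,
                                  metric_value: float,
                                  target_value: float) -> int:
    """Approximate the new desired count for a target tracking policy.

    Hypothetical helper: scales the current count by metric/target,
    as described in the text. Rounding up is an assumption.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    if metric_value <= target_value:
        # At or below the target: no scale-out is needed.
        return current_count
    return math.ceil(current_count * metric_value / target_value)


# 10 tasks at 80% average CPU with a 70% target.
print(target_tracking_desired_count(10, 80.0, 70.0))
```

After each adjustment the service would wait out the scale-out cooldown and then re-evaluate with fresh metric data.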
additional considerations when dealing with target tracking policies
Scheduled Scaling
Scheduled Scaling initiates scaling events at certain times. With scheduled scaling, you provide either a date, rate or cron expression to trigger a scaling event. When the scaling event is triggered, it can set the desired count of your service to be at least some number, and at most another number. This means your scheduling events can trigger both scale-in (reducing the # of tasks) and scale-out (increasing the # of tasks). Practically, folks will need to create multiple scheduled scaling policies (one for scaling up, and one for scaling back down again). An example:
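As a rough sketch of the "at least some number, and at most another number" behaviour, the Python below models a pair of scheduled actions that clamp the desired count. The class, the action names, the counts, and the cron strings are all illustrative assumptions rather than a proposed manifest format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledAction:
    """Hypothetical model of one scheduled scaling action."""
    name: str
    schedule: str   # kept as an opaque string; not parsed here
    min_count: int
    max_count: int


def apply_scheduled_action(current_count: int, action: ScheduledAction) -> int:
    """Clamp the current desired count into the action's [min, max] range."""
    return max(action.min_count, min(current_count, action.max_count))


# One action to scale out in the morning, one to scale back in at night.
scale_out = ScheduledAction("weekday-morning", "cron(0 8 ? * MON-FRI *)", 10, 20)
scale_in = ScheduledAction("weekday-evening", "cron(0 18 ? * MON-FRI *)", 1, 5)

count = 2
count = apply_scheduled_action(count, scale_out)  # raised to the new minimum
print(count)
count = apply_scheduled_action(count, scale_in)   # lowered to the new maximum
print(count)
```

Because each action only sets bounds, a count that is already inside the new range is left alone, which is why two actions are needed to go up and come back down.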
Step Scaling
Step scaling allows you to scale up based on the magnitude of an alarm breach. You provide an alarm, and predefined ranges of how far the alarm metric has been breached determine how many tasks to add. You can run a scale-in version which does the opposite. It’s kind of confusing to explain, but let me give you an example.
Assume we have an alarm for service throttling exceptions. It triggers when the average number of throttles for our service is over 50/min. We could have a step scaling policy like:
Just for completeness, we’ll say our cooldown period (the time between scaling actions) is 60 seconds.
In this example, as long as the alarm metric is between 50 and 60 throttles/min, we’ll keep spinning up 1 task every 60 seconds (the cooldown period). If the metric goes up even more, like all of our requests are getting throttled, then we might want to provision tasks faster (so maybe 10 tasks per cooldown period).
You can do this with a negative version as well, for scaling in.
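To make the throttling example concrete, here is a minimal Python sketch of a scale-out step lookup. The 50/min threshold and the "+1 task between 50 and 60" step come from the example; the second step's boundary, its inclusive/exclusive edges, and all names are assumptions for illustration.

```python
from typing import List, Optional, Tuple

THRESHOLD = 50.0  # alarm threshold from the example (throttles/min)

# (lower offset, upper offset, tasks to add), measured from the threshold.
# An upper offset of None means "no upper bound". Boundaries are assumed.
Step = Tuple[float, Optional[float], int]
SCALE_OUT_STEPS: List[Step] = [
    (0.0, 10.0, 1),    # metric between 50 and 60: add 1 task
    (10.0, None, 10),  # metric at 60 or above: add 10 tasks
]


def step_adjustment(metric_value: float,
                    threshold: float = THRESHOLD,
                    steps: List[Step] = SCALE_OUT_STEPS) -> int:
    """Return how many tasks to add for one cooldown period."""
    breach = metric_value - threshold
    if breach <= 0:
        return 0  # the alarm only fires when the metric is over the threshold
    for lower, upper, tasks in steps:
        if breach >= lower and (upper is None or breach < upper):
            return tasks
    return 0


for value in (40, 55, 200):
    print(value, step_adjustment(value))
```

A scale-in policy would use the same lookup with negative adjustments and ranges below a lower threshold.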
Many step scaling use cases can be solved with target tracking.
Manifest Design
So in this design, I want to focus mostly on the Target Tracking design. We can do a deeper dive into scheduled and step scaling policies in a separate design. I suspect target tracking will be the most popular scaling approach.
As a refresher, our goals for the manifest design are to:
Simple Target Tracking
Target tracking is so common that we have a couple of built in metrics around it. This example shows us overloading the count type to reveal the min/max our service can scale, as well as some predefined scaling targets. In this example we show the three predefined scaling targets:
In this example I only show one uncommented value, but you could specify all three. Each provided value will generate its own scaling policy.
Expected Usage: most common
For our default manifests we can keep the generated count: 1, but have a commented out prod override section.
Other metrics we could build:
We may want to add another convenience method for SQS queues. Perhaps something like the below. Our CDK patterns have us scale using step scaling for SQS.
We may want to add another convenience method for target request time. (ALB only)
Advanced Target Tracking
The above simple target tracking is often expressive enough for most folks, but you can specify more sophisticated target tracking policies like this:
One large callout with custom metrics is that they often call for exact resource names. This will be difficult for us to facilitate without some sort of templating such as adding !Addons MySQSQueue.ARN or something which can help resolve addons outputs. I’ll punt on designing this for now but there are a bunch of options to look at here.
Step Scaling
In general, we’ll assume that step scaling is more of an advanced feature, so we’ll include fewer convenience methods around it. The real question here is about the alarm - where do customers generate it? They can add alarms via addons, but if the metric needs to reference the service at all, that won’t work. We’ll assume the alarm already exists in this example, but until we can figure out where/how to generate these alarms, we can’t effectively support step scaling.
Scheduled Scaling
To trim the scope of this design down, we'll take a look at scheduled scaling separately.
Spot Capacity
Allowing customers to scale using Spot Capacity (or just utilize spot capacity at all) will help folks save money while still optimizing for availability. Since our scaling work expands the count section, we can take this opportunity to think about the way that customers can tell Copilot how they’d like to use spot.
Option 1 - separate ranges
In this example, customers can provide two ranges: one for dedicated Fargate instances, and another for spot. This is nice because it allows customers to keep a number of dedicated instances up, but burst into spot. It also allows them to switch the ranges, or to opt into spot completely.
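One possible reading of "keep dedicated instances up, but burst into spot" is sketched below in Python. The fill order (dedicated first, overflow to spot), the function name, and the example ranges are all assumptions; the design does not pin these semantics down.

```python
from typing import Tuple

Range = Tuple[int, int]  # (min, max) task counts


def split_by_ranges(desired: int, dedicated: Range, spot: Range) -> Tuple[int, int]:
    """Split a desired count into (dedicated, spot) tasks.

    Assumed behaviour: honour both minimums, fill dedicated capacity up to
    its maximum, and send whatever is left to spot.
    """
    ded_min, ded_max = dedicated
    spot_min, spot_max = spot
    # Keep the total within what the two ranges can hold together.
    total = max(ded_min + spot_min, min(desired, ded_max + spot_max))
    dedicated_count = min(ded_max, total - spot_min)
    return dedicated_count, total - dedicated_count


# Keep 2-4 dedicated tasks and burst into as many as 6 spot tasks.
print(split_by_ranges(3, dedicated=(2, 4), spot=(0, 6)))
print(split_by_ranges(7, dedicated=(2, 4), spot=(0, 6)))
# Opting into spot completely: an empty dedicated range.
print(split_by_ranges(5, dedicated=(0, 0), spot=(1, 10)))
```

Swapping the two ranges in the call gives the "spot as the base, burst into dedicated" variant mentioned above.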
Option 2 - spot as percent
This option has customers not specify a particular range for spot, but instead specify a percentage that is spot. I’m not sure if this is how folks actually use Fargate Spot.
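For comparison, the percentage option reduces to a single division. The Python sketch below assumes the spot share is rounded down so that any remainder stays on dedicated capacity; that rounding choice and the function name are not part of the proposal.

```python
from typing import Tuple


def split_by_percent(total: int, spot_percent: int) -> Tuple[int, int]:
    """Split a total task count into (dedicated, spot) by a spot percentage."""
    if not 0 <= spot_percent <= 100:
        raise ValueError("spot_percent must be between 0 and 100")
    spot_count = total * spot_percent // 100  # assumed: round spot down
    return total - spot_count, spot_count


print(split_by_percent(10, 30))
print(split_by_percent(3, 50))
```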
Cost Constraints
Another bit of feedback that we've heard is allowing customers to specify a max cost instead of a range of tasks.
This option is really interesting - it'd require some precomputing on our part, and shifting capacity between spot and dedicated. We'd have to make some decisions around the breakdown of spot/dedicated tasks, which might be difficult to optimize for. The benefit is that as folks change their Fargate task size (mem/cpu), the number of tasks provisioned would automatically scale back to fit the budget (this may also be surprising).
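The precomputing mentioned above might look something like the Python sketch below, which turns an hourly budget into a maximum task count. The per-vCPU and per-GB rates are made-up placeholders, not real Fargate prices, and the sketch ignores the spot/dedicated breakdown entirely.

```python
import math

# Hypothetical hourly rates, for illustration only (not real pricing).
VCPU_RATE_PER_HOUR = 0.04
GB_RATE_PER_HOUR = 0.004


def max_tasks_for_budget(max_cost_per_hour: float, vcpu: float, memory_gb: float) -> int:
    """Largest number of tasks of the given size that fits the hourly budget."""
    task_cost = vcpu * VCPU_RATE_PER_HOUR + memory_gb * GB_RATE_PER_HOUR
    return math.floor(max_cost_per_hour / task_cost)


# The same budget buys fewer tasks once the task size grows.
print(max_tasks_for_budget(1.0, vcpu=0.25, memory_gb=0.5))
print(max_tasks_for_budget(1.0, vcpu=1.0, memory_gb=2.0))
```

This is the "surprising" side of the option: changing only the task size in the manifest would lower the ceiling on how far the service can scale.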
There may be more options that I’m not thinking of here, so please let me know if you have any awesome ideas!