aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release, and operate production-ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0

Worker service scaling from queue with a minimum range of 0 #3376

Open callikai opened 2 years ago

callikai commented 2 years ago

I am having trouble with a worker service scaling up when items are added to the SQS queue:

Manifest example:

subscribe:
  topics:
    - name: serviceTopic
      service: servicename
  queue:
    retention: 96h
    timeout: 20m
    dead_letter:
      tries: 3

count:
  range:
    min: 0
    max: 10
    spot_from: 2
  queue_delay:
    acceptable_latency: 10m
    msg_processing_time: 10m

The alarm that is created only triggers a scale-up if the queue depth is >1. My service is low-usage at times, and there is often only 0 or 1 message in the queue. Am I doing something wrong? It feels like some logic is missing: there should be at least one worker running whenever there is something in the queue.

huanjani commented 2 years ago

Hello, @callikai!

Would you mind telling me more? Are you experiencing that your number of tasks is going down to 0 even when there is a message in the queue? Are you expecting more than 0 or 1 message in the queue?

Thanks!

callikai commented 2 years ago

Hi!

The number of tasks begins at 0. When one message is added to the queue, no worker task is added and the message is not processed.

huanjani commented 2 years ago

Ahh, thank you for clarifying, @callikai!

Thanks for including your manifest-- I see that because acceptable_latency and msg_processing_time are equal, the "acceptable backlog per instance," which is calculated by acceptable_latency/msg_processing_time, is equal to 1. So you're right-- it's only launching a task if there is more than one message in the queue.
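
To make the arithmetic concrete, here is the count section from your manifest annotated with the resulting target (just a sketch of the math using the values above):

count:
  range:
    min: 0
    max: 10
  queue_delay:
    acceptable_latency: 10m    # acceptable backlog per task = 10m / 10m = 1
    msg_processing_time: 10m
# The target tracking alarm scales out only when the measured backlog per
# task is strictly greater than 1, so with zero running tasks and a single
# message in the queue the threshold is never crossed.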

Your use case seems like a common one, and we are trying to build in a way to make it possible (now that you have surfaced this-- thank you!). For now, though, could you make the min value for your range 1? We understand that you'd rather not have the one task running if there are no messages; maybe you could do spot_from: 1 to minimize costs for now.

count:
  range:
    min: 1
    max: 10
    spot_from: 1

Thank you!

callikai commented 2 years ago

As a workaround I am managing the desiredCount of the service by externally monitoring the queue, but setting the min at 1 does also work in a pinch. Thank you!

samuelduchesne commented 11 months ago

Hi @huanjani! I have a very similar use case. I was wondering if this feature has been developed? I am looking to scale down to zero when the SQS queue is empty and scale up as soon as messages start piling up. Thanks!

huanjani commented 11 months ago

Hi, @samuelduchesne (and @callikai, @aatt44zz)!

Thanks for the question. We have not developed this specific feature, but our recent feature, YAML patch overrides, provides a workaround!

  1. run copilot svc override
  2. add:
- op: replace
  path: /Resources/AutoScalingPolicyEventsQueue/Properties/TargetTrackingScalingPolicyConfiguration/TargetValue
  value: 0

to the generated cfn.patches.yml file

  3. run copilot svc deploy again

Let us know how that works for you!

samuelduchesne commented 11 months ago

Thanks! I'll give it a try! Quickly, what about the acceptable_latency/msg_processing_time equal to 1 case? Won't the scaling policy kick in only if the backlog per task is more than one (>1)?

samuelduchesne commented 11 months ago

hi @huanjani!

I applied the patch and deployed, but got the following error:

✔ Proposing infrastructure changes for stack <stack name>
- Updating the infrastructure for stack <stack name>                          [update rollback complete]  [19.0s]
  The following resource(s) failed to update: [AutoScalingPolicyEventsQueue].
  - An autoscaling policy to maintain 1 messages/task for EventsQueue         [update complete]           [0.0s]
    For target tracking scaling, target value must be between '8.51592E-109'
    and '1.174271E108', but was '0.0'. (Service: AWSApplicationAutoScaling;
    Status Code: 400; Error Code: ValidationException; Request ID:
    f7f5687f-5246-4531-b08e-27a1538f6c63; Proxy: null)
  - An autoscaling target to scale your service's desired count               [not started]
  - A custom resource returning the ECS service's running task count          [update complete]           [3.6s]
  - An ECS service to run and maintain your tasks in the environment cluster  [not started]

✘ execute svc deploy: deploy service mlfit to environment dev: deploy service: stack <stack name> did not complete successfully and exited with status UPDATE_ROLLBACK_COMPLETE

It seems that the target value must be higher than 0 (>0).

huanjani commented 11 months ago

Oh, darn! Sorry for the churn. What if it's a value between 0 and 1, so that it'll scale up if there is 1 message? I suppose that doesn't solve the problem of scaling down to 0, though. But I did just stumble upon this: https://github.com/aws/copilot-cli/issues/3054#issuecomment-1453911639!
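
For reference, that tweak would look something like this in cfn.patches.yml (0.5 is just an illustrative value between 0 and 1; as reported further down, a fractional target also makes the policy launch proportionally more tasks for a given backlog):

- op: replace
  path: /Resources/AutoScalingPolicyEventsQueue/Properties/TargetTrackingScalingPolicyConfiguration/TargetValue
  value: 0.5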

samuelduchesne commented 11 months ago

@huanjani right! A worker service can indeed scale down to 0.

I experienced the 3m delay before tasks scale up and the 15m delay before the tasks are stopped, as detailed in #3450.

In my case, I can very well have only one message in the queue. Given that acceptable_latency/msg_processing_time must be >= 1, and that the CloudWatch alarm looks at BacklogPerTask > 1 (the strictly-greater-than matters here, as opposed to greater-than-or-equal-to), the service never scales up from 0.

samuelduchesne commented 11 months ago

I can go into CloudWatch and manually change the alarm condition to greater-than-or-equal, and that seems to fix it. But that defeats the purpose of using Copilot 😜. What would a svc override look like to modify the CloudWatch alarm? Perhaps here is a good starting point?

huanjani commented 11 months ago

Hi, @samuelduchesne! Sorry for the delay-- I'm not ignoring you, I'm just deep into figuring out a good way to patch the threshold, given the fact that we're using a target tracking scaling policy, which is setting the high and low bounds. Stay tuned!

huanjani commented 11 months ago

All right.... There isn't a quick and easy YAML patch that we can do to just make the threshold > into >= like you can in the console, unfortunately. This is a similar question: https://stackoverflow.com/questions/55730247/set-the-cloudwatch-alarm-high-and-low-thresholds-for-aws-fargate. @nathanpeck's solution of using step scaling instead of target tracking would work, but is more surgery than you probably want to undertake.
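
For anyone who does want to attempt that surgery, here is a rough, untested sketch of a YAML patch that adds a step-scaling "wake up from zero" path alongside the existing target tracking policy (which is left in place to handle scale-in). The logical IDs AutoScalingTarget and EventsQueue are assumptions about the generated worker service template, so check the output of copilot svc package before using anything like this:

- op: add
  path: /Resources/ScaleOutFromZeroPolicy
  value:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: ScaleOutFromZeroEventsQueue
      PolicyType: StepScaling
      ScalingTargetId:
        Ref: AutoScalingTarget                # assumed logical ID of the scalable target
      StepScalingPolicyConfiguration:
        AdjustmentType: ChangeInCapacity
        Cooldown: 60
        MetricAggregationType: Average
        StepAdjustments:
          - MetricIntervalLowerBound: 0
            ScalingAdjustment: 1              # add one task when the alarm fires
- op: add
  path: /Resources/ScaleOutFromZeroAlarm
  value:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Launch a task as soon as at least one message is visible
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value:
            Fn::GetAtt: [EventsQueue, QueueName]   # assumed queue logical ID
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold   # >=, unlike the target tracking alarm
      AlarmActions:
        - Ref: ScaleOutFromZeroPolicy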

I'm curious whether you tried setting the target value to, say, .5, or something else between 0 and 1, and whether that worked. I tested the above patch (after fixing the capitalization, sorry) and my worker service deployed with the desired target value. I haven't yet tested the autoscaling with messages.

samuelduchesne commented 11 months ago

@huanjani I quickly tried the override with 0.5 instead of the default 1, and twice as many tasks were launched! For instance, I had 974 messages in the queue. My acceptable_latency/msg_processing_time is equal to 2.

[Screenshot 2023-09-22 at 3 11 17 PM]

huanjani commented 11 months ago

Oh jeez! Thanks for trying it out. I'd say for now use your CloudWatch console adjustment if that's giving you what you need, and we'll keep looking into this! Sorry for all the back-and-forth!

samuelduchesne commented 11 months ago

Hi @huanjani! I wanted to give you an update on my exploration. Setting up the Worker Service with a range of 0-x does in fact scale down to zero. I have a BacklogPerTask of 3. My messages take ~50s to process. I plotted a graph similar to https://github.com/aws/copilot-cli/issues/3054#issuecomment-1453911639, but as shown by the green line (number of tasks in the service), the scale-down process takes quite a while and follows a conservative curve. That can be quite expensive in my case. My question then becomes: how can we make that scale-down happen faster and be much more pronounced?

[Screenshot 2023-09-29 at 8 08 05 AM]

Lou1415926 commented 11 months ago

Hey @samuelduchesne! I recommend trying out count.cooldown.in first. The default scale-in cooldown is 120s. On what the "cooldown" field does:

For scale-in events, the intention is to scale in conservatively to protect your application's availability, so scale-in activities are blocked until the cooldown period has expired. However, if another alarm initiates a scale-out activity during the scale-in cooldown period, Service Auto Scaling scales out the target immediately. In this case, the scale-in cooldown period stops and doesn't complete. (doc)

You can specify a shorter cooldown period to speed up the scale-in!
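
For example, a manifest along these lines (the durations are illustrative, not a recommendation) shortens the scale-in cooldown from the 120s default:

count:
  range:
    min: 0
    max: 10
  cooldown:
    in: 30s    # scale-in cooldown (default 120s)
    out: 60s   # scale-out cooldown
  queue_delay:
    acceptable_latency: 10m
    msg_processing_time: 10m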

furkan3ayraktar commented 5 months ago

We have a similar use case where we would like to scale from 0 tasks. If the acceptable backlog per task is greater than or equal to one, no tasks are initialised until the number of messages in the queue exceeds that backlog-per-task value. Any solutions or workarounds for this?