st2CICD Reducing costs - GitHub issues

amanda11 commented 2 years ago

Our current AWS CI/CD server costs approximately $122.40 plus tax per month in EC2 charges alone (other costs such as EBS are not included). It is a c5.xlarge instance running 24x7; the figure above is based on a 30-day month.
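The monthly figure can be reproduced with simple arithmetic. This is a minimal sketch: the hourly rate is not stated in the thread and is only inferred here from the quoted monthly cost, and the `monthly_cost` helper is hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30        # the thread bases its figure on a 30-day month
MONTHLY_COST_USD = 122.4   # quoted pre-tax EC2 cost for the c5.xlarge

hours_per_month = HOURS_PER_DAY * DAYS_PER_MONTH

# Implied hourly rate (an inference from the quoted total, not a quoted price).
implied_hourly_rate = MONTHLY_COST_USD / hours_per_month


def monthly_cost(hourly_rate: float, hours_up: float) -> float:
    """Pre-tax EC2 cost for a given number of running hours in the month."""
    return hourly_rate * hours_up


print(hours_per_month, round(implied_hourly_rate, 4))
```

Any reduction in running hours feeds straight into `monthly_cost`, which is why the options below focus on keeping the server down for part of the week.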

The actual CI/CD runs currently are:

unstable Tuesday 3pm UTC - for approx 40 mins
stable Wednesday 1pm UTC - for approx 40 mins
unstable Friday 3pm UTC - for approx 40 mins
Each day at 1am UTC it also runs a check to see if any of the instances created to run the CI jobs need deleting.

We have a number of ways we could reduce this cost, some of these can be combined:

Change the date/time of runs and then keep the server down for part of the week, but up long enough to allow debugging
Change the date/time of runs and get the logs from CI runs onto an S3 bucket, and then keep the server down for longer
Move to a cheaper instance type, e.g. c5a.xlarge
Rebuild a new server with the latest ST2 and then downsize further

Once we lose our credits, and once tax is added on top, it would be good to have reduced this cost.

Interested in people's preferences and thoughts.

amanda11 commented 2 years ago

One possible change, which would reduce the pre-tax EC2 cost of the CI/CD server to $81 or $74, would be the following:

Switch off the CI/CD server from Friday 4pm UTC to Monday 4am UTC
Move the CI runs so that Unstable runs Monday + Thursday, and keep Stable on Wednesday. This means we still have the 2am runs the following days to delete servers, and servers are up for most of the week to enable troubleshooting.
This reduces the cost to $81.6 + tax for c5.xlarge, and $74 for c5a.xlarge

However, if we use the AWS Instance Scheduler, that would probably cost us $10 a month (https://docs.aws.amazon.com/solutions/latest/instance-scheduler-on-aws/cost.html).

Or, for our use case, we should be able to achieve the same for less with just Lambda functions, e.g. https://aws.amazon.com/premiumsupport/knowledge-center/start-stop-lambda-eventbridge/, as we don't really need Lambda functions running every 5 minutes every day, which is what the Instance Scheduler does.
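A rough sketch of what such a Lambda function could look like is below. It is not the function from the linked article: the event shape `{"action": "start" | "stop"}`, the `INSTANCE_IDS` environment variable and the optional `ec2` parameter are all assumptions made for illustration. Only `start_instances` and `stop_instances` are real boto3 EC2 client calls.

```python
import os


def make_ec2_client():
    # boto3 is imported lazily so the handler logic can be exercised
    # locally with a stand-in client and no AWS credentials.
    import boto3
    return boto3.client("ec2")


def lambda_handler(event, context, ec2=None):
    """Start or stop the instances listed in the INSTANCE_IDS env variable.

    The event shape {"action": "start" | "stop"} is hypothetical; each
    scheduled EventBridge rule would pass it as a constant input.
    """
    action = event.get("action")
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action!r}")

    raw_ids = os.environ.get("INSTANCE_IDS", "")
    instance_ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
    if not instance_ids:
        raise ValueError("INSTANCE_IDS is empty")

    ec2 = ec2 or make_ec2_client()
    if action == "start":
        ec2.start_instances(InstanceIds=instance_ids)
    else:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"action": action, "instances": instance_ids}
```

With this shape, two EventBridge rules (one for start, one for stop) can share a single function, and the function only runs twice a week instead of every 5 minutes.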

This might be an easy win to reduce costs quickly; we could then move on to a solution where we get the logs off the boxes and so don't need the CI servers for debugging.

Also, perhaps the ST2 workflow that runs on the CI/CD server and deletes running EC2 instances older than 6 hours could be removed and replaced by a Lambda function that deletes any that are running at particular times of day? Then we wouldn't need to keep the CI server up for 6 hours after a run.
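The selection step of such a clean-up function could be sketched as below. This is only an illustration under assumptions: `stale_instance_ids` is a hypothetical helper, the input is assumed to have the shape returned by boto3's `describe_instances`, and how the CI instances are told apart from everything else (for example by a tag filter on that call) is left to the caller.

```python
from datetime import datetime, timedelta, timezone


def stale_instance_ids(response, now, max_age=timedelta(hours=6)):
    """Return IDs of running instances launched more than max_age ago.

    `response` is expected to look like a describe_instances result:
    {"Reservations": [{"Instances": [{"InstanceId", "LaunchTime", "State"}]}]}
    """
    stale = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance["State"]["Name"] != "running":
                continue  # stopped or already terminating
            if now - instance["LaunchTime"] > max_age:
                stale.append(instance["InstanceId"])
    return stale
```

The returned IDs would then be passed to `terminate_instances(InstanceIds=...)` on the EC2 client, with `now` taken as `datetime.now(timezone.utc)` inside the handler.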

We could also tighten the schedule even more, e.g. just have both stable and unstable run on the same day and once a week, e.g.

Tuesday 1pm UTC run stable
Tuesday 3pm UTC run unstable
Wednesday 2am run clean up job

Then we could keep the servers up just from Tuesday 10am UTC to Thursday 10pm UTC. This would cost approx $40 + tax a month for the c5a.xlarge instance and give us 54 hours after a failed run to debug. It could be an interim solution until we rebuild with a newer ST2 version and a smaller instance size, and get logs off the box so we can investigate a failure without the EC2 instance being up.
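The window length, debugging time and monthly cost quoted above can be checked with a short calculation. A sketch under assumptions: the c5a.xlarge hourly rate of $0.154 is not given in the thread and is assumed here, the month is taken as 30 days, and the helper names are hypothetical.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
HOURS_PER_WEEK = 168
C5A_XLARGE_HOURLY_USD = 0.154  # assumed on-demand rate, not stated in the thread


def hour_of_week(day, hour, minute=0):
    """Hours elapsed since Monday 00:00 for a given weekday and time."""
    return DAYS.index(day) * 24 + hour + minute / 60


def hours_between(start, end):
    """Length of the window from start to end, wrapping around the week."""
    return (end - start) % HOURS_PER_WEEK


server_up = hour_of_week("Tue", 10)
server_down = hour_of_week("Thu", 22)
run_end = hour_of_week("Tue", 15, 40)  # unstable run at 3pm plus ~40 minutes

uptime_hours = hours_between(server_up, server_down)
debug_hours = hours_between(run_end, server_down)
# Weekly uptime scaled to a 30-day month.
monthly_usd = C5A_XLARGE_HOURLY_USD * uptime_hours * 30 / 7

print(uptime_hours, round(debug_hours, 1), round(monthly_usd, 2))
```

Under these assumptions the window is 60 hours a week, which leaves a little over 54 hours after the last run and comes to just under $40 a month.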

winem commented 2 years ago

A solution that is a bit less flexible compared to the AWS Instance Scheduler is to use Scheduled scaling for EC2 Auto Scaling: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scheduled-scaling.html

It allows you to configure any ASG capacity (min, max and desired) to be applied at any time with a cron-like syntax.

Let me just share a screenshot to show the capabilities: [Screenshot from 2022-06-28 16-12-02]

So, when we say

The actual CI/CD runs currently are:

    unstable Tuesday 3pm UTC - for approx 40 mins
    stable Wednesday 1pm UTC - for approx 40 mins
    unstable Friday 3pm UTC - for approx 40 mins
    Each day at 1am UTC it also runs a check to see if any of the instances created to run the CI jobs need deleting.

We could have a config like

start every Tuesday 2:30 pm UTC
start every Wednesday 12:30 pm UTC
start every Friday 2:30 pm UTC

The shutdown would be scheduled for a later point in time, which depends on whether we want to keep the instances alive for debugging or export the relevant logs to S3, for example.

The daily check ("Each day at 1am UTC it also runs a check to see if any of the instances created to run the CI jobs need deleting.") may be moved to a Lambda triggered once a day, if it's running on an instance that is not needed otherwise.

Terraform documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_schedule

An example to start an instance every Tuesday at 1 pm and shut it down at 3 pm again:

# recurrence uses cron syntax with 24-hour times, evaluated in UTC unless a time zone is set
resource "aws_autoscaling_schedule" "unstable-start" {
  scheduled_action_name  = "unstable_start"
  min_size               = 1
  max_size               = 1
  desired_capacity       = 1
  recurrence             = "0 13 * * Tue" # 13:00 = 1 pm
  autoscaling_group_name = ...
}

# here we just reduce the capacity to 0 to make sure that AWS shuts down all EC2 resources
resource "aws_autoscaling_schedule" "unstable-stop" {
  scheduled_action_name  = "unstable_stop"
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
  recurrence             = "0 15 * * Tue" # 15:00 = 3 pm
  autoscaling_group_name = ...
}

amanda11 commented 2 years ago

@winem Doesn't the above terminate the instance rather than just shut it down? https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html - "and launches or terminates the instances as needed". I think in our case we just want to start/stop it rather than create and terminate it.

arm4b commented 2 years ago

I think it's a great idea with the lambda function to start/stop the instance on a schedule :+1:

amanda11 commented 1 year ago

Agreed in the TSC meeting: go-ahead to start/stop via Lambda and keep the CI server up for just a few days a week.

amanda11 commented 1 year ago

Schedule of new builds (times based on US/Pacific):

Tues @ 18:00 Orphan run (cicd repo)
Wed @ 6:00 Stable run (cd repo - no change)
Wed @ 8:00 Unstable run (ci repo)
Wed @ 18:00 Orphan run (cicd repo)
Thur @ 8:00 Unstable run (ci repo)
Thur @ 18:00 Orphan run (cicd repo)

Schedule for AWS Lambda to control the st2cicd EC2 instance:

Start Tues@16:00
Stop Thurs@20:00

This reduces the uptime to 52 hrs a week instead of 168.
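Since scheduled rules are normally expressed in UTC, the US/Pacific times above need converting. A small sketch, assuming a fixed UTC-8 offset (Pacific standard time, ignoring daylight saving); the `to_utc` helper is hypothetical.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
HOURS_PER_WEEK = 168
UTC_OFFSET_HOURS = -8  # assumption: US/Pacific on standard time (PST)


def to_utc(day, hour):
    """Convert a local (weekday, hour) pair to a (weekday, hour) pair in UTC."""
    local = DAYS.index(day) * 24 + hour
    utc = (local - UTC_OFFSET_HOURS) % HOURS_PER_WEEK
    return DAYS[utc // 24], utc % 24


start_utc = to_utc("Tue", 16)
stop_utc = to_utc("Thu", 20)

# Weekly uptime is the same in either time zone.
weekly_hours = (DAYS.index("Thu") * 24 + 20) - (DAYS.index("Tue") * 24 + 16)

print(start_utc, stop_utc, weekly_hours, f"{weekly_hours / HOURS_PER_WEEK:.0%}")
```

Under that offset the start lands at midnight UTC at the end of Tuesday and the stop at 4am UTC on Friday, with the server up for 52 of 168 hours, or about 31% of the week.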

amanda11 commented 1 year ago

Lambda functions and rules are set up. Monitoring changed to just Weds. Awaiting stop on Friday 4am UTC, and start at Tues midnight UTC.

StackStorm / community

st2CICD Reducing costs #104