RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Build cost measurements #19099

Closed jwnimmer-tri closed 6 months ago

jwnimmer-tri commented 1 year ago

With an eye toward reducing our CI spend, I think the first task would be to collect a survey of where the AWS money is going.

The initial step is to break down the cost: EC2 hours vs. S3 vs. I/O, etc.

Assuming the big-ticket item is EC2 hours, then we try to break that down by category: Experimental vs Continuous vs Nightly, Provisioned vs Unprovisioned, Everything vs OSS, etc. We think Jenkins may have logs / timings of how long each build ran, which is a fair approximation of EC2 hours. Also we have some 24/7 servers (i.e., not build runners) that should appear in the tally.

The victory condition here is a write-up that identifies the biggest-ticket items driving our CI spend.

BetsyMcPhail commented 1 year ago

While investigating the AWS spend, we found the following clean-up issues:

BetsyMcPhail commented 1 year ago

Approximate average monthly cost from Oct 2022 - March 2023

Total: $9,219

| Category | Monthly cost |
| --- | --- |
| EC2 | $6,317 |
| EC2 Static Volumes (license, Jenkins, girder, CDash, cache) | $1,394 (2) |
| Relational Database Service | $927 (1) |
| Other | $548 |

EC2 breakdown:

| Category | Monthly cost |
| --- | --- |
| Static instances (license, Jenkins, girder, CDash, cache) | $3,725 |
| Weekly | $212 |
| Nightly | $840 |
| Continuous | $595 |
| Pre-Merge/Experimental | $945 |

EC2 static instance breakdown:

| Instance | Monthly cost |
| --- | --- |
| drake-license | $86.63 |
| drake-jenkins | $693.06 |
| drake-mongo | $86.63 |
| drake-webdav (cache) | $2,079.17 |
| drake-cdash | $693.06 (3) |
| drake-girder | $86.63 |

(1) Where is the Relational Database Service used?
(2) The cache server volume is 90% of this cost. Its size has been reduced from 16,000 GB to ~1,000 GB, so the cost should be greatly reduced going forward.
(3) This is currently an m5n.4xlarge; we should be able to switch to a less expensive instance type.
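As a sanity check on footnote (2), the volume cost roughly matches a flat per-GB EBS rate. The $0.08/GB-month figure below is an assumed gp3-class price, not a number from this thread; check current AWS pricing for the real rate.

```python
# Rough EBS volume cost estimate for the cache server resize in footnote (2).
# PRICE_PER_GB_MONTH is an assumed gp3-class rate, not taken from the thread.
PRICE_PER_GB_MONTH = 0.08  # USD per GB-month (assumption)

def monthly_volume_cost(size_gb: float) -> float:
    """Monthly EBS cost for a volume of the given size, at the assumed rate."""
    return size_gb * PRICE_PER_GB_MONTH

before = monthly_volume_cost(16_000)  # old 16,000 GB volume
after = monthly_volume_cost(1_000)    # resized ~1,000 GB volume

print(f"before: ${before:,.0f}/mo  after: ${after:,.0f}/mo  "
      f"savings: ${before - after:,.0f}/mo")
```

At the assumed rate, the old 16,000 GB volume works out to about $1,280/month, which is close to the "90% of $1,394" quoted in footnote (2), so the flat-rate model seems like a reasonable approximation here.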

My notes: https://docs.google.com/spreadsheets/d/1jUam-Ne0IiXe4wKYHbjIcTGW8OAyc6oDFs48WSB3ogI/edit#gid=280555492

BetsyMcPhail commented 1 year ago

@jwnimmer-tri I updated the spot instance potential savings above. Let me know if you need any more information; otherwise, I'll pass this issue off to you.

jwnimmer-tri commented 1 year ago

> drake-cdash $693.06 (3)
> (3) This is currently an m5n.4xlarge; we should be able to switch to a less expensive instance type.

This sounds good. Assuming it's a pretty quick change, please go ahead.

I'll see if I can get more info from AWS consultants about a cheaper way to do the cache server.

The rest, we'll leave alone.

BetsyMcPhail commented 1 year ago

Link to "Recommendations for EC2 Instances" https://us-east-1.console.aws.amazon.com/compute-optimizer/home?region=us-east-1#/resources-lists/ec2

BetsyMcPhail commented 1 year ago

I spoke to one of the maintainers of CDash. A new version of CDash will be released soon. He suggested that we wait until it is available and do the upgrade at the same time as moving to a different instance type.

jwnimmer-tri commented 1 year ago

Checking the EC2 dashboard for our cache server, it says:

Following the link, it says:

[screenshot not preserved]

jwnimmer-tri commented 1 year ago

From f2f: I think it's worth trying r6g.8xlarge and seeing what happens.

jwnimmer-tri commented 1 year ago

> I spoke to one of the maintainers of CDash. A new version of CDash will be released soon. He suggested that we wait until it is available and do the upgrade at the same time as moving to a different instance type.

I've moved this request to #19605 instead.

Last night we changed the cache server to the smaller instance type. Unless we see any problems with the new server in the next few days, we can close this ticket.

jwnimmer-tri commented 1 year ago

Seems like smooth sailing. Calling this finished.

jwnimmer-tri commented 7 months ago

We're going to revisit the cache server cost.

williamjallen commented 7 months ago

The cache server currently costs about $1,200/month for an r6g.8xlarge instance with a dedicated 12 Gb/s network connection, 32 vCPUs, and 256 GB of RAM.

Here's a basic overview of the network traffic for the last week, binned by 5-minute intervals, for the instance: [chart not preserved]

Based on some rough estimations, that means that even at peak times the network connection is almost never saturated. CPU usage hovers around 5% on average, and no memory usage statistics are provided for the instance so it is challenging to estimate how much is used.

Based on the resource information available, it seems like an instance type with a "bursty" network connection and only a few vCPUs would be more appropriate. See the information here about instance specs.

A couple of instance types in particular look like they could be a better fit:

It's worth noting that the x2iedn.xlarge type also has a 1 x 118 GB NVMe drive attached, which will presumably offer better disk I/O performance.

We won't really know what we need or what is best until we try it, but given the potential for massive savings, I think it would be worthwhile to give r6in.xlarge a try, and then increase the size later on if the reduced memory proves to be a bottleneck.
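For a rough sense of the savings at stake, a back-of-the-envelope monthly comparison follows. The hourly rates are approximate us-east-1 on-demand figures and are assumptions on my part, not numbers from this thread; always check current AWS pricing before acting on them.

```python
# Back-of-the-envelope monthly cost comparison for the downsizing proposal.
# Hourly rates below are assumed approximations, not quotes from AWS.
HOURS_PER_MONTH = 730  # average hours in a month

rates = {
    "r6g.8xlarge": 1.61,   # current cache server (assumed on-demand rate)
    "r6in.xlarge": 0.35,   # proposed replacement (assumed on-demand rate)
}

for name, hourly in rates.items():
    print(f"{name}: ~${hourly * HOURS_PER_MONTH:,.0f}/mo")
```

At these assumed rates, the r6g.8xlarge lands near the ~$1,200/month figure quoted above, which is a useful sanity check, and the r6in.xlarge would cut the bill to roughly a quarter of that.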

jwnimmer-tri commented 6 months ago

To help guide the discussion...

The criteria that probably most directly affect our performance:

Note that local disk speed is probably not relevant; the only relevant disk criteria are (1) a big enough size and (2) its effect on cost.

jwnimmer-tri commented 6 months ago

I take it back... the working set will never fit in RAM; we'll always be pulling some stuff from disk.

Let's try r6in.xlarge and see how it goes.

williamjallen commented 6 months ago

The cache server has been rebuilt. For now, I left the stopped instance in AWS in case we need to revert for whatever reason. We can delete it once we're confident that the new cache server is functioning.

A few items that came up during the process:

BetsyMcPhail commented 6 months ago

Cache server update seems successful so far, closing.