RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Build cost measurements #19099

Closed jwnimmer-tri closed 6 months ago

jwnimmer-tri commented 1 year ago

With an eye toward reducing our CI spend, I think the first task would be to collect a survey of where the AWS money is going.

The initial step is to break down the cost: EC2 hours vs. S3 vs. I/O, etc.

Assuming the big-ticket item is EC2 hours, then we try to break that down by category: Experimental vs Continuous vs Nightly, Provisioned vs Unprovisioned, Everything vs OSS, etc. We think Jenkins may have logs / timings of how long each build ran, which is a fair approximation of EC2 hours. Also we have some 24/7 servers (i.e., not build runners) that should appear in the tally.

The victory condition here is a write-up that identifies the biggest-ticket items driving our CI spend.

BetsyMcPhail commented 1 year ago

While investigating the AWS spend, we found the following clean-up issues:

BetsyMcPhail commented 1 year ago

Approximate average monthly cost from Oct 2022 - March 2023

Total: $9,219

| Category | Monthly cost |
| --- | --- |
| EC2 | $6,317 |
| EC2 Static Volumes (license, Jenkins, girder, CDash, cache) | $1,394 (2) |
| Relational Database Service | $927 (1) |
| Other | $548 |

EC2 breakdown:

| Category | Monthly cost |
| --- | --- |
| Static instances (license, Jenkins, girder, CDash, cache) | $3,725 |
| Weekly | $212 |
| Nightly | $840 |
| Continuous | $595 |
| Pre-Merge/Experimental | $945 |

EC2 static instance breakdown:

| Instance | Monthly cost |
| --- | --- |
| drake-license | $86.63 |
| drake-jenkins | $693.06 |
| drake-mongo | $86.63 |
| drake-webdav (cache) | $2,079.17 |
| drake-cdash | $693.06 (3) |
| drake-girder | $86.63 |

(1) Where is the Relational Database Service used?
(2) The cache server volume is 90% of this cost. Its size has been reduced from 16,000 GB to ~1,000 GB, so the cost should be greatly reduced going forward.
(3) This is currently an m5n.4xlarge; we should be able to switch to a less expensive instance type.
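As a sanity check on footnote (2), the volume cost roughly matches a flat per-GB EBS rate. The $0.08/GB-month figure below is an assumed gp3-class price, not a number from this thread; check current AWS pricing for the real rate.

```python
# Rough EBS volume cost estimate for the cache server resize in footnote (2).
# PRICE_PER_GB_MONTH is an assumed gp3-class rate, not taken from the thread.
PRICE_PER_GB_MONTH = 0.08  # USD per GB-month (assumption)

def monthly_volume_cost(size_gb: float) -> float:
    """Monthly EBS cost for a volume of the given size, at the assumed rate."""
    return size_gb * PRICE_PER_GB_MONTH

before = monthly_volume_cost(16_000)  # old 16,000 GB volume
after = monthly_volume_cost(1_000)    # resized ~1,000 GB volume

print(f"before: ${before:,.0f}/mo  after: ${after:,.0f}/mo  "
      f"savings: ${before - after:,.0f}/mo")
```

At the assumed rate, the old 16,000 GB volume works out to about $1,280/month, which is close to the "90% of $1,394" quoted in footnote (2), so the flat-rate model seems like a reasonable approximation here.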

My notes: https://docs.google.com/spreadsheets/d/1jUam-Ne0IiXe4wKYHbjIcTGW8OAyc6oDFs48WSB3ogI/edit#gid=280555492

BetsyMcPhail commented 1 year ago

@jwnimmer-tri I updated the spot instance potential savings above. Let me know if you need any more information; otherwise, I'll pass this issue off to you.

jwnimmer-tri commented 1 year ago

> drake-cdash $693.06 (3)
> (3) This is currently an m5n.4xlarge; we should be able to switch to a less expensive instance type.

This sounds good. Assuming it's a pretty quick change, please go ahead.

I'll see if I can get more info from AWS consultants about a cheaper way to do the cache server.

The rest, we'll leave alone.

BetsyMcPhail commented 1 year ago

Link to "Recommendations for EC2 Instances" https://us-east-1.console.aws.amazon.com/compute-optimizer/home?region=us-east-1#/resources-lists/ec2

BetsyMcPhail commented 1 year ago

I spoke to one of the maintainers of CDash. A new version of CDash will be released soon. He suggested that we wait until it is available and do the upgrade at the same time as moving to a different instance type.

jwnimmer-tri commented 1 year ago

Checking the EC2 dashboard for our cache server, it says:

Following the link, it says:

[screenshot not preserved]

jwnimmer-tri commented 1 year ago

From f2f: I think it's worth trying r6g.8xlarge and seeing what happens.

jwnimmer-tri commented 1 year ago

> I spoke to one of the maintainers of CDash. A new version of CDash will be released soon. He suggested that we wait until it is available and do the upgrade at the same time as moving to a different instance type.

I've moved this request to #19605 instead.

Last night we changed the cache server to the smaller instance type. Unless we see any problems with the new server in the next few days, we can close this ticket.

jwnimmer-tri commented 1 year ago

Seems like smooth sailing. Calling this finished.

jwnimmer-tri commented 7 months ago

We're going to revisit the cache server cost.

williamjallen commented 7 months ago

The cache server currently costs about $1,200/month for an r6g.8xlarge instance with a dedicated 12 Gb/s network connection, 32 vCPUs, and 256 GB of RAM.

Here's a basic overview of the network traffic for the last week, binned by 5-minute intervals, for the instance: [chart not preserved]

Based on some rough estimations, that means that even at peak times the network connection is almost never saturated. CPU usage hovers around 5% on average, and no memory usage statistics are provided for the instance so it is challenging to estimate how much is used.

Based on the resource information available, it seems like an instance type with a "bursty" network connection and only a few vCPUs would be more appropriate. See the information here about instance specs.

A couple of instance types in particular look like they could be a better fit:

It's worth noting that the x2iedn.xlarge type also has a 1 x 118 GB NVMe drive attached, which will presumably offer better disk I/O performance.

We won't really know what we need or what is best until we try it, but given the potential for massive savings, I think it would be worthwhile to give r6in.xlarge a try, and then increase the size later on if the reduced memory proves to be a bottleneck.
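For a rough sense of the savings at stake, a back-of-the-envelope monthly comparison follows. The hourly rates are approximate us-east-1 on-demand figures and are assumptions on my part, not numbers from this thread; always check current AWS pricing before acting on them.

```python
# Back-of-the-envelope monthly cost comparison for the downsizing proposal.
# Hourly rates below are assumed approximations, not quotes from AWS.
HOURS_PER_MONTH = 730  # average hours in a month

rates = {
    "r6g.8xlarge": 1.61,   # current cache server (assumed on-demand rate)
    "r6in.xlarge": 0.35,   # proposed replacement (assumed on-demand rate)
}

for name, hourly in rates.items():
    print(f"{name}: ~${hourly * HOURS_PER_MONTH:,.0f}/mo")
```

At these assumed rates, the r6g.8xlarge lands near the ~$1,200/month figure quoted above, which is a useful sanity check, and the r6in.xlarge would cut the bill to roughly a quarter of that.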

jwnimmer-tri commented 6 months ago

To help guide the discussion...

The criteria that probably most directly affect our performance:

Note that local disk speed is probably not relevant; the only relevant disk criteria are (1) a big enough size and (2) its effect on cost.

jwnimmer-tri commented 6 months ago

I take it back... the working set will never fit in RAM; we'll always be pulling some stuff from disk.

Let's try r6in.xlarge and see how it goes.

williamjallen commented 6 months ago

The cache server has been rebuilt. For now, I left the stopped instance in AWS in case we need to revert for whatever reason. We can delete it once we're confident that the new cache server is functioning.

A few items that came up during the process:

BetsyMcPhail commented 6 months ago

Cache server update seems successful so far, closing.