While investigating the AWS spend, we found the following clean-up issues:
- focal-packaging
- draketest-3-dev
Approximate average monthly cost from Oct 2022 - March 2023:

Total: $9,219
- EC2: $6,317
- EC2 static volumes (license, Jenkins, girder, CDash, cache): $1,394 (2)
- Relational Database Service: $927 (1)
- Other: $548

EC2 breakdown ($6,317):
- Static instances (license, Jenkins, girder, CDash, cache): $3,725
- Weekly: $212
- Nightly: $840
- Continuous: $595
- Pre-Merge/Experimental: $945

EC2 static instances ($3,725):
- drake-license: $86.63
- drake-jenkins: $693.06
- drake-mongo: $86.63
- drake-webdav (cache): $2,079.17
- drake-cdash: $693.06 (3)
- drake-girder: $86.63
(1) Where is the Relational Database Service used?
(2) The cache server volume is 90% of this cost. Its size has been reduced from 16000 GB to ~1000 GB, so the cost should be greatly reduced going forward.
(3) This is currently an m5n.4xlarge; we should be able to switch to a less expensive instance type.
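As a sanity check, the per-instance numbers do roll up to the subtotals above; a quick sketch of the arithmetic (values copied from the breakdown, nothing new assumed):

```python
# Rough tally of the monthly EC2 figures above (all values in USD/month).
static_instances = {
    "drake-license": 86.63,
    "drake-jenkins": 693.06,
    "drake-mongo": 86.63,
    "drake-webdav (cache)": 2079.17,
    "drake-cdash": 693.06,
    "drake-girder": 86.63,
}
build_categories = {
    "Weekly": 212,
    "Nightly": 840,
    "Continuous": 595,
    "Pre-Merge/Experimental": 945,
}

static_total = sum(static_instances.values())               # ~$3,725
ec2_total = static_total + sum(build_categories.values())   # ~$6,317
print(f"Static instances: ${static_total:,.2f}")
print(f"EC2 total:        ${ec2_total:,.2f}")
```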
@jwnimmer-tri I updated the spot instance potential savings above. Let me know if you need any more information; otherwise, I'll pass this issue off to you.
> drake-cdash: $693.06 (3)
> (3) This is currently an m5n.4xlarge; we should be able to switch to a less expensive instance type.
This sounds good. Assuming it's a pretty quick change, please go ahead.
I'll see if I can get more info from AWS consultants about a cheaper way to do the cache server.
The rest, we'll leave alone.
Link to "Recommendations for EC2 Instances" https://us-east-1.console.aws.amazon.com/compute-optimizer/home?region=us-east-1#/resources-lists/ec2
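For anyone without console access, roughly the same data can be pulled programmatically; a minimal boto3 sketch (assuming Compute Optimizer is enabled for the account and the caller has the usual read permissions):

```python
import boto3

# Sketch: pull Compute Optimizer's right-sizing recommendations for the
# account's EC2 instances (same data as the console link above).
client = boto3.client("compute-optimizer", region_name="us-east-1")

resp = client.get_ec2_instance_recommendations()
for rec in resp["instanceRecommendations"]:
    current = rec["currentInstanceType"]
    finding = rec["finding"]  # e.g. OVER_PROVISIONED or OPTIMIZED
    options = [opt["instanceType"] for opt in rec["recommendationOptions"]]
    print(f"{rec['instanceArn']}: {current} ({finding}) -> {options}")
```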
I spoke to one of the maintainers of CDash. A new version of CDash is going to be released soon. He suggested that we wait until that is available and do the upgrade at the same time as moving to a different instance type.
Checking the EC2 dashboard for our cache server, it says:
Following the link, it says:
From f2f: I think it's worth trying r6g.8xlarge and seeing what happens.
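For reference, a minimal sketch of how a type swap like this is usually done with boto3, assuming the new type is compatible with the instance's current AMI architecture (an x86-to-Graviton move would instead require rebuilding from an arm64 image); the instance ID is a placeholder:

```python
import boto3

# Sketch: switch an instance's type in place (stop, modify, start).
# The instance ID is a placeholder for the cache server.
ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r6g.8xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```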
> I spoke to one of the maintainers of CDash. A new version of CDash is going to be released soon. He suggested that we wait until that is available and do the upgrade at the same time as moving to a different instance type.
I've moved this request to #19605 instead.
Last night we changed the cache server to the smaller instance type. Unless we see any problems with the new server in the next few days, we can close this ticket.
Seems like smooth sailing. Calling this finished.
We're going to revisit the cache server cost.
The cache server currently costs about $1200/month for an r6g.8xlarge instance with a dedicated 12 Gb/s network connection, 32 vCPUs, and 256 GB of RAM.
Here's a basic overview of the instance's network traffic for the last week, binned into 5-minute intervals:
Based on some rough estimates, that means that even at peak times the network connection is almost never saturated. CPU usage hovers around 5% on average, and no memory usage statistics are provided for the instance, so it is hard to estimate how much is actually used.
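For anyone who wants to reproduce that estimate, here's one way to pull the same per-instance network metrics with boto3 and compare each 5-minute bin against the 12 Gb/s link; the instance ID is a placeholder:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Sketch: fetch a week of NetworkOut for the cache server in 5-minute bins
# and flag any bin that averages more than half of the 12 Gb/s link.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

resp = cloudwatch.get_metric_data(
    StartTime=start,
    EndTime=end,
    MetricDataQueries=[{
        "Id": "net_out",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EC2",
                "MetricName": "NetworkOut",  # bytes sent per period
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            },
            "Period": 300,  # 5-minute bins
            "Stat": "Sum",
        },
    }],
)

result = resp["MetricDataResults"][0]
link_gbps = 12.0
for ts, total_bytes in zip(result["Timestamps"], result["Values"]):
    gbps = total_bytes * 8 / 300 / 1e9  # bytes per bin -> average Gb/s
    if gbps > 0.5 * link_gbps:
        print(f"{ts}: {gbps:.2f} Gb/s ({gbps / link_gbps:.0%} of the link)")
```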
Based on the resource information available, it seems like an instance type with a "bursty" network connection and only a few vCPUs would be more appropriate. See the information here about instance specs.
A couple of instance types in particular look like they could be a better fit:
- r6in.xlarge: $0.34866/hr (~21% of the current cost), 4 vCPUs, 32 GB RAM, baseline bandwidth of 6.25 Gb/s, burst bandwidth of 30 Gb/s
- x2iedn.xlarge: $0.83363/hr (~51% of the current cost), 4 vCPUs, 128 GB RAM, baseline bandwidth of 1.875 Gb/s, burst bandwidth of 25 Gb/s

It's worth noting that the x2iedn.xlarge type also has a 1x118 GB NVMe drive attached, which will presumably offer better disk I/O performance.
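For context, the hourly prices above translate into rough monthly figures like this (back-of-the-envelope, assuming on-demand pricing and ~730 hours/month, with the current cost taken from the ~$1200/month figure above):

```python
# Back-of-the-envelope monthly cost comparison (on-demand, 730 hours/month).
HOURS_PER_MONTH = 730
current_monthly = 1200.0  # r6g.8xlarge, from the figure above

candidates = {
    "r6in.xlarge": 0.34866,
    "x2iedn.xlarge": 0.83363,
}
for name, hourly in candidates.items():
    monthly = hourly * HOURS_PER_MONTH
    print(f"{name}: ~${monthly:,.0f}/month ({monthly / current_monthly:.0%} of current)")
```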
We won't really know what we need or what is best until we try it, but given the potential for massive savings, I think it would be worthwhile to give r6in.xlarge a try, and then increase the size later on if the reduced memory proves to be a bottleneck.
To help guide the discussion...
The criteria that probably most directly affect our performance:
Note that local disk speed is probably not relevant; the only relevant aspects of the disk are (1) a big enough size and (2) its effect on cost.
I take it back... the working set will never fit in RAM, so we'll always be pulling some data from disk.
Let's try r6in.xlarge and see how it goes.
The cache server has been rebuilt. For now, I left the stopped instance in AWS in case we need to revert for whatever reason. We can delete it once we're confident that the new cache server is functioning.
A few items that came up during the process:
Cache server update seems successful so far, closing.
With an eye towards cost-reducing our CI spend, I think the first task would be to collect a survey of where the AWS money is going.
The initial step is to check the costs for EC2 hours vs. S3 vs. I/O, etc.
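One low-effort way to get that first cut is the Cost Explorer API rather than clicking through the console; a minimal boto3 sketch (the date range is illustrative):

```python
import boto3

# Sketch: monthly unblended cost per AWS service, which gives the
# EC2-vs-S3-vs-data-transfer breakdown asked for above.
ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-10-01", "End": "2023-04-01"},  # illustrative
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for month in resp["ResultsByTime"]:
    print(month["TimePeriod"]["Start"])
    for group in month["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount > 100:  # skip the long tail of small charges
            print(f"  {service}: ${amount:,.2f}")
```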
Assuming the big-ticket item is EC2 hours, we then try to break that down by category: Experimental vs. Continuous vs. Nightly, Provisioned vs. Unprovisioned, Everything vs. OSS, etc. We think Jenkins probably has logs / timings of how long each build ran, which is a fair approximation of EC2 hours. Also, we have some 24/7 servers (i.e., not build runners) that should appear in the tally.
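On the Jenkins side, build durations are available through the standard JSON remote API; a rough sketch of tallying machine-hours per job (the server URL and job name are placeholders, and authentication may be needed depending on access settings):

```python
import requests

# Sketch: tally build hours per Jenkins job as a proxy for EC2 runner hours.
# The URL and job name are placeholders; add auth=(user, api_token) if needed.
JENKINS = "https://jenkins.example.com"
jobs = ["example-continuous-job"]

for job in jobs:
    resp = requests.get(
        f"{JENKINS}/job/{job}/api/json",
        params={"tree": "builds[number,duration,result]"},
        timeout=30,
    )
    resp.raise_for_status()
    builds = resp.json()["builds"]
    hours = sum(b["duration"] for b in builds) / 1000 / 3600  # duration is in ms
    print(f"{job}: {len(builds)} builds, ~{hours:.1f} machine-hours")
```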
The victory condition here is a write-up that identifies the biggest-ticket items driving our CI spend.