MozillaFoundation / plan

What the MoFo production teams are working on
https://build.mozillafoundation.org

Reduce run costs on AWS and Heroku #366

Closed simonwex closed 8 years ago

simonwex commented 9 years ago

We suspect our monthly AWS and Heroku costs are higher than required. Let's do an inventory of spending across each of our properties and propose (when required) or implement optimizations/reductions.

Solution vectors:

Phase: Build / Ship
Owner: @jdotpz
Decision: @simonwex
Lead design: N/A
Dev: @jbuck
Quality: @simonwex

jdotpz commented 9 years ago

I've trimmed off a bunch of unused volumes and killed four RDS instances that we either don't really need or whose purpose we don't know (phabricator). We took final snapshots, of course, so if we hear any screams, we'll be able to restore.

The plan moving forward:

1) Combine nearly all staging EC2 nodes into one big honkin' cluster (maybe two), with separate ELBs pointing to distinct node ports per app.
2) Put some selected apps (based on priority, current load averages, and how often they've been updated recently) into smaller clusters. For example, Badgekit, Badgekit-api, and Openbadges into one cluster, and maybe Goggles, Thimble, and Butter into another. Meanwhile, we'll keep central, often-changed apps separate, such as Webmakerorg, Makeapi, and Login.
3) Reserve a swath of instances for these new clusters with the no-upfront cost option recently announced by AWS.
4) Sip champagne on a beach.
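As a rough illustration of step 1, here is a minimal sketch of giving each app in a shared cluster its own node port, so that a separate ELB per app can target that port. The cluster names (`badges`, `tools`) and the base port `8000` are assumptions made up for this example; they are not taken from the actual manifests.

```python
from typing import Dict, List


def assign_node_ports(
    clusters: Dict[str, List[str]], base_port: int = 8000
) -> Dict[str, Dict[str, int]]:
    """Give every app in a cluster a distinct port on the shared nodes.

    base_port is a hypothetical starting port chosen for illustration.
    """
    layout: Dict[str, Dict[str, int]] = {}
    for cluster, apps in clusters.items():
        # Ports only need to be unique within a cluster, because each
        # cluster runs on its own set of nodes.
        layout[cluster] = {app: base_port + i for i, app in enumerate(apps)}
    return layout


# Hypothetical grouping, loosely following the examples in the plan.
clusters = {
    "badges": ["badgekit", "badgekit-api", "openbadges"],
    "tools": ["goggles", "thimble", "butter"],
}
layout = assign_node_ports(clusters)
```

Each app's ELB would then forward to its entry in `layout`, for example `layout["badges"]["badgekit-api"]`.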

simonwex commented 9 years ago

@jdotpz could you post an update? How was the champagne?

jdotpz commented 9 years ago

The manifests should be set up to consolidate a bunch of our apps into clusters; this week I'll be launching first the staging and then the production versions.

Meanwhile, we scaled a ton of instances down last week.

I'm working on the costing server, but so far it still seems hosed.

The app sounds like it should be working fine:

2015-03-16 16:30:26,643 [cost_daily_elasticache] INFO basic.BasicDataManager - cost_daily_elasticache start polling...
2015-03-16 16:30:26,643 [cost_monthly_elasticache] INFO basic.BasicDataManager - cost_monthly_elasticache start polling...
2015-03-16 16:30:32,974 [usage_monthly_glacier] INFO basic.BasicDataManager - usage_monthly_glacier start polling...
2015-03-16 16:30:32,975 [usage_hourly_glacier] INFO basic.BasicDataManager - usage_hourly_glacier start polling...
2015-03-16 16:30:32,975 [usage_daily_glacier] INFO basic.BasicDataManager - usage_daily_glacier start polling...
2015-03-16 16:30:32,975 [cost_monthly_glacier] INFO basic.BasicDataManager - cost_monthly_glacier start polling...
2015-03-16 16:30:32,975 [usage_weekly_glacier] INFO basic.BasicDataManager - usage_weekly_glacier start polling...
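One way to check which pollers have at least started is to parse log lines in the format shown above. This is only a sketch: the regular expression is inferred from the lines quoted here, and a "start polling..." message says a poller was launched, not that it is producing data.

```python
import re

# Pattern inferred from the quoted log lines: timestamp, [thread], level,
# logger name, then the message after " - ".
LOG_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+) "
    r"(?P<logger>\S+) - (?P<message>.*)"
)


def started_pollers(lines):
    """Return the thread names whose message reports 'start polling...'."""
    names = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("message").endswith("start polling..."):
            names.append(match.group("thread"))
    return names


sample = [
    "2015-03-16 16:30:26,643 [cost_daily_elasticache] INFO "
    "basic.BasicDataManager - cost_daily_elasticache start polling...",
    "2015-03-16 16:30:32,974 [usage_monthly_glacier] INFO "
    "basic.BasicDataManager - usage_monthly_glacier start polling...",
]
pollers = started_pollers(sample)
```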

I've got a new run going today which should take a few hours, and I'll know more.

jdotpz commented 9 years ago

Here's where we are so far (without cluster benefits or reserved instances yet): ~14% cost reduction.

screen shot 2015-03-16 at 12 02 21 pm

screen shot 2015-03-16 at 12 02 38 pm

jdotpz commented 9 years ago

Today's update: wmappcluster and wmapp2cluster are live. wmappcluster combines goggles, popcorn, thimble, and webmaker-profile2; wmapp2cluster combines wmscreenshot, events, events-api, and wmpublisher. All the deploy scripts have been repointed.

screen shot 2015-03-20 at 9 42 23 am

jdotpz commented 9 years ago

We're at about 25% savings so far from where we started. Next week (or over the weekend), I'll knock it down a bunch more by combining 5 apps into 2 clusters (badgekit, badgekit-api, openbadges, badgekit-mozilla, badgekit-api-mozilla). After that, I can likely cluster a number of other apps together using the same cluster architecture, and we'll be able to reserve at least some instances for some really big cost reductions.

simonwex commented 9 years ago

Awesome, thanks for pushing on this, @jdotpz!

jdotpz commented 9 years ago

This morning I migrated all the badging infra to the new clusters, and have scaled down the old clusters. I'll get a cost update for this change tomorrow when our costing bucket gets its files, and I'll have my eye on this infra today.

Over the weekend, I got alerted that we were filling up on logs. wmappcluster was storing a couple of gigs of logs for thimble and goggles, and together they were bringing the drive to 97% full. Since we already send syslog and app logs to loggins, I trimmed all the app rsyslog configs down to "delete each day".
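For a quick check of this kind of disk pressure, a sketch along these lines reports how full a volume is and lists log files older than a day. It only lists candidates and deletes nothing. It does not model the rsyslog change itself, and the one-day threshold is an assumption that mirrors the "delete each day" policy.

```python
import os
import shutil
import time


def disk_percent_used(path="/"):
    """Percentage of the volume holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def stale_logs(log_dir, max_age_seconds=86400, now=None):
    """List files in log_dir last modified more than max_age_seconds ago."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(log_dir):
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.path)
    return sorted(stale)
```

Calling `disk_percent_used("/")` before and after a cleanup gives a rough before/after figure to compare against an alert threshold.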

jdotpz commented 9 years ago

screen shot 2015-03-24 at 9 46 50 am

jdotpz commented 9 years ago

So far, ~26% reduction in cost prior to reserving instances.
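For reference, a reduction figure like this is the drop relative to the starting monthly bill. The dollar amounts below are hypothetical placeholders chosen to reproduce a 26% reduction; the thread does not state the real totals.

```python
def percent_reduction(baseline, current):
    """Cost reduction as a percentage of the baseline monthly cost."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - current) / baseline


# Hypothetical monthly totals, for illustration only.
baseline_cost = 10000.0
current_cost = 7400.0
reduction = percent_reduction(baseline_cost, current_cost)
```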

hannahkane commented 9 years ago

Taking this ticket out of the March 13 milestone, since it's passed.

hannahkane commented 8 years ago

@jdotpz @simonwex - is this ticket still useful? Should we assign it to a 2016 milestone?