defenseunicorns / delivery-aws-iac

Apache License 2.0

Periodically destroy all dev/test AWS resources #89

Closed RothAndrew closed 5 months ago

RothAndrew commented 1 year ago

Persona

I'm a maintainer of this repo. I'm submitting this on behalf of Defense Unicorns leadership, who want to ensure that the money we spend in our dev/test AWS account(s) is being spent well.

Description

Periodically (frequency TBD), automatically destroy all resources in our dev/test AWS account that aren't specifically identified as being permanent resources.

Use Case

This is needed because we frequently get orphaned resources in our AWS account. A big part of what we do is making rapid changes to Terraform code. We test those changes frequently, and when tests fail, there is a chance that the resources don't get cleaned up properly.

Impact

According to the billing console, the stuff that is running in the account right now is costing about $100 per day. I don't believe we have any tests actively running in the account right now, so the likelihood is that most of that $100 per day is from orphaned resources that haven't been cleaned up yet.

The impact is that we either continue to "light dollar bills on fire" or force members of the team to keep going through and deleting resources manually, which is labor-intensive and prone to mistakes.

Completion

Additional Context


Original description:

My session token expired in the middle of an apply and I lost the terraform state. I'm now going through and having to delete hundreds of things manually.

The AWS account we are using doesn't have anything permanent in it. We should set up the ability to nuke all resources in the account (with perhaps just a few exceptions, like the GitHub Actions auth provider and role).

https://github.com/rebuy-de/aws-nuke works well for this kind of thing.

RothAndrew commented 1 year ago

Had 2 more instances since writing this where orphaned resources were created and I had to spend time deleting them manually.

RothAndrew commented 1 year ago

https://github.com/lianghong/delete_vpc was helpful, though it didn't work out of the box. I had to update the script a little bit to get it to work.

wirewc commented 1 year ago

Agreed, it doesn't work out of the box. However, it did help find the dependencies that had issues. Ask Blake about the script we found that may help a bit more with deleting resources.

ntwkninja commented 1 year ago

+1 for the periodic nuke

recommend having people tag resources with narwhal-<person's name> and we exclude those from the nuke

we'll also need to add a periodic shaming to ask if peeps have things that aren't actively being developed (possibly at the end of retro)

ntwkninja commented 1 year ago

we can also look into automatically suspending resources based on time of day (i.e. a certain list of resources tagged with narwhal-* get automatically suspended at 6pm MST)
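
A rough sketch of the time-of-day check suggested here, not a working suspender. It assumes the `narwhal-*` pattern applies to tag keys, treats MST as a fixed UTC-7 offset (no daylight saving), and uses a hypothetical helper name `should_suspend`. How a resource is actually suspended, and when it resumes, is left open.

```python
from datetime import datetime, time, timedelta, timezone
from fnmatch import fnmatchcase

# Assumption: MST as a fixed UTC-7 offset, ignoring daylight saving.
MST = timezone(timedelta(hours=-7), "MST")
SUSPEND_AT = time(18, 0)  # 6pm MST, as suggested in the comment above


def should_suspend(tag_keys, now_utc):
    """Return True if a resource has a narwhal-* tag key and it is 6pm MST or later.

    Hypothetical helper: only decides *whether* to suspend, not how.
    """
    tagged = any(fnmatchcase(key, "narwhal-*") for key in tag_keys)
    local_time = now_utc.astimezone(MST).time()
    return tagged and local_time >= SUSPEND_AT


# 01:30 UTC on Mar 14 is 18:30 MST on Mar 13.
evening = datetime(2023, 3, 14, 1, 30, tzinfo=timezone.utc)
# 00:30 UTC on Mar 14 is 17:30 MST on Mar 13.
afternoon = datetime(2023, 3, 14, 0, 30, tzinfo=timezone.utc)

print(should_suspend(["narwhal-andy"], evening))    # True
print(should_suspend(["narwhal-andy"], afternoon))  # False
print(should_suspend(["team"], evening))            # False
```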

ntwkninja commented 1 year ago

@mjnagel do you have insights into what BB / P1 is using to manage dev costs?

RothAndrew commented 1 year ago

+1 for the periodic nuke

recommend having people tag resources with narwhal-<person's name> and we exclude those from the nuke

we'll also need to add a periodic ~shaming~ to ask if peeps have things that aren't actively being developed (possibly at the end of retro)

Works for me, though let's use something more generic, like "is-keeper == true"
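
To make the exclusion rule concrete, here is a minimal sketch of the "is-keeper == true" check. It assumes each resource is represented as a dict with a `tags` mapping. That shape and the `split_keepers` helper are hypothetical and are not part of aws-nuke or any AWS SDK.

```python
from typing import Iterable

KEEPER_TAG = "is-keeper"  # tag name proposed in this thread


def is_keeper(tags: dict) -> bool:
    """Return True when the resource is tagged is-keeper == true."""
    return str(tags.get(KEEPER_TAG, "")).strip().lower() == "true"


def split_keepers(resources: Iterable[dict]):
    """Partition resources into (keep, destroy) lists based on the keeper tag."""
    keep, destroy = [], []
    for resource in resources:
        target = keep if is_keeper(resource.get("tags", {})) else destroy
        target.append(resource)
    return keep, destroy


# Illustrative data only; the IDs are made up.
resources = [
    {"id": "vpc-1", "tags": {"is-keeper": "true"}},
    {"id": "vpc-2", "tags": {}},
    {"id": "eks-1", "tags": {"is-keeper": "false"}},
]
keep, destroy = split_keepers(resources)
print([r["id"] for r in keep])     # ['vpc-1']
print([r["id"] for r in destroy])  # ['vpc-2', 'eks-1']
```

Anything without the tag lands in the destroy list, which matches the "nuke unless marked permanent" default described in the issue.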

mjnagel commented 1 year ago

@mjnagel do you have insights into what BB / P1 is using to manage dev costs?

From a quick check with them, it sounds like they use kubecost/opencost to some extent for monitoring, plus homegrown scripts run on Lambda for cleanup. They're looking at getting kubecost SMEs to help configure things better for them.

wirewc commented 1 year ago

@RothAndrew dumb question. Wipe every Monday at 1 am EST? Total account nuke no matter what? Or make it 3:30 am EST?

RothAndrew commented 1 year ago

@RothAndrew dumb question. Wipe every Monday at 1 am EST? Total account nuke no matter what? Or make it 3:30 am EST?

I don't particularly care how often it is done or what time of day it is done (as long as it is sometime in the dead of night for both east coast and west coast). If it doesn't cause significant disruptions we might as well do it daily.

If it does cause significant disruptions we shouldn't do it at all until we have figured out how to do it without causing significant disruptions.
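
One way to sanity-check a proposed run time against the "dead of night for both east coast and west coast" requirement is sketched below. It assumes fixed offsets (EST = UTC-5, PST = UTC-8, no daylight saving) and assumes "dead of night" means local hours 00:00 through 04:59. Both the window and the helper names are illustrative choices, not something agreed in this thread.

```python
from datetime import datetime, timedelta, timezone

# Assumptions: fixed standard-time offsets, daylight saving ignored.
EST = timezone(timedelta(hours=-5), "EST")
PST = timezone(timedelta(hours=-8), "PST")
NIGHT_START, NIGHT_END = 0, 5  # assumed window: local 00:00 to 04:59


def is_dead_of_night(now_utc):
    """True if the given UTC time falls inside the night window on both coasts."""
    return all(
        NIGHT_START <= now_utc.astimezone(tz).hour < NIGHT_END
        for tz in (EST, PST)
    )


def candidate_utc_hours():
    """List the UTC hours (0-23) that are night on both coasts."""
    base = datetime(2023, 1, 2, tzinfo=timezone.utc)  # arbitrary Monday
    return [h for h in range(24) if is_dead_of_night(base + timedelta(hours=h))]


print(candidate_utc_hours())  # [8, 9]

# The two times floated above, converted to UTC:
one_am_est = datetime(2023, 1, 2, 6, 0, tzinfo=timezone.utc)          # 1:00 am EST
three_thirty_am_est = datetime(2023, 1, 2, 8, 30, tzinfo=timezone.utc)  # 3:30 am EST
print(is_dead_of_night(one_am_est))           # False (10 pm PST)
print(is_dead_of_night(three_thirty_am_est))  # True (12:30 am PST)
```

Under these assumptions, 3:30 am EST fits the window on both coasts while 1 am EST is still 10 pm on the west coast.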