AWS - VPC deletion fails because of dependencies but deletes just fine in the AWS console

nwmahoney commented 5 years ago

[Screenshot: Screen Shot 2019-06-07 at 10 02 25 AM]

Deleting this VPC worked just fine in the AWS console using the same account.

genevieve commented 5 years ago

Hey Nick!

QQs:

Did you try running leftovers more than once?
Were there possibly any dependencies that you had deleted (like a vm in that vpc) that could have taken a while to be truly marked as "deleted" by aws?

Occasionally, resources are still considered dependencies because the iaas can take a while to consider them deleted. This is more of a problem when you delete something like the vm in the aws console and then try to delete the network with leftovers. If you delete the vm with leftovers first, leftovers will wait until the vm is truly marked as gone before trying to delete the vpc. If you delete it in the console first, the status of the vm will be "deleting", so leftovers won't try to delete it itself and will proceed straight to the vpc.
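To make the ordering above concrete, here is a minimal sketch of "delete the vm, wait until it is truly gone, then delete the vpc". It is not how leftovers is implemented; the function names and the injected callables (`get_vm_state`, `delete_vm`, `delete_vpc`) are hypothetical, and it assumes the EC2 instance state name `terminated` marks a vm as gone.

```python
import time


def wait_until_terminated(get_state, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll get_state() until it reports 'terminated'. Return True if it did."""
    for _ in range(attempts):
        if get_state() == "terminated":
            return True
        sleep(delay)
    return False


def delete_vm_then_vpc(get_vm_state, delete_vm, delete_vpc, **wait_kwargs):
    """Delete the vm, wait for it to be truly gone, then delete the vpc."""
    delete_vm()
    if not wait_until_terminated(get_vm_state, **wait_kwargs):
        # Deleting the vpc now would most likely fail on a dependency error.
        raise TimeoutError("vm is still not terminated")
    delete_vpc()
```

If the vm was already put into a deleting state from somewhere else (the console, or a failed bosh run), only the waiting step would apply. The callables are injected so the ordering can be exercised without touching a real account.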

nwmahoney commented 5 years ago

We ran leftovers thrice in the pipeline (this build and the previous two), and once manually. Then we deleted in the console right after that and it worked. We hadn't done anything manual, and I think the pipeline just interacts with AWS through bbl and leftovers. I don't think there was any manual deletion.

nwmahoney commented 5 years ago

P.S. Hi Gen!

genevieve commented 5 years ago

Hi Nick!!!

Alright, what I'm gathering:

The first failure is in build #147 where bosh fails to delete the jumpbox.
The next failure is in build #148, where leftovers reports that it deleted the jumpbox, but then proceeds to fail trying to delete the vpc.
The next failure is in build #149, about 40 minutes later, and proceeds to fail to delete the vpc.
The next failure is in build #150, about 30 minutes later, and proceeds to fail to delete the vpc.
When you delete the vpc in the console successfully, the next build passes.

Since this is the first time this is happening, I'm wondering what conditions make this situation unique. Are there perhaps any new resource types that we are creating in the bbl aws terraform templates that leftovers doesn't know to delete?

The aws api is returning a 400 saying there are dependencies, but I know that the console lets you delete the vpc regardless, as long as those dependencies are of certain types.

Maybe a question we can get an answer to is: can the aws api give us a better error message about what dependencies exist?

Alternatively, we could narrow down which dependencies get cleaned up when you delete the vpc in the console but block the deletion when you delete the vpc through the api.
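One way to approach the "what dependencies exist" question is to ask EC2 directly for resources that still reference the vpc. The sketch below is only a starting point, not a complete list of what can block deletion. It assumes `ec2` is a boto3 EC2 client (`boto3.client("ec2")`); the function name is hypothetical and pagination is ignored.

```python
def list_vpc_dependencies(ec2, vpc_id):
    """Return {resource kind: [ids]} for resources still attached to vpc_id.

    `ec2` is assumed to be a boto3 EC2 client. Pagination is ignored.
    """
    by_vpc = [{"Name": "vpc-id", "Values": [vpc_id]}]
    by_attachment = [{"Name": "attachment.vpc-id", "Values": [vpc_id]}]

    subnets = ec2.describe_subnets(Filters=by_vpc)["Subnets"]
    enis = ec2.describe_network_interfaces(Filters=by_vpc)["NetworkInterfaces"]
    groups = ec2.describe_security_groups(Filters=by_vpc)["SecurityGroups"]
    gateways = ec2.describe_internet_gateways(Filters=by_attachment)["InternetGateways"]

    return {
        "subnets": [s["SubnetId"] for s in subnets],
        # The description often hints at what owns the interface.
        "network_interfaces": [
            (n["NetworkInterfaceId"], n.get("Description", "")) for n in enis
        ],
        # The vpc's default security group goes away with the vpc, so skip it.
        "security_groups": [
            g["GroupId"] for g in groups if g.get("GroupName") != "default"
        ],
        "internet_gateways": [g["InternetGatewayId"] for g in gateways],
    }
```

Calling `list_vpc_dependencies(boto3.client("ec2"), "vpc-...")` right after a failed delete and comparing the result with what leftovers listed would show which resource types it missed.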

rowanjacobs commented 5 years ago

Looks like this is still happening. Can you post the output of aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0aa226bc62915c540"? @nmahoney-pivotal

genevieve commented 5 years ago

It's interesting that it always happens after a bosh failure like this...

rowanjacobs commented 5 years ago

That link doesn't work! 😔 But I think I can see the one you're referring to from earlier concourse logs.

This is beginning to remind me of the early bbl on OpenStack failures where incompletely creating a VM would result in a "port" (OpenStack static private IP allocation) being left behind without a VM attached to it. I wonder if there was a recent AWS CPI change that created a similar kind of shadow resource that doesn't get cleaned up by leftovers, and if so what that would be.

EDIT: on looking at the logs again I actually can't tell if BOSH even tried to create a VM. So maybe this whole hypothesis is entirely off-base and there's actually a Terraform problem.

nwmahoney commented 5 years ago

@rowanjacobs I thought I sent this before... I guess not. Here's that output:

$ aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0aa226bc62915c540"
{
    "Subnets": []
}

genevieve commented 4 years ago

Closing this issue until we see it again (if we do) and can access the environment to debug.

genevieve commented 4 years ago

It appears that occasionally there are load balancers deployed to a network whose names do not contain the full filter string. For instance, bbl sometimes crops the environment name when creating resources so that the names fit the length limit. https://github.com/cloudfoundry/bosh-bootloader/blob/a1f38c83bd02f71bab4dea46ce4cae86336969ff/terraform/aws/templates/concourse_lb.tf#L56

Leftovers doesn't return them in the list because their names do not contain the full filter string, like bump-deployments-aws-concourse. The load balancer might have been created with just bump-deplo as its prefix.

Working to see if we can use the vpcId or subnetId to delete these resources.
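A small sketch of why the name filter misses these load balancers and how matching on the vpc id would catch them. This is an illustration, not the leftovers implementation: the helper names and the sample load balancer name `bump-deplo-concourse-lb` are hypothetical. The dictionaries assume the shape of entries returned by the AWS load balancer describe calls, where classic ELB entries carry `VPCId` and ELBv2 entries carry `VpcId`.

```python
def matches_name_filter(lb, name_filter):
    """Name-based selection: true only if the full filter string is in the name."""
    return name_filter in lb["LoadBalancerName"]


def in_vpc(lb, vpc_id):
    """Vpc-based selection; classic ELB uses 'VPCId', ELBv2 uses 'VpcId'."""
    return vpc_id in (lb.get("VPCId"), lb.get("VpcId"))


# Hypothetical sample data: the environment name was cropped when the
# load balancer was created, so only the 'bump-deplo' prefix survives.
load_balancers = [
    {"LoadBalancerName": "bump-deplo-concourse-lb", "VpcId": "vpc-0aa226bc62915c540"},
    {"LoadBalancerName": "unrelated-lb", "VpcId": "vpc-00000000000000000"},
]

name_filter = "bump-deployments-aws-concourse"
by_name = [lb for lb in load_balancers if matches_name_filter(lb, name_filter)]
by_vpc = [lb for lb in load_balancers if in_vpc(lb, "vpc-0aa226bc62915c540")]
```

Here `by_name` comes back empty, while `by_vpc` contains the cropped-name load balancer, which is the one that would otherwise be left behind as a dependency of the vpc.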

genevieve / leftovers

AWS - VPC deletion fails because of dependencies but deletes just fine in the AWS console #86