cloudfoundry / bosh-google-cpi-release

BOSH Google CPI
Apache License 2.0

bosh delete-env deleted environment, but failed #204

Closed akshaymankar closed 6 years ago

akshaymankar commented 7 years ago

We use BOSH director 262.0.0 and CPI 25.9.0.

We were deleting the environment and got the following error.

Starting registry... Finished (00:00:00)

Started deleting deployment
  Waiting for the agent on VM 'vm-9ead214f-aff0-43ae-6afb-3108afba8232'... Finished (00:00:00)
  Stopping jobs on instance 'unknown/0'... Finished (00:00:00)
  Unmounting disk 'disk-be543a2e-ecd4-4a02-787b-fc3a9d3a7910'... Finished (00:00:06)
  Deleting VM 'vm-9ead214f-aff0-43ae-6afb-3108afba8232'... Failed (00:02:33)
Failed deleting deployment (00:02:47)

Stopping registry... Finished (00:00:00)
Cleaning up rendered CPI jobs... Finished (00:00:00)
Deleting deployment:
  Deleting vm in the cloud:
    CPI 'delete_vm' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Deleting vm 'vm-9ead214f-aff0-43ae-6afb-3108afba8232': Failed to delete Google Instance 'vm-9ead214f-aff0-43ae-6afb-3108afba8232': Google Operation 'operation-1499169750738-5537ca8736451-48154fce-c7152856' finished with an error: The resource 'projects/cf-pcf-kubo/zones/us-east1-c/instances/vm-9ead214f-aff0-43ae-6afb-3108afba8232' was not found\n","ok_to_retry":false}

Exit code 1

The director instance was deleted, but the delete-env command failed.
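For anyone triaging a failure like this, the text after `CmdError` in the log is a JSON object with `type`, `message`, and `ok_to_retry` fields. The following is a minimal Go sketch, not part of the CPI or the BOSH CLI, that decodes such a payload and guesses whether it describes a missing resource. The `cpiError` struct, `parseCPIError`, and `looksLikeNotFound` are hypothetical names, and the substring matching is an assumption based only on the two messages quoted in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cpiError mirrors the three fields visible in the CmdError payload in the log.
// The struct itself is hypothetical; only the JSON field names come from the log.
type cpiError struct {
	Type      string `json:"type"`
	Message   string `json:"message"`
	OkToRetry bool   `json:"ok_to_retry"`
}

// parseCPIError decodes the JSON object that follows "CmdError" in the log line.
func parseCPIError(payload string) (cpiError, error) {
	var e cpiError
	err := json.Unmarshal([]byte(payload), &e)
	return e, err
}

// looksLikeNotFound is a heuristic (an assumption, not an official contract):
// it matches the two message fragments that appear in this thread.
func looksLikeNotFound(e cpiError) bool {
	if e.Type != "Bosh::Clouds::CloudError" {
		return false
	}
	return strings.Contains(e.Message, "was not found") ||
		strings.Contains(e.Message, "does not exist")
}

func main() {
	// Shortened example payload with placeholder resource names.
	payload := `{"type":"Bosh::Clouds::CloudError","message":"Deleting vm 'vm-example': The resource 'projects/example/zones/example/instances/vm-example' was not found\n","ok_to_retry":false}`

	e, err := parseCPIError(payload)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("type:", e.Type)
	fmt.Println("ok_to_retry:", e.OkToRetry)
	fmt.Println("looks like a missing resource:", looksLikeNotFound(e))
}
```

Matching on message text is brittle; it is shown here only to make the structure of the error visible.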

tcdowney commented 7 years ago

Similarly, running delete-env when the stemcell image has already been deleted results in:

Deleting stemcell from cloud:
  CPI 'delete_stemcell' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Deleting stemcell 'stemcell-936d63e1-ffb4-45c5-6093-b0ebbf177e07': Google Image 'stemcell-936d63e1-ffb4-45c5-6093-b0ebbf177e07' does not exists: \u003cnil cause\u003e","ok_to_retry":false}

Exit code 1

Instead of failing, I would expect the delete stemcell step to no-op.

johnsonj commented 7 years ago

related/'dupe-ish': #162

Let me look into this. We don't want to get into the business of ignoring errors or explicitly managing recovery of missing resources in the CPI. That's more of a director concern (e.g. bosh cck), but I can see where this state is a pain and easy to get into when developing.

johnsonj commented 7 years ago

@akshaymankar - Old bug, I know, but: in this instance, was the referenced VM deleted out of band?

@tcdowney - For this issue I don't believe there's anything we can responsibly do in the CPI. The BOSH director doesn't tell the CPI that the deployment is going away; it only tells us to delete_stemcell. Ignoring all 'resource does not exist' errors in response to delete_<..> calls from the director opens us up to scenarios where the director is corrupt or confused and tells us to delete VMs by the wrong name. We would say 'uhh sure, it's gone', and entropy would increase instead of the operation failing early and fast.

I'd raise this issue with bosh-director to see if they want to support some sort of 'ignorable errors'.

johnsonj commented 6 years ago

Closing due to age and lack of actionability. Please re-open if this problem crops up again.