"metal_device: Destruction complete", but device is not deleted

equinix / terraform-provider-equinix

Terraform Equinix provider

https://deploy.equinix.com/labs/terraform-provider-equinix/

MIT License

"metal_device: Destruction complete", but device is not deleted #179

Open jbg opened 3 years ago

jbg commented 3 years ago

Terraform Version

Terraform v0.15.5
on darwin_amd64
+ provider registry.terraform.io/equinix/metal v2.1.0

Affected Resource(s)

metal_device

Debug Output

This issue happened in production where we were not running with debug output. We have now temporarily enabled debug output on the affected system and will update this issue with debug logs if it happens again.

Expected Behavior

If the metal_device is successfully deleted, the actual device should be deleted at Equinix Metal. If the actual device at Equinix Metal is not successfully deleted, the metal_device should not be deleted, so that it remains in the state and deletion can be attempted again.

Actual Behavior

Terraform reported the metal_device as destroyed and removed it from the state, but the device was not deleted at Equinix Metal.

Important Factoids

I can provide the name of the affected device, and the relevant timestamps, privately if it's useful.

displague commented 3 years ago

I'm marques.packet in the Equinix Metal Community Slack, @jbg.

Was this a reserved hardware instance? Could you provide any details from the device timeline if it is still available?

I can research the timeline if you don't have the message available to share on GH.

displague commented 3 years ago

Another possibility: was the resource locked?

jbg commented 3 years ago

It was an on-demand instance. It has long since been deleted, so we don’t have the timeline, but I will search our TF logs to find the instance ID. I just haven’t had a chance yet.

The instance wasn’t locked (unless there is some way it can become locked without us locking it via console or API) but anyway I would expect the TF provider to fail to delete in that case, rather than silently succeed and remove it from the state. There should ideally be no way for TF to think the instance is deleted when it is in fact still running.

Whatever the issue was, it must be intermittent, because the exact same provider version and TF config has created and deleted several more instances since then without hitting the issue. Does the provider handle HTTP errors correctly? Maybe there was a brief API outage when it tried to delete the server.
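
To make the expectation above concrete, here is a minimal Go sketch of a "verify before forgetting" delete. It is not the provider's code: `deviceClient`, `destroyAndVerify`, and the status-returning `Delete`/`Get` functions are hypothetical names invented for illustration. The idea is that the resource is only considered gone once a follow-up lookup returns 404. Any other outcome returns an error, so the resource would stay in the state and the delete could be retried.

```go
package main

import (
	"fmt"
	"net/http"
)

// deviceClient is a hypothetical stand-in for an API client. Each function
// returns the HTTP status code of the call, or a transport-level error.
type deviceClient struct {
	Delete func(id string) (status int, err error)
	Get    func(id string) (status int, err error)
}

// destroyAndVerify returns nil only when the DELETE call succeeded (2xx)
// and a follow-up lookup confirms the device is gone (404). Every other
// outcome is an error, so the caller should keep the resource in state.
func destroyAndVerify(c deviceClient, id string) error {
	status, err := c.Delete(id)
	if err != nil {
		return fmt.Errorf("delete %s: %w", id, err)
	}
	if status < 200 || status >= 300 {
		return fmt.Errorf("delete %s: unexpected status %d", id, status)
	}

	status, err = c.Get(id)
	if err != nil {
		return fmt.Errorf("verify %s: %w", id, err)
	}
	if status != http.StatusNotFound {
		return fmt.Errorf("verify %s: device still visible (status %d)", id, status)
	}
	return nil
}
```

This sketch is deliberately strict. If deletion is asynchronous on the API side, a real implementation would probably need to poll the lookup with a timeout instead of checking once, and it would need a separate path for devices that were already removed outside Terraform.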

displague commented 3 years ago

> Does the provider handle HTTP errors correctly? Maybe there was a brief API outage when it tried to delete the server.

Errors like 403 and 404 are treated as a successful delete because they imply that the project or device was already removed. It's possible that you received one of these errors for a different reason, on the API side or the user side. An API key rotation, perhaps?
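
As a rough illustration of the behavior described above (not the provider's actual code), the Go sketch below classifies the status of a DELETE response. `classifyDelete` and `removeFromState` are hypothetical names. It shows why a 403 or 404 returned for an unrelated reason would look the same as a confirmed deletion from Terraform's point of view.

```go
package main

import "net/http"

// deleteOutcome classifies the HTTP status of a DELETE request.
type deleteOutcome int

const (
	deleted     deleteOutcome = iota // 2xx: the API accepted the delete
	assumedGone                      // 403/404: resource assumed to be already removed
	failed                           // anything else: the delete did not succeed
)

// classifyDelete maps a status code to an outcome, treating 403 and 404
// as "already gone", as described in the comment above.
func classifyDelete(status int) deleteOutcome {
	switch {
	case status >= 200 && status < 300:
		return deleted
	case status == http.StatusForbidden || status == http.StatusNotFound:
		return assumedGone
	default:
		return failed
	}
}

// removeFromState reports whether the resource would be dropped from the
// Terraform state under this policy. Note that assumedGone and deleted
// are indistinguishable to the caller.
func removeFromState(status int) bool {
	return classifyDelete(status) != failed
}
```

Under this policy `removeFromState(403)` and `removeFromState(204)` both return true, while a 500 keeps the resource in the state so the delete can be retried.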

jbg commented 3 years ago

The API key was not rotated at any time near this occurrence. There's no proxy involved on our side, so any HTTP response came directly from Metal.