equinix / terraform-provider-metal

DEPRECATED Equinix Metal standalone TF provider. Use the unified Equinix Terraform Provider.
https://registry.terraform.io/providers/equinix/equinix/latest/docs/guides/migration_guide_equinix_metal
Mozilla Public License 2.0

metal_device wait_for_deprovision can be raced with a new instance claiming the hardware reservation #208

Closed: displague closed this issue 2 years ago

displague commented 2 years ago

It is possible (seen by real users) for a wait_for_deprovision-enabled metal_device to be destroyed and then reclaimed by an external tool BEFORE the wait_for_deprovision destroy loop detects that the reservation became provisionable. The machine is then unprovisionable again and the destroy hangs until the full 1-hour timeout elapses.

Rather than waiting only for provisionable=true, the StateRefreshFunc could wait for the reservation's instance_id to become empty and for the provisionable flag to be true. The known instance_id would go in the Pending slice of strings that this helper function takes, with a Target of "". If an unknown instance_id is returned by the refresh function, Terraform would fail because of the unexpected value.
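
A minimal sketch of that shape, assuming the packngo client and the terraform-plugin-sdk helper/resource package; the function name, polling intervals, and control flow here are illustrative, not the provider's actual code:

import (
    "time"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
    "github.com/packethost/packngo"
)

// waitForReservationRelease waits until the reservation no longer holds
// instanceID and is provisionable again. Sketch only; names are illustrative.
func waitForReservationRelease(client *packngo.Client, reservationID, instanceID string) error {
    stateConf := &resource.StateChangeConf{
        Pending: []string{instanceID},
        Target:  []string{""},
        Refresh: func() (interface{}, string, error) {
            r, _, err := client.HardwareReservations.Get(reservationID, nil)
            if err != nil {
                return nil, "", err
            }
            switch {
            case r.Device == nil && r.Provisionable:
                return r, "", nil // released and reusable: done
            case r.Device == nil:
                return r, instanceID, nil // still deprovisioning: keep waiting
            default:
                // An unknown device ID is in neither Pending nor Target, so
                // WaitForState fails fast instead of hanging for an hour.
                return r, r.Device.ID, nil
            }
        },
        Timeout:    60 * time.Minute,
        Delay:      10 * time.Second,
        MinTimeout: 5 * time.Second,
    }
    _, err := stateConf.WaitForState()
    return err
}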

enkelprifti98 commented 2 years ago

Adding my description here as well.

Terraform loops until the 1-hour timeout when deleting instances that use hardware_reservation_id and wait_for_reservation_deprovision:

I think we are looking at a race condition where a new instance provision is issued to the same hardware reservation as soon as it becomes provisionable. The TF delete process for the old instance still sees the hardware reservation as NOT provisionable, because the reservation is now provisioning a new instance. The delete loop missed the short window in which the reservation was marked provisionable right after the deletion completed.

During the old instance's delete process, Terraform polls the hardware reservation to check whether it is provisionable; meanwhile the new provision takes over the same reservation and flips the provisionable boolean back to false. The reservation's provisionable boolean therefore goes false -> true -> false within that race window, and the old instance's TF delete process keeps polling until the 1-hour timeout.

metal_device.nodes[0]: Still destroying... [id=0bc3d567-7248-4664-890d-05420312e91f, 59m50s elapsed]
metal_device.nodes[0]: Still destroying... [id=0bc3d567-7248-4664-890d-05420312e91f, 1h0m0s elapsed]
╷
│ Error: timeout while waiting for state to become 'true' (last state: 'false', timeout: 1h0m0s)
│
│
╵
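
For illustration, the destroy loop's refresh amounts to reporting only the provisionable flag as the state string, so a false -> true -> false flip that happens between two polls is invisible to it. A hedged sketch, not the provider's exact code:

import (
    "strconv"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
    "github.com/packethost/packngo"
)

// provisionableRefresh reports only "true"/"false" for the reservation's
// provisionable flag; a brief "true" window between polls is simply lost.
func provisionableRefresh(client *packngo.Client, reservationID string) resource.StateRefreshFunc {
    return func() (interface{}, string, error) {
        r, _, err := client.HardwareReservations.Get(reservationID, nil)
        if err != nil {
            return nil, "", err
        }
        return r, strconv.FormatBool(r.Provisionable), nil
    }
}
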
enkelprifti98 commented 2 years ago

Here's some info on the states of a hardware reservation's lifecycle:

This is a hardware reservation in a provisionable state with no instance.

  "device": null,
  "provisionable": true,

This is the hardware reservation from the moment a user kicks off an instance provision until the instance/device is marked as deleted:

  "device": {
    "href": "/devices/0bc3d567-7248-4664-890d-05420312e91f"
  },
  "provisionable": false,

This is the same hardware reservation right after the user has marked the instance for deletion:

  "device": null,
  "provisionable": false,

The user cannot get info about the deleted instance anymore:

{
  "errors": [
    "Not found"
  ]
}

@displague The most important thing to note for your proposal is that the instance/device ID gets wiped from the hardware reservation metadata right after the user marks the instance for deletion.

enkelprifti98 commented 2 years ago

I was finally able to replicate the race condition / instance deletion timeout issue. My steps were the following:

  1. Provision a metal_device resource with the hardware_reservation_id and wait_for_reservation_deprovision parameters.
  2. Run terraform destroy to delete the instance/device.
  3. While Terraform is waiting for the hardware reservation's provisionable boolean to be set to true, repeatedly try to manually provision a new instance/device on the same hardware_reservation_id via the API/cURL (sketched below).
  4. If the manual API call was sent just BEFORE Terraform saw the provisionable boolean set to true, Terraform will keep checking for it to be set to true until it hits the 1-hour timeout.

It's pretty difficult to replicate this, since Terraform checks the provisionable boolean fairly frequently. I didn't even realize I had beaten Terraform with a manual API/cURL provision until I saw it still waiting past the 10-minute mark for the old device/reservation to be destroyed, checked the portal, and found a new device being created.
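
A hedged sketch of the manual provision in step 3, using packngo rather than raw cURL; the hostname, plan, metro, OS, and IDs are placeholders:

import "github.com/packethost/packngo"

// raceProvision attempts to claim the reservation while Terraform is still
// waiting on the destroy. All values here are illustrative placeholders.
func raceProvision(client *packngo.Client, projectID, reservationID string) error {
    req := &packngo.DeviceCreateRequest{
        Hostname:              "race-test",
        Plan:                  "c3.small.x86",
        Metro:                 "da",
        OS:                    "ubuntu_20_04",
        ProjectID:             projectID,
        HardwareReservationID: reservationID,
    }
    _, _, err := client.Devices.Create(req)
    return err
}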

enkelprifti98 commented 2 years ago

My fix proposal would be to check if either of these conditions is true:

( provisionable=true ) OR ( [old_deviceID != new_deviceID] && [new_deviceID != null] && [provisionable=false] )

The first condition is what we have today. The second condition checks whether a new instance ID/device ID (null cannot be used) has been deployed on the same hardware_reservation_id. A new instance ID that Terraform is not aware of running on the same hardware reservation effectively means the reservation became provisionable at some point during the race window.
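
In Go terms, the proposed check might look like this (a sketch against packngo's reservation type; the function name is illustrative, and oldDeviceID stands for the instance Terraform just deleted):

import "github.com/packethost/packngo"

// reservationReleased is a sketch of the proposed condition: the old
// instance is gone if the reservation is provisionable again, or if a new
// instance (one Terraform does not know about) has already claimed it.
func reservationReleased(r *packngo.HardwareReservation, oldDeviceID string) bool {
    if r.Provisionable {
        return true
    }
    return r.Device != nil && r.Device.ID != "" && r.Device.ID != oldDeviceID
}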

displague commented 2 years ago

@enkelprifti98 pointed out the following in Slack:

Why not just return a successful answer if it gets a new id? A new id and provisionable = false means that the hardware completed deprovisioning at some point in that race window

That makes sense. If a new id is detected, this is not an error: deletion was successful, so exit the loop. The next provision attempt (if Terraform makes one) may discover that the reservation is already in use, but that's not Terraform's problem to solve.

displague commented 2 years ago

Reopening because we have seen a recurrence of the problem with v3.2.2. Awaiting more feedback.

metal_device.reserved: Still destroying... [id=..., 1h+ elapsed]
╷
│ Error: timeout while waiting for state to become 'provisionable, reprovisioned' (last state: 'deprovisioning', timeout: 1h0m0s)
ocobles commented 2 years ago

It is hard to replicate this error, and we are not collecting enough information to understand what the API is returning. Unfortunately, stateConf.WaitForState() will not return additional info on a timeout, so I suggest adding a default case to the switch block to trace what is being returned.

https://github.com/equinix/terraform-provider-metal/blob/edc45fee397d331d03bdd3722b52641cd4354936/metal/helpers_device.go#L125-L134

switch {
case err != nil:
    err = friendlyError(err)
    state = errstate
case r != nil && r.Provisionable:
    state = provisionable
case r != nil && r.Device != nil && r.Device.ID != "" && r.Device.ID != instanceId:
    log.Printf("[WARN] Equinix Metal device instance %s (reservation %s) was reprovisioned to another instance (%s)", instanceId, reservationId, r.Device.ID)
    state = reprovisioned
default:
    // Trace anything else that comes back so a timeout leaves evidence
    // of what the API was returning.
    log.Printf("[TRACE] reservation %s refresh returned unexpected response: %+v", reservationId, r)
}
ocobles commented 2 years ago

https://github.com/equinix/terraform-provider-metal/blob/edc45fee397d331d03bdd3722b52641cd4354936/metal/helpers_device.go#L123

I'm not sure if this is how it works, but at least from the API you need to explicitly set the param include=device, because by default it returns just the href. I guess we will need to add GetOptions.Includes here; otherwise it won't reach that reprovisioned case, since r.Device.ID is probably empty.
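
A sketch of what that could look like with packngo's GetOptions (variable names follow the refresh code above):

// Ask the API to expand the device sub-resource instead of returning only
// its href, so r.Device.ID is populated for the reprovisioned check.
opts := &packngo.GetOptions{Includes: []string{"device"}}
r, _, err := client.HardwareReservations.Get(reservationId, opts)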

enkelprifti98 commented 2 years ago

My suspicion is along the same lines: we're not properly getting the raw ID of the new instance. You can see the default API behavior here:

https://github.com/equinix/terraform-provider-metal/issues/208#issuecomment-1009259174