Closed: displague closed this issue 2 years ago
Adding my description here as well.

Terraform loops forever until the 1 hour timeout when deleting instances that use `hardware_reservation_id` and `wait_for_reservation_deprovision`:
I think we might be looking at some type of race condition here, where a new instance provision is issued against the same hardware reservation as soon as it becomes provisionable. The Terraform delete process for the old instance running on that hardware reservation still sees the reservation as NOT provisionable, since it is now provisioning a new instance. In other words, the delete process missed the short window in which the reservation was marked as provisionable right after deprovisioning completed.

Terraform checks the hardware reservation state during the old instance's delete process to see if it is provisionable, and the new instance provision takes over the same hardware reservation and flips the `provisionable` boolean back to `false`. So the reservation's `provisionable` boolean goes from `false` -> `true` -> `false` during that race window, and the old instance's Terraform delete process loops forever checking whether the reservation is provisionable until the 1 hour timeout.
```
metal_device.nodes[0]: Still destroying... [id=0bc3d567-7248-4664-890d-05420312e91f, 59m50s elapsed]
metal_device.nodes[0]: Still destroying... [id=0bc3d567-7248-4664-890d-05420312e91f, 1h0m0s elapsed]
╷
│ Error: timeout while waiting for state to become 'true' (last state: 'false', timeout: 1h0m0s)
│
╵
```
Here's some info on the states of a hardware reservation's lifecycle.

A hardware reservation in a provisionable state with no instance:

```json
"device": null,
"provisionable": true,
```

A hardware reservation from the moment a user kicks off an instance provision until the instance/device has been marked as deleted:

```json
"device": {
    "href": "/devices/0bc3d567-7248-4664-890d-05420312e91f"
},
"provisionable": false,
```

The same hardware reservation right after the user has marked the instance for deletion:

```json
"device": null,
"provisionable": false,
```

The user can no longer get info about the deleted instance:

```json
{
  "errors": [
    "Not found"
  ]
}
```
@displague The most important thing to note here, based on your proposal, is that the instance/device ID gets wiped from the hardware reservation metadata right after the user marks the instance for deletion.
I was finally able to replicate the race condition / instance deletion timeout issue. My steps are the following:

1. Create a `metal_device` resource with the `hardware_reservation_id` and `wait_for_reservation_deprovision` parameters.
2. Run `terraform destroy` to delete the instance/device.
3. While Terraform waits for the `provisionable` boolean to be set to `true`, repeatedly try to manually provision a new instance/device on the same `hardware_reservation_id` via the API/cURL.
4. If the manual provision succeeds before Terraform sees the `provisionable` boolean set to `true`, Terraform will continue to check for it to be set to `true` until it gets to the 1 hour timeout.

It's pretty difficult to replicate this, since Terraform seems to check the `provisionable` boolean fairly frequently. I didn't even realize I had beaten Terraform while trying to provision via the API/cURL manually until I saw Terraform past the 10 minute mark still waiting for the old device/reservation to be destroyed, checked the portal, and saw that a new device was being created.
My fix proposal would be to check if either of these conditions is true:

`( provisionable == true ) OR ( old_deviceID != new_deviceID AND new_deviceID != null AND provisionable == false )`

The first condition is what we have today. The second condition checks whether a new instance/device ID (null cannot be used) has been deployed on the same `hardware_reservation_id`. A new instance ID that Terraform is not aware of running on the same hardware reservation effectively means that the reservation became provisionable during the race condition window.
@enkelprifti98 pointed out the following in Slack:
> Why not just return a successful answer if it gets a new id? A new id and provisionable = false means that the hardware completed deprovisioning at some point in that race window
That makes sense. If a new id is detected, this is not an error. Deletion was successful, exit loop. The next provision attempt (if Terraform wants to make one) may discover that the reservation is already in use, but that's not Terraform's problem to solve.
Reopening because we have seen a recurrence of the problem with v3.2.2. Awaiting more feedback.
```
metal_device.reserved: Still destroying... [id=..., 1h+ elapsed]
╷
│ Error: timeout while waiting for state to become 'provisionable, reprovisioned' (last state: 'deprovisioning', timeout: 1h0m0s)
```
It is hard to replicate this error, and we are not collecting enough information to understand what is being returned from the API. Unfortunately, stateConf.WaitForState() will not return additional info if there is a timeout, so instead I suggest adding a `default` case in the switch block to trace what is being returned.
```go
switch {
case err != nil:
	err = friendlyError(err)
	state = errstate
case r != nil && r.Provisionable:
	state = provisionable
case r != nil && r.Device != nil && (r.Device.ID != "" && r.Device.ID != instanceId):
	log.Printf("[WARN] Equinix Metal device instance %s (reservation %s) was reprovisioned to another instance (%s)", instanceId, reservationId, r.Device.ID)
	state = reprovisioned
default:
	// TRACE: print any other behavior along with the raw reservation r
	log.Printf("[TRACE] unexpected hardware reservation state for %s: %+v", reservationId, r)
}
```
I'm not sure if this is how it works, but at least from the API you need to explicitly set the param `include=device`, because it just returns an `href` by default. I guess we will need to add `GetOptions.Includes` here, otherwise it won't reach that `reprovisioned` case, since `r.Device.ID` is probably empty.
My suspicion is along the same lines: we're not properly getting the raw ID of the new instance. You can see the default API behavior here:
https://github.com/equinix/terraform-provider-metal/issues/208#issuecomment-1009259174
It is possible (seen by real users) for a `wait_for_reservation_deprovision` enabled `metal_device` to be destroyed and then reclaimed by an external tool BEFORE the destroy loop detects that the device was provisionable. The machine is then unprovisionable and the full 1hr timeout must be awaited.

Rather than waiting for `provisionable=true`, the RefreshStateFunc could wait for the instance id of the reservation to become empty and for the `provisionable` flag to become true. The known instance id would be used in the `Pending` slice of strings that this helper function takes, with a `Target` of `""`. If an unknown instance id is returned by the function, Terraform would fail because of the unexpected value.