equinix / terraform-provider-equinix

Terraform Equinix provider
https://deploy.equinix.com/labs/terraform-provider-equinix/
MIT License

Await custom user state when provisioning Equinix Metal Devices #214

Open displague opened 2 years ago

displague commented 2 years ago

Rather than relying on Equinix Metal's reported device state, users may wish to take advantage of custom user states, which a device reports by POSTing to its metadata user-state URL, to signal when Terraform should consider a node active.

https://metal.equinix.com/developers/docs/server-metadata/user-state/

For example:

resource "metal_device" "foo" {
  behaviors {
    wait_for_userstate = { state = "succeeded", code = 1234 }
  }
}

With a definition like this, rather than waiting for the "status: active" poll, Terraform would wait for an event with the matching userstate to be received for this device.

Within the server, the user would trigger this event via userdata or SSH provisioning:

curl -s -X POST -d '{"state":"succeeded","code":1234,"message":{"testly":"test"}}' "$(curl -s https://metadata.platformequinix.com/metadata | jq -r .user_state_url)"

message can be any JSON blob, I believe.

It's not clear to me what format Terraform would accept to allow for user-state event matching. Is state enough? code? code + state? message? code + state + message? Do we limit Terraform to matching on text messages or should we match the whole message object or subentities within it?
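One possible answer to the matching question is subset matching: Terraform compares only the fields the user lists (state, code, message, or any combination) and ignores everything else in the event. The sketch below is illustrative Python, not provider code. `matches` is a hypothetical helper, and the event shape is assumed to mirror the POST body above.

```python
from collections.abc import Mapping
from typing import Any


def matches(wanted: Any, reported: Any) -> bool:
    """Return True if every field listed in `wanted` appears in `reported`.

    Mappings are compared key by key, recursively, so a partial `message`
    object matches a larger one. Any other value must be equal.
    """
    if isinstance(wanted, Mapping):
        if not isinstance(reported, Mapping):
            return False
        return all(
            key in reported and matches(value, reported[key])
            for key, value in wanted.items()
        )
    return wanted == reported


# Assumed event payload, copied from the curl example in this issue.
reported = {"state": "succeeded", "code": 1234, "message": {"testly": "test"}}

state_only = matches({"state": "succeeded"}, reported)
state_and_code = matches({"state": "succeeded", "code": 1234}, reported)
message_part = matches({"message": {"testly": "test"}}, reported)
wrong_state = matches({"state": "failed"}, reported)
```

With this approach the user decides how strict the match is: listing only state is the loosest form, while adding code or parts of message narrows it.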

displague commented 2 years ago

If this feature is implemented, it should be noted as a best practice whenever userdata is specified and the network mode will be converted to Layer 2 only.

displague commented 2 years ago

wait_for_userstate = { state = "succeeded", code = 1234 }

@cprivitere points out that it would be necessary for CloudInit to report failure too. The wait_for parameter would need to be aware of failure conditions/states.

A failed CloudInit would result in a failed provision. Should Terraform leave the machine up? Do we want another behavior to define this?

Perhaps:

wait_for_userstate = { userstate = { state = "succeeded", code = 1234 }, <something to define failure behavior> }

displague commented 2 years ago

A timeout-based failure would be triggered if the userstate was not pushed within the create timeout window.
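The success, failure, and timeout outcomes discussed so far could be combined in one polling loop. This is a rough Python sketch under stated assumptions, not the provider's implementation: `fetch_events` is a hypothetical callable returning the user-state events seen so far for the device, and the `success` and `failure` conditions are matched shallowly on the fields they list.

```python
import time
from typing import Any, Callable, Iterable, Mapping, Optional

Event = Mapping[str, Any]


class UserStateFailed(Exception):
    """A user-state event matching the failure condition was seen."""


class UserStateTimeout(Exception):
    """No matching user-state event arrived before the deadline."""


def _matches(wanted: Optional[Event], event: Event) -> bool:
    # Shallow match: every listed field must be present and equal.
    # An empty or missing condition never matches.
    if not wanted:
        return False
    return all(key in event and event[key] == value for key, value in wanted.items())


def wait_for_userstate(
    fetch_events: Callable[[], Iterable[Event]],  # hypothetical event source
    success: Event,
    failure: Optional[Event] = None,
    timeout: float = 1200.0,
    interval: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Event:
    """Poll until a success event arrives, a failure event arrives, or time runs out."""
    deadline = clock() + timeout
    while True:
        for event in fetch_events():
            if _matches(failure, event):
                raise UserStateFailed(event)
            if _matches(success, event):
                return event
        if clock() >= deadline:
            raise UserStateTimeout(f"no matching user state within {timeout} seconds")
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting. Whether the device is left up after `UserStateFailed` would still be a separate behavior to decide.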

ctreatma commented 1 year ago

Is there an API endpoint for retrieving userstate for a device? I don't see it in the docs.

displague commented 1 year ago

@ctreatma it's part of the device events feed: /metal/v1/devices/.../events?&per_page=25&page=1

The event will have a type that identifies it as a user-state event.

displague commented 1 month ago

I was curious how the Portal polls and reports device movement across states. It polls only the device/{id} endpoint, with no special include arguments, and reports back the provisioning_events (an array of events).

The final event reported for an L3 device (not sure about L2) is:

{
    "id": null,
    "type": "provisioning.110",
    "body": "Device phoned home and is ready to go",
    "state": null,
    "created_at": null,
    "modified_by": null,
    "relationships": [],
    "ip": null,
    "interpolated": "Device phoned home and is ready to go"
}

displague commented 1 month ago

I don't know if userstate events also appear in that list but, however they are sourced, wait_for_userstate and wait_for_provisioning_event would function similarly.
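For the provisioning-event variant, the check could be as small as scanning the device's provisioning_events array for a given type. A Python sketch, assuming the device response carries a `provisioning_events` list shaped like the event above; `has_provisioning_event` is a hypothetical helper name.

```python
from typing import Any, Mapping


def has_provisioning_event(device: Mapping[str, Any], event_type: str) -> bool:
    """Return True if the device reports a provisioning event of `event_type`."""
    events = device.get("provisioning_events") or []
    return any(event.get("type") == event_type for event in events)


# Trimmed device response, using the final L3 event quoted in this issue.
device = {
    "provisioning_events": [
        {
            "id": None,
            "type": "provisioning.110",
            "body": "Device phoned home and is ready to go",
            "state": None,
        }
    ]
}

ready = has_provisioning_event(device, "provisioning.110")
```

Because the type to wait for may differ for L2 devices, it is passed in as a parameter.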