Add retry on instance-id fetching and queue delete

lox commented 6 years ago

This adds explicit retries to fetching the instance-id and also queue deletion.

We could use the built in retry gear in the aws-sdk, but this is slightly clearer IMO.

itsdalmo commented 6 years ago

In my opinion, I think this should only be implemented for the deferred calls to Delete and Unsubscribe. For getInstanceID I don't think it adds much value, since any error will cause lifecycled to exit with an error and then get restarted by systemd - which is very unlikely to cause problems since it only happens once on startup.

If you do decide to keep this implementation for getInstanceID I think the function should return string and not <-chan string (current code seems to be a mix of the two).

itsdalmo commented 6 years ago

Another thing worth considering is that the deferred call to Queue.Delete() happens after the lifecycle hook has been completed, and in my experience the signal is received pretty much immediately after the hook has been completed. When the signal is received, the subctx context will be cancelled (since it's a child of ctx), which means that it could be cancelled after just 1 attempt, or none at all.

I think it's nice to have the retries for unsubscribing and deleting the queue, but because of the issue discussed above I think we need to do some more work before merging this PR. IMO the best solution is to create a "happy path" for when a termination notice is received, which runs the handler and then starts the cleanup immediately after the handler has exited, and only completes the lifecycle hook after the cleanup is done.

This is what I ended up doing in my own experiments:

daemon.go returns completeFunc func() error.
main.go waits for daemon.Start() to return completeFunc and invokes it at a point where the queue has already been cleaned up.

I can submit a PR which carves out a "happy path" and let you evaluate!

lox commented 6 years ago

Closing for now, agree with your feedback. Hopefully without the open files bug we won't need a retry.

buildkite / lifecycled

Add retry on instance-id fetching and queue delete #40