buildkite / lifecycled

A daemon for responding to AWS AutoScaling Lifecycle Hooks
MIT License

SQS queue deleted on TERMINATION notice vs after handler script #55

Closed: lox closed this issue 6 years ago

lox commented 6 years ago

@itsdalmo I noticed that we now delete the SQS queue as soon as we get a termination notice, rather than after the handler script finishes. I would have thought this should happen after the handler script finishes, in case it fails. Was that intentional? We don't document anywhere how a failure from the handler script is handled; based on the current docs I'd probably just assume it will complete the termination even on failure. Thoughts?

lox commented 6 years ago

05:26:13
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Polling sqs for messages","queueURL":"https://sqs.us-east-1.amazonaws.com/172840064832/lifecycled-i-022df5779e8fcc9d2","time":"2018-10-04T05:26:13Z"}

05:26:14
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"spot","msg":"Polling ec2 metadata for spot termination notices","time":"2018-10-04T05:26:14Z"}

05:26:17
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Received an SQS message","subject":"Auto Scaling: Lifecycle action 'TERMINATING' for instance i-022df5779e8fcc9d2 in progress.","time":"2018-10-04T05:26:17Z","type":"Notification"}

05:26:17
{"arn":"arn:aws:sns:us-east-1:172840064832:lifecycled-example-lifecycle:87cf3edc-90ce-45a1-98e5-26f757866f5e","instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Deleting sns subscription","time":"2018-10-04T05:26:17Z"}

05:26:17
{"instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Received termination notice","notice":"autoscaling","time":"2018-10-04T05:26:17Z"}

05:26:17
{"instanceId":"i-022df5779e8fcc9d2","level":"info","listener":"spot","msg":"Stopped listener","time":"2018-10-04T05:26:17Z"}

05:26:18
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Deleting sqs queue","queue":"lifecycled-i-022df5779e8fcc9d2","time":"2018-10-04T05:26:18Z"}

05:26:18
{"instanceId":"i-022df5779e8fcc9d2","level":"info","listener":"autoscaling","msg":"Stopped listener","time":"2018-10-04T05:26:18Z"}

05:26:18
{"instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Executing handler","notice":"autoscaling","time":"2018-10-04T05:26:18Z"}

05:26:28
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:28Z"}

05:26:38
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:38Z"}

05:26:48
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:48Z"}

05:26:58
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:58Z"}

05:27:08
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:08Z"}

05:27:18
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:18Z"}

05:27:28
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:28Z"}

05:27:38
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:38Z"}

05:27:48
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:48Z"}

05:27:58
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:58Z"}

05:28:08
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:28:08Z"}

05:28:18
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:28:18Z"}

05:28:18
{"instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Lifecycle action completed successfully","notice":"autoscaling","time":"2018-10-04T05:28:18Z"}

05:28:18
{"duration":"2m0.252715704s","instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Handler finished succesfully","notice":"autoscaling","time":"2018-10-04T05:28:18Z"}
itsdalmo commented 6 years ago

I would have thought this should happen after the handler script finishes, in case it fails. Was that intentional? We don't document anywhere how a failure from the handler script is handled; based on the current docs I'd probably just assume it will complete the termination even on failure. Thoughts?

It was intentional; even if the handler script fails, we don't have any options, since ABANDON and CONTINUE will both proceed with the termination. And even if we could ABANDON, that would just leave us with instances whose handlers fail to execute. In other words, I don't think we have any recourse if the handler script fails, so we might as well clean up immediately, since we know that the instance will be terminated anyway.
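
For illustration, a minimal sketch (aws-sdk-go, with assumed names) of why there is no real choice here: ABANDON and CONTINUE are just values for LifecycleActionResult on the same CompleteLifecycleAction call, and for a terminating hook either one lets the termination proceed.

```go
// Minimal sketch: completing the lifecycle action with aws-sdk-go.
// The names are assumptions; only the result string distinguishes the two outcomes.
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func completeTermination(groupName, hookName, instanceID string) error {
	svc := autoscaling.New(session.Must(session.NewSession()))
	_, err := svc.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName: aws.String(groupName),
		LifecycleHookName:    aws.String(hookName),
		InstanceId:           aws.String(instanceID),
		// "ABANDON" would also end with the instance being terminated.
		LifecycleActionResult: aws.String("CONTINUE"),
	})
	return err
}
```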

We could retry the handler if it exits with an error, but that would require the handler script to be idempotent, so I think it's best that any retry logic is implemented by the user directly in their script.
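
For completeness, the retry variant being ruled out here would look roughly like the sketch below (assumed names, not lifecycled code); each attempt re-runs the script from scratch, which is exactly why it would have to be idempotent.

```go
// Sketch of a handler retry loop (not what lifecycled does); names are assumptions.
package sketch

import (
	"log"
	"os/exec"
)

func runHandlerWithRetries(path string, args []string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		// Every attempt re-executes the handler script, so it must be safe
		// to run more than once (idempotent).
		if err = exec.Command(path, args...).Run(); err == nil {
			return nil
		}
		log.Printf("handler attempt %d failed: %v", i+1, err)
	}
	return err
}
```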

I'd probably just assume it will complete the termination even on failure

The lifecycle action will complete on handler failure, and since we are using CommandContext it should also complete if lifecycled receives a SIGTERM while it is executing the handler (see the sketch after the cases below). This is good in the following cases:

Someone hits terminate in the console:

If someone hits terminate in the console, the autoscaling group will still wait for the lifecycle action to be completed until the heartbeat timeout expires, so it's good to just complete it when we receive a SIGTERM.

The handler has executed long enough that we have hit the max limit for the wait state:

The maximum amount of time that you can keep an instance in a wait state is 48 hours or 100 times the heartbeat timeout, whichever is smaller. Source: https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html

In our example we have a 1-minute heartbeat timeout, which allows us to stall the termination for at most 100 × 1 minute = 1 hour and 40 minutes. If the handler executes longer than that, lifecycled will receive a SIGTERM and the whole machine has an indeterminate (not documented) amount of time to complete shutdown. As far as I've read, it seems like the machine is killed outright roughly 30 seconds after the SIGTERM.
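
The SIGTERM behaviour referenced above looks roughly like the sketch below (assumed names; the handler path stands in for the configured handler script): the context given to exec.CommandContext is cancelled when SIGTERM arrives, which kills the still-running handler so the lifecycle action can still be completed before the machine goes away.

```go
// Sketch of the CommandContext + SIGTERM interaction; names and paths are assumptions.
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Cancel the context when lifecycled receives a SIGTERM; exec.CommandContext
	// then kills the handler process if it is still running.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		cancel()
	}()

	cmd := exec.CommandContext(ctx, "/path/to/handler", "autoscaling:EC2_INSTANCE_TERMINATING")
	if err := cmd.Run(); err != nil {
		log.Printf("handler failed or was interrupted: %v", err)
	}
	// ...complete the lifecycle action and clean up regardless of the error above.
}
```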

All in all, I've tried to make it so that the subscription and queue are guaranteed to be cleaned up whether the handler script succeeds or fails, we receive a SIGTERM, or we otherwise fail to execute, and as fast as possible, to avoid ending up in situations where there is no time to clean up. So starting the cleanup immediately upon receiving a termination notice seemed prudent, since we don't have any recourse and we know that the instance will be terminated in the end anyway.

lox commented 6 years ago

Yup, I think I agree, just wanted to make sure it was intentional!