```
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Polling sqs for messages","queueURL":"https://sqs.us-east-1.amazonaws.com/172840064832/lifecycled-i-022df5779e8fcc9d2","time":"2018-10-04T05:26:13Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"spot","msg":"Polling ec2 metadata for spot termination notices","time":"2018-10-04T05:26:14Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Received an SQS message","subject":"Auto Scaling: Lifecycle action 'TERMINATING' for instance i-022df5779e8fcc9d2 in progress.","time":"2018-10-04T05:26:17Z","type":"Notification"}
{"arn":"arn:aws:sns:us-east-1:172840064832:lifecycled-example-lifecycle:87cf3edc-90ce-45a1-98e5-26f757866f5e","instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Deleting sns subscription","time":"2018-10-04T05:26:17Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Received termination notice","notice":"autoscaling","time":"2018-10-04T05:26:17Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"info","listener":"spot","msg":"Stopped listener","time":"2018-10-04T05:26:17Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","listener":"autoscaling","msg":"Deleting sqs queue","queue":"lifecycled-i-022df5779e8fcc9d2","time":"2018-10-04T05:26:18Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"info","listener":"autoscaling","msg":"Stopped listener","time":"2018-10-04T05:26:18Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Executing handler","notice":"autoscaling","time":"2018-10-04T05:26:18Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:28Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:38Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:48Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:26:58Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:08Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:18Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:28Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:38Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:48Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:27:58Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:28:08Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"debug","msg":"Sending heartbeat","notice":"autoscaling","time":"2018-10-04T05:28:18Z"}
{"instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Lifecycle action completed successfully","notice":"autoscaling","time":"2018-10-04T05:28:18Z"}
{"duration":"2m0.252715704s","instanceId":"i-022df5779e8fcc9d2","level":"info","msg":"Handler finished succesfully","notice":"autoscaling","time":"2018-10-04T05:28:18Z"}
```
@itsdalmo I noticed that we now delete the SQS queue as soon as we get a termination notice, rather than after the handler script. I would have thought this should happen after the handler script finishes, in case it fails. Was that intentional? We don't document anywhere how a failure from the handler script is handled; based on the current docs I'd probably just assume it will complete the termination even on failure. Thoughts?
It was intentional: even if the handler script fails, we don't have any options, since `ABANDON` and `CONTINUE` will both proceed with the termination. And even if we could `ABANDON`, that would just leave us with instances whose handlers fail to execute. In other words, I don't think we have any recourse if the handler script fails, so we might as well clean up immediately, since we know the instance will be terminated anyway.

As for retries: we could re-run the handler if it exits with an error, but that would require the handler script to be idempotent, so I think it's best that any retry logic is implemented by the user directly in their script.
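For context on why there is no recourse: completing the hook goes through the same `CompleteLifecycleAction` API call either way, and for a termination hook both results end with the instance going away. A minimal sketch with aws-sdk-go (the `completeAction` helper and its parameters are illustrative, not lifecycled's actual code):

```go
// Minimal sketch: completing a termination lifecycle action after the handler
// has run. Whether the result is CONTINUE or ABANDON, the instance is still
// terminated, which is why a failing handler gives us no real recourse.
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func completeAction(svc *autoscaling.AutoScaling, asgName, hookName, instanceID string, handlerErr error) error {
	result := "CONTINUE"
	if handlerErr != nil {
		result = "ABANDON" // still ends in termination for a TERMINATING hook
	}
	_, err := svc.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String(asgName),
		LifecycleHookName:     aws.String(hookName),
		InstanceId:            aws.String(instanceID),
		LifecycleActionResult: aws.String(result),
	})
	return err
}
```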
> I'd probably just assume it will complete the termination even on failure
The lifecycle action will complete on handler failure, and since we are using `CommandContext` it should also complete if `lifecycled` receives a `SIGTERM` while it is executing the handler (see the sketch after the cases below). This is good in the following cases:
If someone hits terminate in the console, the autoscaling group will keep waiting for the lifecycle action to be completed until the heartbeat timeout, so it's good to just complete it when we receive a `SIGTERM`.
The maximum amount of time that you can keep an instance in a wait state is 48 hours or 100 times the heartbeat timeout, whichever is smaller. Source: https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html
In our example we have a 1 min heartbeat timeout, which allows us to stall the termination for at most 1h 40min. If the handler runs longer than that, `lifecycled` will receive a `SIGTERM` and the whole machine has an indeterminate (not documented) amount of time to complete shutdown. As far as I've read, it seems like the machine is killed outright ~30 seconds after the `SIGTERM`.
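Here is a rough sketch of the `CommandContext` pattern referred to above, with a hypothetical `runHandler` helper (not lifecycled's actual code); the context is cancelled on `SIGTERM`, so the child script is killed and the remaining shutdown time can be spent on cleanup:

```go
package sketch

import (
	"context"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runHandler executes the handler script and kills it if a SIGTERM arrives,
// mirroring the CommandContext behaviour described above.
// Hypothetical sketch, not lifecycled's actual implementation.
func runHandler(script string, args ...string) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Cancel the context (and therefore the child process) on SIGTERM.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		cancel()
	}()

	cmd := exec.CommandContext(ctx, script, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run() // returns once the script exits or the context is cancelled
}
```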
All in all, I've tried to make it so that the subscription and queue are guaranteed to be cleaned up whether the handler script succeeds or fails, we get a `SIGTERM`, or we otherwise fail to execute, and as fast as possible, to avoid ending up in situations where there is no time to clean up. So starting the cleanup immediately on receiving a termination notice seemed prudent, since we don't have any recourse and we know the instance will be terminated in the end anyway.
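For illustration, that cleanup boils down to deleting the per-instance SNS subscription and SQS queue (the "Deleting sns subscription" and "Deleting sqs queue" lines in the log above). A hypothetical sketch with aws-sdk-go; the `cleanup` name and its parameters are illustrative:

```go
// Hypothetical sketch of the cleanup step: delete the per-instance SNS
// subscription and SQS queue as soon as a termination notice arrives, so the
// cleanup happens even if the handler later fails or we receive SIGTERM.
package sketch

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sns"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func cleanup(snsClient *sns.SNS, sqsClient *sqs.SQS, subscriptionArn, queueURL string) {
	if _, err := snsClient.Unsubscribe(&sns.UnsubscribeInput{
		SubscriptionArn: aws.String(subscriptionArn),
	}); err != nil {
		log.Printf("failed to delete sns subscription: %v", err)
	}
	if _, err := sqsClient.DeleteQueue(&sqs.DeleteQueueInput{
		QueueUrl: aws.String(queueURL),
	}); err != nil {
		log.Printf("failed to delete sqs queue: %v", err)
	}
}
```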
Yup, I think I agree, just wanted to make sure it was intentional!