+1
The difference between shutting down lifecycled manually (with ctrl-c):
+ lifecycled --sns-topic arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle --handler /cleanup.sh --debug
INFO[0000] Looking up instance id from metadata service
DEBU[0000] Creating sqs queue queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0000] Looking up sqs queue url queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0000] Subscribing queue to sns topic queue=lifecycled-i-0f5c95a0d2569a5b6 topic=arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle
DEBU[0000] Polling metadata service for spot termination notices
INFO[0000] Listening for lifecycle notifications
DEBU[0000] Polling sqs for messages queueURL=https://sqs.us-east-1.amazonaws.com/731841313232/lifecycled-i-0f5c95a0d2569a5b6
^CINFO[0008] Shutting down gracefully...
DEBU[0008] Deleting sns subscription arn=arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle:c5669c67-b0e9-4d03-87f3-ff3dd2e88ff7
DEBU[0008] Deleting sqs queue url=https://sqs.us-east-1.amazonaws.com/731841313232/lifecycled-i-0f5c95a0d2569a5b6
And after executing a handler:
+ lifecycled --sns-topic arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle --handler /cleanup.sh --debug
INFO[0000] Looking up instance id from metadata service
DEBU[0000] Creating sqs queue queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0002] Looking up sqs queue url queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0002] Subscribing queue to sns topic queue=lifecycled-i-0f5c95a0d2569a5b6 topic=arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle
DEBU[0002] Polling metadata service for spot termination notices
INFO[0002] Listening for lifecycle notifications
DEBU[0002] Polling sqs for messages queueURL=https://sqs.us-east-1.amazonaws.com/731841313232/lifecycled-i-0f5c95a0d2569a5b6
DEBU[0017] Received an SQS message subject=Auto Scaling: Lifecycle action 'TERMINATING' for instance i-0f5c95a0d2569a5b6 in progress. type=Notification
INFO[0017] Executing handler handler=/cleanup.sh instanceid=i-0f5c95a0d2569a5b6 transition=autoscaling:EC2_INSTANCE_TERMINATING
+ echo -e Stopping Service
Stopping Service
INFO[0017] Handler finished successfully duration=268.707347ms handler=/cleanup.sh instanceid=i-0f5c95a0d2569a5b6 transition=autoscaling:EC2_INSTANCE_TERMINATING
INFO[0018] Lifecycle action completed successfully instanceid=i-0f5c95a0d2569a5b6 transition=autoscaling:EC2_INSTANCE_TERMINATING
Connection to 10.215.93.133 closed by remote host.
Connection to 10.215.93.133 closed.
It doesn't get a chance to run the queue cleanup; I'm not sure why.
My suspicion is that in some cases the instance is shut down so quickly that the cleanup doesn't get a chance to execute.
I might have misread the code, but it doesn't seem like this deferred call is ever actually run, since daemon.Start() will block until d.Queue.Receive(ch) returns, which it does not do unless an error occurs?
If that is correct, then the only way lifecycled will shut down and clean up the queue is either (A) d.Queue.Receive(ch) returning an error, or (B) the process receiving a signal.
So in a world where SQS functions without problems: when lifecycled handles a termination notice, it will execute the handler, issue a "continue" for the termination, and then wait for shutdown to be sent to the instance (which is sent by AWS when the termination continues)? In this case, would it make sense to check whether it received a termination notice, and continue with the cleanup after the "continue" has been sent, since that would give it more time to clean up the queue?
I might be missing a lot here, but this is how I read the code 😛
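To make the concern above concrete, here is a minimal, hypothetical sketch (made-up types, not the actual lifecycled source) of why a deferred cleanup never runs while the receive loop is healthy: the defer only fires if Receive returns, which it only does on error.

```go
package main

import "log"

// Queue is a stand-in for the SQS-backed queue discussed above (hypothetical,
// not the real lifecycled type).
type Queue struct{}

// poll stands in for a long-poll against SQS; here it simply blocks forever,
// like a healthy polling loop that never errors.
func (q *Queue) poll() (string, error) {
	select {}
}

// Receive forwards messages to ch and only returns when polling fails.
func (q *Queue) Receive(ch chan<- string) error {
	for {
		msg, err := q.poll()
		if err != nil {
			return err // the only way out of the loop
		}
		ch <- msg
	}
}

// Delete is the cleanup step (unsubscribe from SNS, delete the SQS queue).
func (q *Queue) Delete() {
	log.Println("deleting SNS subscription and SQS queue")
}

// Start mirrors the structure described above: the deferred cleanup only runs
// if Receive returns, i.e. if an error occurs. A SIGTERM that isn't handled
// elsewhere kills the process without ever running the defer.
func Start(q *Queue) error {
	defer q.Delete()

	ch := make(chan string, 1)
	return q.Receive(ch) // blocks here
}

func main() {
	if err := Start(&Queue{}); err != nil {
		log.Printf("daemon exited: %v", err)
	}
}
```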
Shouldn't termination always end up triggering option B? Shutdown of lifecycled will result in a signal.
> In this case, would it make sense to check whether it received a termination notice, and continue with the cleanup after the "continue" has been sent, since that would give it more time to clean up the queue?
Yup, I agree with this. I haven't ever been able to substantiate that the issue is that there isn't time to clean up the queue though, but it couldn't hurt.
> Shouldn't termination always end up triggering option B? Shutdown of lifecycled will result in a signal.
Yep, but in order to have more time it might be an idea to start cleaning up the queue immediately if it was a termination notice and the handler has executed successfully?
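A rough sketch of that ordering, with made-up helper names (runHandler, completeLifecycleAction and cleanupQueue are placeholders for illustration, not the real lifecycled functions): do the queue cleanup right after sending the "continue", rather than waiting for the instance shutdown to deliver a signal.

```go
package main

import (
	"context"
	"log"
)

// Notice is a hypothetical stand-in for a parsed termination notice.
type Notice struct {
	InstanceID string
	Transition string
}

// Placeholder steps for illustration only.
func runHandler(ctx context.Context, n Notice) error              { log.Println("handler ran"); return nil }
func completeLifecycleAction(ctx context.Context, n Notice) error { log.Println("sent CONTINUE"); return nil }
func cleanupQueue(ctx context.Context) error                      { log.Println("queue cleaned up"); return nil }

// handleTermination sketches the proposed ordering: run the handler, signal
// "continue", then clean up the queue immediately while there is still time
// before AWS shuts the instance down.
func handleTermination(ctx context.Context, n Notice) error {
	if err := runHandler(ctx, n); err != nil {
		return err
	}
	if err := completeLifecycleAction(ctx, n); err != nil {
		return err
	}
	return cleanupQueue(ctx)
}

func main() {
	n := Notice{InstanceID: "i-0f5c95a0d2569a5b6", Transition: "autoscaling:EC2_INSTANCE_TERMINATING"}
	if err := handleTermination(context.Background(), n); err != nil {
		log.Fatal(err)
	}
}
```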
Having played around with this, I'm not entirely sure that the problem is that there isn't enough time to clean up the queues. When the termination proceeds, AWS sends a shutdown -h to the instance and waits for ~30 seconds before pulling the plug on the shutdown routine - and 30 seconds should be more than enough time to unsubscribe from SNS and delete the SQS queue?
I've done some tests with my modified version of lifecycled and found that in all cases it manages to unsubscribe and delete the queue before being shut down with:
sudo shutdown -h now
It's very hard to debug this stuff since shutdown -h kills SSH and prevents us from using tail -f on the log file. But after stopping the instance in the console, and then starting it again, I could see from the logs that my program/daemon (which is very similar to lifecycled) was allowed to gracefully shut down.
Lessons:
- cleanupFunc might have run even though the logs look like this: https://github.com/buildkite/lifecycled/issues/12#issuecomment-364618568
- Stopping the instance sends SIGTERM. Same when running systemctl stop <name of service>. So we might not need to capture all the other signals (see the sketch below).
- Something (multi-user.target?) prevents us from seeing the full logs.
- CloudWatch: Related to the last point above; awslogsd.service, which is installed with the awslogs agent, shuts down pretty quickly, and since awslogs only sends logs every 5 seconds it's pretty much guaranteed that the most interesting parts of the lifecycled logs are not going to show up in CloudWatch. Unless we first log everything, sleep for ~20 seconds and then complete the lifecycle (feels very hacky). tl;dr: I think we either need to use a logrus hook to write directly to CloudWatch, or live without lifecycled logs in CloudWatch.
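For the SIGTERM point above, a minimal sketch of what "capturing the signal" looks like (standard library only; the actual cleanup is elided):

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Stopping the instance or running `systemctl stop` delivers SIGTERM, so
	// catching SIGTERM (plus SIGINT for an interactive ctrl-c) should be
	// enough to trigger a graceful cleanup.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)

	sig := <-sigs
	log.Printf("received %s, cleaning up SNS subscription and SQS queue", sig)
	// ... unsubscribe from SNS and delete the SQS queue here ...
}
```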
Conclusion:
In the case of lifecycled, I can't really say for sure why it is failing to clean up SQS queues (although I've seen it happen). As discussed above, if the instance receives SIGTERM it should be captured and trigger the cleanupFunc. Even if you are running this unit file, the default TimeoutStopSec should be 90 seconds, which should be more than enough time to shut down gracefully. There is no After=network-online.target in your unit file, but I have one in my own (and I'm still having this issue myself).
tl;dr: Very hard to debug this since it does not happen every time, and the logs disappear since the instance is terminating. It's tempting to just create lifecycle hooks with short-ish timeouts and never signal the "continue", but instead stop heartbeating and shut down the daemon (after running the handler script), and then just wait for the timeout - that way we would at least be able to see the logs for lifecycled in CloudWatch.
I've rolled all the recent changes you've made into v2.0.2 @itsdalmo. I'm going to release it in the latest Elastic Stacks and monitor. Thanks very much for your efforts.
Unfortunately the changes in v2.0.2 haven't made a difference, but I think that https://github.com/buildkite/lifecycled/issues/34 is a strong candidate. Will aim for a new point release tomorrow.
Just noticed this problem as well: https://github.com/buildkite/lifecycled/blob/master/daemon.go#L179. Since we have two go-routines reading from the same channel, executeHandler might actually read a signal in place of the go-routine in main.go, which means that the context is never cancelled. And since we currently don't have a happy path, lifecycled is left running after executeHandler returns. This could lead to a situation where a SIGTERM is caught by the wrong go-routine (i.e. passed to the handler process), and when the TimeoutStopSec is reached we just get a SIGKILL and no chance to do a graceful shutdown.
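To illustrate the channel behaviour behind this (a standalone demonstration, not the lifecycled code): a value sent on a Go channel is delivered to exactly one receiver, so whichever goroutine wins the race consumes the signal and the other never sees it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// Two goroutines receiving from the same channel, as in the situation
	// described above: only one of them will ever get the "signal".
	signals := make(chan string, 1)

	var wg sync.WaitGroup
	for _, name := range []string{"main-loop", "executeHandler"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			select {
			case s := <-signals:
				fmt.Printf("%s received %s\n", name, s)
			case <-time.After(time.Second):
				fmt.Printf("%s received nothing\n", name)
			}
		}(name)
	}

	signals <- "SIGTERM"
	wg.Wait()
}
```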
This will likely be solved by #42, but just to be safe I think it's best to listen for ctx.Done() in executeHandler and send a generic os.Interrupt to the child process (if we want it to get the same signal as the root process, we need to use two channels).
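A rough sketch of that suggestion (a hypothetical shape, not the actual executeHandler): watch ctx.Done() and forward a generic os.Interrupt to the handler process, instead of having the handler goroutine compete for the shared signal channel. The handler arguments below are illustrative only.

```go
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"time"
)

// executeHandler runs the handler script and, if the daemon's context is
// cancelled, asks the child to stop with os.Interrupt and waits for it to exit.
func executeHandler(ctx context.Context, script string, args ...string) error {
	cmd := exec.Command(script, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case <-ctx.Done():
		_ = cmd.Process.Signal(os.Interrupt) // generic interrupt for the child
		return <-done
	case err := <-done:
		return err
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Argument order is illustrative, not the real lifecycled contract.
	err := executeHandler(ctx, "/cleanup.sh", "autoscaling:EC2_INSTANCE_TERMINATING", "i-0f5c95a0d2569a5b6")
	if err != nil {
		log.Printf("handler: %v", err)
	}
}
```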
I believe this issue should be resolved in v2.1.1 (and the next release); at least it has been for autoscaling termination notices in my tests. Have you had a chance to test spot terminations @lox?
This was fixed in 3.0.0, finally 🎉
See https://github.com/buildkite/elastic-ci-stack-for-aws/issues/358