buildkite / lifecycled

A daemon for responding to AWS AutoScaling Lifecycle Hooks
MIT License

SQS queues not being cleaned up (lifecycled receiving PIPE instead of TERM) #12

Closed: peter-svensson closed this issue 5 years ago

peter-svensson commented 6 years ago

See https://github.com/buildkite/elastic-ci-stack-for-aws/issues/358

analytically commented 6 years ago

+1

tonglil commented 6 years ago

For comparison, here is lifecycled being shut down manually (with ctrl-c):

+ lifecycled --sns-topic arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle --handler /cleanup.sh --debug
INFO[0000] Looking up instance id from metadata service 
DEBU[0000] Creating sqs queue                            queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0000] Looking up sqs queue url                      queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0000] Subscribing queue to sns topic                queue=lifecycled-i-0f5c95a0d2569a5b6 topic=arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle
DEBU[0000] Polling metadata service for spot termination notices 
INFO[0000] Listening for lifecycle notifications        
DEBU[0000] Polling sqs for messages                      queueURL=https://sqs.us-east-1.amazonaws.com/731841313232/lifecycled-i-0f5c95a0d2569a5b6
^CINFO[0008] Shutting down gracefully...                  
DEBU[0008] Deleting sns subscription                     arn=arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle:c5669c67-b0e9-4d03-87f3-ff3dd2e88ff7
DEBU[0008] Deleting sqs queue                            url=https://sqs.us-east-1.amazonaws.com/731841313232/lifecycled-i-0f5c95a0d2569a5b6

And after executing a handler:

+ lifecycled --sns-topic arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle --handler /cleanup.sh --debug
INFO[0000] Looking up instance id from metadata service 
DEBU[0000] Creating sqs queue                            queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0002] Looking up sqs queue url                      queue=lifecycled-i-0f5c95a0d2569a5b6
DEBU[0002] Subscribing queue to sns topic                queue=lifecycled-i-0f5c95a0d2569a5b6 topic=arn:aws:sns:us-east-1:731841313232:drone-agent-lifecycle
DEBU[0002] Polling metadata service for spot termination notices 
INFO[0002] Listening for lifecycle notifications        
DEBU[0002] Polling sqs for messages                      queueURL=https://sqs.us-east-1.amazonaws.com/731841313232/lifecycled-i-0f5c95a0d2569a5b6
DEBU[0017] Received an SQS message                       subject=Auto Scaling:  Lifecycle action 'TERMINATING' for instance i-0f5c95a0d2569a5b6 in progress. type=Notification
INFO[0017] Executing handler                             handler=/cleanup.sh instanceid=i-0f5c95a0d2569a5b6 transition=autoscaling:EC2_INSTANCE_TERMINATING
+ echo -e Stopping Service
Stopping Service
INFO[0017] Handler finished successfully                 duration=268.707347ms handler=/cleanup.sh instanceid=i-0f5c95a0d2569a5b6 transition=autoscaling:EC2_INSTANCE_TERMINATING
INFO[0018] Lifecycle action completed successfully       instanceid=i-0f5c95a0d2569a5b6 transition=autoscaling:EC2_INSTANCE_TERMINATING
Connection to 10.215.93.133 closed by remote host.
Connection to 10.215.93.133 closed.

It doesn't get a chance to run the cleanup; I'm not sure why.

lox commented 6 years ago

My suspicion is that in some cases the instance is shut down so quickly that the cleanup doesn't get a chance to execute.

itsdalmo commented 6 years ago

My suspicion is that in some cases the instance is shut down so quickly that the cleanup doesn't get a chance to execute.

I might have misread the code, but it doesn't seem like this deferred call is ever actually run, since daemon.Start() will block until d.Queue.Receive(ch) returns, which it does not do unless an error occurs?

If that is correct, then the only way lifecycled will shut down and clean up the queue is:

A. d.Queue.Receive(ch) returns an error, or
B. the process receives a signal (e.g. SIGINT/SIGTERM).

So in a world where SQS functions without problems: when lifecycled handles a termination notice, it will execute the handler, issue a "continue" for the termination, and then wait for the shutdown signal to be sent to the instance (which AWS sends once the termination continues)? In that case, would it make sense to check whether it received a termination notice and proceed with the cleanup right after the "continue" has been sent, since that would give it more time to clean up the queue?

I might be missing a lot here, but this is how I read the code 😛
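
To make that concrete, here's a minimal, self-contained sketch of the control flow as I read it (hypothetical types, not the actual lifecycled source): the deferred cleanup only fires once the blocking Receive returns, which happens on an error or a cancelled context, never in the happy path.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// Hypothetical stand-in for the per-instance SQS queue.
type queue struct{}

func (q *queue) Receive(ctx context.Context, ch chan<- string) error {
	// Blocks "forever" in the happy path; only returns on error or cancellation.
	<-ctx.Done()
	return ctx.Err()
}

func (q *queue) Cleanup() { log.Println("unsubscribing from SNS, deleting SQS queue") }

func start(ctx context.Context, q *queue) error {
	// This deferred cleanup only runs once Receive returns, i.e. on an SQS
	// error or a signal-driven cancellation - never in the
	// "handler ran, continue sent" path.
	defer q.Cleanup()

	ch := make(chan string)
	return q.Receive(ctx, ch)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := start(ctx, &queue{}); err != nil && !errors.Is(err, context.DeadlineExceeded) {
		log.Fatal(err)
	}
}
```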

lox commented 6 years ago

Shouldn't termination always end up triggering option B? Shutdown of lifecycled will result in a signal.

lox commented 6 years ago

In that case, would it make sense to check whether it received a termination notice and proceed with the cleanup right after the "continue" has been sent, since that would give it more time to clean up the queue?

Yup, I agree with this. I haven't ever been able to substantiate that the issue is a lack of time to clean up the queue, but it couldn't hurt.

itsdalmo commented 6 years ago

Shouldn't termination always end up triggering option B? Shutdown of lifecycled will result in a signal.

Yep, but to buy more time it might be an idea to start cleaning up the queue immediately once a termination notice has been received and the handler has executed successfully?
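
Something along these lines is what I have in mind (hypothetical stand-ins, not the real lifecycled types): complete the lifecycle action for the termination notice, then go straight to queue cleanup rather than waiting for the shutdown signal.

```go
package main

import (
	"context"
	"log"
)

// Hypothetical interface standing in for the per-instance queue.
type cleaner interface {
	Unsubscribe() error
	Delete() error
}

// handleTermination sketches the suggestion: run the handler, signal
// "CONTINUE" for the lifecycle action, and then clean up the SNS
// subscription and SQS queue right away.
func handleTermination(ctx context.Context,
	runHandler func(context.Context) error,
	completeAction func(result string) error,
	q cleaner) error {

	if err := runHandler(ctx); err != nil {
		return err
	}
	if err := completeAction("CONTINUE"); err != nil {
		return err
	}
	// The instance will be shut down shortly; use the remaining time for cleanup.
	if err := q.Unsubscribe(); err != nil {
		log.Printf("failed to unsubscribe: %v", err)
	}
	return q.Delete()
}

type noopCleaner struct{}

func (noopCleaner) Unsubscribe() error { return nil }
func (noopCleaner) Delete() error      { return nil }

func main() {
	_ = handleTermination(context.Background(),
		func(ctx context.Context) error { return nil }, // e.g. run /cleanup.sh
		func(result string) error { return nil },       // e.g. CompleteLifecycleAction
		noopCleaner{})
}
```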

itsdalmo commented 6 years ago

Having played around with this, I'm not entirely sure that the problem is that there isn't enough time to clean up the queues. When the termination proceeds, AWS sends a shutdown -h to the instance and waits ~30 seconds before pulling the plug on the shutdown routine - and 30 seconds should be more than enough time to unsubscribe from SNS and delete the SQS queue?

itsdalmo commented 6 years ago

I've done some tests with my modified version of lifecycled and found that, in all cases, it manages to unsubscribe and delete the queue before being shut down.

It's very hard to debug this stuff since shutdown -h kills SSH and prevents us from using tail -f on the log file. But after stopping the instance in the console, and then starting it again, I could see from the logs that my program/daemon (which is very similar to lifecycled) was allowed to gracefully shut down.

Lessons:

CloudWatch: Related to the last point above: awslogsd.service, which is installed with the awslogs agent, shuts down pretty quickly, and since awslogs only ships logs every 5 seconds, it's pretty much guaranteed that the most interesting parts of the lifecycled logs will not show up in CloudWatch - unless we first log everything, sleep for ~20 seconds, and then complete the lifecycle (which feels very hacky). tl;dr: I think we either need to use a logrus hook to write directly to CloudWatch, or live without lifecycled logs in CloudWatch.
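
For reference, a minimal sketch of the logrus-hook idea, assuming the aws-sdk-go v1 CloudWatch Logs client, a pre-created log group and stream, and naive sequence-token handling (the group/stream names here are made up):

```go
package main

import (
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
	"github.com/sirupsen/logrus"
)

// cloudWatchHook pushes each log entry straight to CloudWatch Logs, so the
// final lines survive even if the instance terminates before the awslogs
// agent ships them. Assumes the log group and stream already exist.
type cloudWatchHook struct {
	mu       sync.Mutex
	client   *cloudwatchlogs.CloudWatchLogs
	group    string
	stream   string
	sequence *string
}

func (h *cloudWatchHook) Levels() []logrus.Level { return logrus.AllLevels }

func (h *cloudWatchHook) Fire(entry *logrus.Entry) error {
	msg, err := entry.String()
	if err != nil {
		return err
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	out, err := h.client.PutLogEvents(&cloudwatchlogs.PutLogEventsInput{
		LogGroupName:  aws.String(h.group),
		LogStreamName: aws.String(h.stream),
		SequenceToken: h.sequence,
		LogEvents: []*cloudwatchlogs.InputLogEvent{{
			Message:   aws.String(msg),
			Timestamp: aws.Int64(entry.Time.UnixNano() / int64(time.Millisecond)),
		}},
	})
	if err != nil {
		return err
	}
	h.sequence = out.NextSequenceToken
	return nil
}

func main() {
	sess := session.Must(session.NewSession())
	logrus.AddHook(&cloudWatchHook{
		client: cloudwatchlogs.New(sess),
		group:  "/lifecycled",          // hypothetical log group
		stream: "i-0f5c95a0d2569a5b6", // per-instance stream
	})
	logrus.Info("lifecycled logs now go directly to CloudWatch as well")
}
```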

Conclusion: In the case of lifecycled, I can't really say for sure why it is failing to clean up SQS queues (although I've seen it happen). As discussed above, if the instance receives SIGTERM, it should be captured and trigger the cleanupFunc. Even if you are running this unit file, the default TimeoutStopSec should be 90 seconds, which should be more than enough time to shut down gracefully. There is no After=network-online.target in your unit file, but I have one in my own (and I'm still seeing this issue myself).

tl;dr: This is very hard to debug since it does not happen every time, and the logs disappear because the instance is terminating. It's tempting to just create lifecycle hooks with short-ish timeouts and never signal the "continue"; instead, stop heartbeating and shut down the daemon (after running the handler script) and simply wait for the timeout - that way we would at least be able to see the logs for lifecycled in CloudWatch.
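
This is roughly what I mean, sketched with the aws-sdk-go autoscaling client (the group/hook names and the heartbeat interval are placeholders, not lifecycled's actual configuration):

```go
package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// heartbeatUntilDone keeps the lifecycle action alive with heartbeats while
// the handler runs, then simply stops heartbeating and returns without
// calling CompleteLifecycleAction - AWS proceeds when the hook times out.
func heartbeatUntilDone(ctx context.Context, asg *autoscaling.AutoScaling,
	group, hook, instanceID string, handler func(context.Context) error) error {

	done := make(chan error, 1)
	go func() { done <- handler(ctx) }()

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			// Handler finished: stop heartbeating, never signal "CONTINUE".
			return err
		case <-ticker.C:
			_, err := asg.RecordLifecycleActionHeartbeat(&autoscaling.RecordLifecycleActionHeartbeatInput{
				AutoScalingGroupName: aws.String(group),
				LifecycleHookName:    aws.String(hook),
				InstanceId:           aws.String(instanceID),
			})
			if err != nil {
				return err
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	sess := session.Must(session.NewSession())
	_ = heartbeatUntilDone(context.Background(), autoscaling.New(sess),
		"my-asg", "my-hook", "i-0f5c95a0d2569a5b6", // hypothetical names
		func(ctx context.Context) error { return nil }) // e.g. run /cleanup.sh
}
```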

lox commented 6 years ago

I've rolled all the recent changes you've made into v2.0.2 @itsdalmo. I'm going to release it in the latest Elastic Stacks and monitor. Thanks very much for your efforts.

lox commented 6 years ago

Unfortunately the changes in v2.0.2 haven't made a difference, but I think that https://github.com/buildkite/lifecycled/issues/34 is a strong candidate. Will aim for a new point release tomorrow.

itsdalmo commented 6 years ago

Just noticed another problem: https://github.com/buildkite/lifecycled/blob/master/daemon.go#L179. Since we have two goroutines reading from the same channel, executeHandler might actually read a signal in place of the goroutine in main.go, which means that the context is never cancelled.

And since we currently don't have a happy path, lifecycled is left running after executeHandler returns. This could lead to a situation where a SIGTERM is caught by the wrong goroutine (i.e. passed to the handler process), and when TimeoutStopSec is reached we just get a SIGKILL and no chance of a graceful shutdown.

This will likely be solved by #42, but just to be safe I think it's best to listen for ctx.Done() in executeHandler and send a generic os.Interrupt to the child process (if we want it to get the same signal as the root process, we need to use two channels).
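
A minimal sketch of that shape (not the actual lifecycled code): executeHandler watches ctx.Done() instead of the shared signal channel and forwards os.Interrupt to the child process.

```go
package main

import (
	"context"
	"os"
	"os/exec"
)

// executeHandler runs the handler script and, instead of reading from the
// shared signal channel, watches ctx.Done() and forwards a generic
// os.Interrupt to the child - leaving the signal channel to main.go.
func executeHandler(ctx context.Context, script string, args ...string) error {
	cmd := exec.Command(script, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case <-ctx.Done():
		// Parent is shutting down: interrupt the handler and wait for it.
		_ = cmd.Process.Signal(os.Interrupt)
		return <-done
	case err := <-done:
		return err
	}
}

func main() {
	_ = executeHandler(context.Background(), "/cleanup.sh") // hypothetical invocation
}
```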

itsdalmo commented 6 years ago

I believe this issue should be resolved in v2.1.1 (and the next release); at least it has been for autoscaling termination notices in my tests. Have you had a chance to test spot terminations @lox?

lox commented 5 years ago

This was fixed in 3.0.0, finally 🎉