Go back to polling vs ephemeral SQS queues

lox commented 6 years ago

We moved from polling a central queue to the current model of creating per-instance queues in https://github.com/buildkite/lifecycled/pull/9, and frankly I think it's been a huge failure.

We made the change because, in the previous model, large pools of instances (think 100+) would all poll the same SQS queue and release any messages not destined for them, which delayed how long it took for the intended instance to get its message. Distributed polling also meant that we'd sometimes hit rate limits on the SQS APIs.
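For anyone who didn't see the old model, here's a rough sketch of what each instance's loop amounted to. It's an in-memory stand-in rather than the real lifecycled code: `Queue`, `Message`, `newQueue` and `pollOnce` are names made up for illustration, and a random pick stands in for SQS handing back an arbitrary message.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Message is a hypothetical lifecycle notification addressed to one instance.
type Message struct {
	InstanceID string
}

// Queue is an in-memory stand-in for the shared SQS queue (an assumption for
// illustration, not the SQS API).
type Queue struct {
	rng      *rand.Rand
	messages []Message
}

// newQueue builds a queue holding one notification per given instance ID.
func newQueue(seed int64, instanceIDs ...string) *Queue {
	q := &Queue{rng: rand.New(rand.NewSource(seed))}
	for _, id := range instanceIDs {
		q.messages = append(q.messages, Message{InstanceID: id})
	}
	return q
}

// pollOnce is one iteration of an instance's loop: receive an arbitrary
// message, keep it only if it is addressed to this instance, and otherwise
// release it back to the queue for its real target.
func pollOnce(q *Queue, instanceID string) bool {
	if len(q.messages) == 0 {
		return false
	}
	i := q.rng.Intn(len(q.messages))
	if q.messages[i].InstanceID != instanceID {
		return false // released: the message stays on the queue
	}
	// Our own notification: delete it so nobody else receives it.
	q.messages = append(q.messages[:i], q.messages[i+1:]...)
	return true
}

func main() {
	q := newQueue(1, "i-aaa", "i-bbb", "i-ccc")
	wasted := 0
	for !pollOnce(q, "i-ccc") {
		wasted++
	}
	fmt.Printf("i-ccc received %d messages meant for others before its own\n", wasted)
}
```

Every `false` from `pollOnce` is an API call that did nothing useful, which is where both the delay and the rate limiting came from once the pool got large.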

Perhaps we should look at going back to the previous model, or some alternative, like a central lambda that listens to the SNS topic and then sends a message to lifecycled running on instances.

I'd love thoughts / feels @sj26 @toolmantim @itsdalmo.

sj26 commented 6 years ago

I think a single SQS queue, consumed by every instance, could work if each instance also runs serf so it can forward the message to the correct peer: https://www.serf.io

toolmantim commented 6 years ago

I don't have much to add I'm sorry…

lox commented 6 years ago

All good @toolmantim, figured I'd include you anyway because I like your opinions 🙇🏼‍♂️

itsdalmo commented 6 years ago

Regarding the different options:

Single SQS: If each instance only handles (or deletes) notifications directed at itself, and there is no way to paginate the queue messages, could large clusters end up dependent on "luck of the draw" to find their own notifications?
- E.g., if I have 10,000 instances in an ASG and decide to scale back by 10%, then each time instance A receives a message from the queue it has a 1/1000 chance of getting its own, since the message received is random? And looking at a single instance in isolation, the chances are no better on the 2nd attempt. However, some of those 1000 instances should have got their notification on the first attempt, which improves the situation for the remaining instances?
Single SQS w/forwarding (using serf or Lambda): Increases complexity a lot, and requires opening ingress on the security group in order for the instances to talk to each other - which I'd be very reluctant to do myself.

There are other options as well:

Single SQS, lambda handler and SSM RunCommand:
- Pro: 3/3 services managed by AWS.
- Con: Complexity, and requires running SSM Agent on the instance.
Single SQS, write lifecycle notifications to S3, check S3 to know when it's time to shut down:
- Details: Processing messages is made generic by only writing the notification to S3, and each instance instead discovers that it is being shut down by looking for its instance ID in the S3 bucket (and the bucket is cleaned up automatically by setting an expiration on the files).
- Pros: 2/2 services managed by AWS. Don't have to open ingress, no "luck of the draw".
- Cons: ?
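
A minimal sketch of the instance side of the S3 option, to make the idea concrete. Everything here is hypothetical: `MarkerStore`, `memoryStore`, `markerKey` and the `terminating/` key layout are invented for illustration, and a real implementation would presumably back `Exists` with an S3 HeadObject request instead of a map.

```go
package main

import "fmt"

// MarkerStore is a hypothetical abstraction over the bucket, so the polling
// logic can be shown without the AWS SDK.
type MarkerStore interface {
	Exists(key string) (bool, error)
}

// memoryStore is an in-memory fake used only for this example.
type memoryStore map[string]bool

func (m memoryStore) Exists(key string) (bool, error) { return m[key], nil }

// markerKey is an assumed key layout: one object per terminating instance.
func markerKey(instanceID string) string {
	return "terminating/" + instanceID
}

// waitForTermination checks for this instance's marker up to maxChecks times,
// calling sleep between checks. It reports whether the marker was found.
func waitForTermination(store MarkerStore, instanceID string, maxChecks int, sleep func()) (bool, error) {
	for i := 0; i < maxChecks; i++ {
		found, err := store.Exists(markerKey(instanceID))
		if err != nil {
			return false, fmt.Errorf("checking marker for %s: %w", instanceID, err)
		}
		if found {
			return true, nil
		}
		sleep()
	}
	return false, nil
}

func main() {
	store := memoryStore{markerKey("i-aaa"): true}
	for _, id := range []string{"i-aaa", "i-bbb"} {
		found, err := waitForTermination(store, id, 3, func() {})
		fmt.Println(id, "terminating:", found, "error:", err)
	}
}
```

Each instance only ever asks about its own key, so there is no shared receive loop and no "luck of the draw". The trade-off is one request per instance per polling interval.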

But overall, I don't think what we have today (ephemeral queues) is a bad solution either. Is the biggest problem that the queues are not being cleaned up consistently?

lox commented 6 years ago

Yeah, on reflection I can't think of any good way to make a single queue work better. It falls apart pretty quickly at any decent scale of instances for the reasons you outlined @itsdalmo. I agree 100% with your other conclusions too.

> Is the biggest problem that the queues are not being cleaned up consistently?

Yup, pretty much. Let's spend some time fixing that.

I'm going to close this for now. Thanks for your thoughts everyone.

buildkite / lifecycled

Go back to polling vs ephemeral SQS queues #22