Closed: lox closed this issue 6 years ago
I think a single SQS queue, consumed by every instance, could work if each instance also runs Serf so that it can forward the message to the correct peer: https://www.serf.io
I don't have much to add I'm sorry…
All good @toolmantim, figured I'd include you anyway because I like your opinions 🙇🏼‍♂️
Regarding the different options:
There are other options as well:
But overall, I don't think what we have today (ephemeral queues) is a bad solution either. Is the biggest problem that the queues are not being cleaned up consistently?
Yeah, I can't think of any good way to make a single queue work better on reflection. It falls apart pretty quickly at any decent scale of instances for the reasons you outlined @itsdalmo. I agree 100% with your other conclusions too.
> Is the biggest problem that the queues are not being cleaned up consistently?
Yup, pretty much. Let's spend some time fixing that.
I'm going to close this for now. Thanks for your thoughts everyone.
We moved from polling a central queue to the current model of creating per-instance queues in https://github.com/buildkite/lifecycled/pull/9, and frankly I think it's been a huge failure.
We made the change because, in the previous model, large pools of instances (think 100+) all polled the same SQS queue and released the messages not destined for them, which delayed the time it took for the actual target instance to get its message. Distributed polling also meant that we'd sometimes hit rate limits on the SQS APIs.
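To make the delay in the shared-queue model concrete, here is a rough simulation rather than lifecycled code. It assumes instances poll in strict round-robin order and that the target happens to poll last, which is a simplification of real SQS behaviour. The names `message` and `receivesUntilDelivered` are made up for illustration.

```go
package main

import "fmt"

// message is a stand-in for a lifecycle notification addressed to one instance.
type message struct {
	targetInstance int
}

// receivesUntilDelivered simulates instances polling one shared queue in
// round-robin order (an assumption; real pollers are not ordered like this).
// Every instance that receives a message not addressed to it releases it back
// to the queue. The return value is the number of receive calls made before
// the target instance finally gets the message.
func receivesUntilDelivered(instances int, msg message) int {
	receives := 0
	for poller := 0; ; poller = (poller + 1) % instances {
		receives++
		if poller == msg.targetInstance {
			return receives
		}
		// Not ours: release the message so another instance can try.
	}
}

func main() {
	for _, n := range []int{10, 100, 500} {
		// Worst case for round-robin polling: the target polls last.
		got := receivesUntilDelivered(n, message{targetInstance: n - 1})
		fmt.Printf("%d instances: %d receive calls before delivery\n", n, got)
	}
}
```

In this toy model the number of wasted receive calls grows linearly with the pool size, which matches both the delivery delay and the API pressure described above.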
Perhaps we should look at going back to the previous model, or some alternative, like a central Lambda that listens to the SNS topic and then sends a message to lifecycled running on the instances.
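As a sketch of what the routing step of such a central dispatcher might look like: the code below only decodes a lifecycle notification body and hands it to a delivery callback keyed by instance ID. The `sender` callback is purely hypothetical, since how the message would actually reach lifecycled on an instance is the open question here. The JSON field names are the ones Auto Scaling uses in lifecycle hook notifications; the instance ID and token in `main` are dummy values.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// lifecycleNotice holds the fields of an Auto Scaling lifecycle hook
// notification that a dispatcher would need for routing.
type lifecycleNotice struct {
	InstanceID string `json:"EC2InstanceId"`
	Transition string `json:"LifecycleTransition"`
	Token      string `json:"LifecycleActionToken"`
}

// parseNotice decodes the notification body and rejects messages that do not
// name an instance, since those cannot be routed anywhere.
func parseNotice(body string) (lifecycleNotice, error) {
	var n lifecycleNotice
	if err := json.Unmarshal([]byte(body), &n); err != nil {
		return n, err
	}
	if n.InstanceID == "" {
		return n, errors.New("notification has no EC2InstanceId")
	}
	return n, nil
}

// sender is a hypothetical delivery hook: whatever mechanism ends up carrying
// the notice to the lifecycled process on the given instance.
type sender func(instanceID string, n lifecycleNotice) error

// dispatch parses one notification and forwards it to its target instance.
func dispatch(body string, send sender) error {
	n, err := parseNotice(body)
	if err != nil {
		return err
	}
	return send(n.InstanceID, n)
}

func main() {
	// Dummy notification body for illustration only.
	body := `{"LifecycleTransition":"autoscaling:EC2_INSTANCE_TERMINATING",` +
		`"EC2InstanceId":"i-0123456789abcdef0","LifecycleActionToken":"example-token"}`
	err := dispatch(body, func(id string, n lifecycleNotice) error {
		fmt.Printf("would notify %s about %s\n", id, n.Transition)
		return nil
	})
	if err != nil {
		fmt.Println("dispatch failed:", err)
	}
}
```

The point of the shape is that only one consumer reads the topic, so instances never receive (and have to release) messages meant for their peers.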
I'd love thoughts / feels @sj26 @toolmantim @itsdalmo.