kirs opened this issue 5 years ago
With the model described here, no lower SLO jobs would be started until higher SLO jobs are worked off.
This is not correct. I don't think I did a good job of explaining the algorithm.
Lower SLO jobs would be started if higher SLO jobs have not waited long enough. For example, say a payment job with an SLO of 5 seconds is in the queue, and there are a million webhooks that are super behind. The million webhooks will keep getting worked off until that payment job has waited 5 seconds.
Sorry, it might be easier to just whiteboard this thing.
Ohh right, I've missed that.
Rephrasing my point (which I think might still be relevant):
If we have both payments (SLO 5s) and webhooks (SLO 30s) backlogged, with jobs in both waiting for equally long, would we want to process them equally or would we want to shift priority slightly from one towards another?
> If we have both payments (SLO 5s) and webhooks (SLO 30s) backlogged, with jobs in both waiting for equally long, would we want to process them equally or would we want to shift priority slightly from one towards another?
If they're both behind, the algorithm will always choose the lower SLO queue first. So if payment is behind, ALL workers will work off payment until the first job in the queue is ahead of schedule (even by a fraction of a second).
But that doesn't mean we starve webhooks, because when we enqueue a payment job we have 5s before that job becomes 'top priority'. We don't just blindly dequeue from the lowest-SLO queue.
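The selection rule described above could be sketched roughly like this. This is an illustration, not the actual implementation; the data shapes, names, and the ahead-of-schedule fallback are my assumptions:

```python
import time

def pick_queue(queues, now=None):
    """`queues` maps queue name to (slo_seconds, head_job_enqueued_at).

    Returns the name of the queue to dequeue from next, or None if all
    queues are empty (head_job_enqueued_at is None).
    """
    now = time.time() if now is None else now
    nonempty = {n: (slo, t) for n, (slo, t) in queues.items() if t is not None}
    # Queues whose head job has already waited at least its SLO are
    # "behind schedule".
    behind = [(slo, n) for n, (slo, t) in nonempty.items() if now - t >= slo]
    if behind:
        # At least one queue is behind: the lowest SLO wins, so a late
        # payment job preempts a million late webhooks.
        return min(behind)[1]
    # Nobody is behind schedule; what to do here is the open question
    # discussed below. One arbitrary option: serve the queue closest to
    # missing its SLO.
    slack = [(slo - (now - t), n) for n, (slo, t) in nonempty.items()]
    return min(slack)[1] if slack else None
```

In the payment example above: while the payment job has only waited 2 of its 5 seconds, the backlogged webhooks queue is the one behind schedule and keeps getting served; once the payment job crosses 5 seconds, it wins.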
The really interesting question is: what should we do if all queues are meeting their SLOs? Should we just make payments super fast? Or should we work off the longest queue? This decision is more important than it seems, especially if we're on the verge of falling behind.
> With the model described here, no lower SLO jobs would be started until higher SLO jobs are worked off.

99% of the time this is what we want, except when something like `payments` starves the system so hard that no webhooks or shipping rates are processed. The way we've solved it with the current model is to always allocate some number of workers to low-SLO queues:

This model allowed us to allocate some small number of workers to queues at the bottom of the list, to have isolated capacity in case higher-SLO queues are too overwhelmed. And when those workers have nothing to process in their preferred queue, they help other queues. For example, `jobs-low` would process something from the `payments` queue. This model helps with isolation, but it also takes a lot of manual decisions to configure. Perhaps we could bake something similar into SLO queues?
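For concreteness, that manually configured model might look something like this sketch, where each worker is pinned to a preferred queue and steals from other queues when its own is empty (queue names and shapes are assumptions, not the real configuration):

```python
def next_job(preferred, queues):
    """`queues` maps queue name to a FIFO list of jobs.

    Serve the worker's preferred queue first; if it is empty, help out
    another queue, e.g. a jobs-low worker picking up a payments job.
    """
    if queues.get(preferred):
        return queues[preferred].pop(0)
    for name, jobs in queues.items():
        if name != preferred and jobs:
            return jobs.pop(0)
    return None
```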
One idea that I have is to use weights:
So in the end it goes from "pure SLO queue" to "SLO queue + some amount of randomness", to give some guaranteed capacity to low-priority queues.
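One way to read the weights idea, sketched under assumptions (the `epsilon` knob and the weight values are mine, not the actual proposal): most of the time follow the pure SLO choice, but with some probability pick a weighted-random queue instead, so low-priority queues keep a guaranteed slice of capacity.

```python
import random

def pick_with_floor(slo_choice, weights, epsilon=0.1, rng=random):
    """With probability `epsilon`, ignore the SLO ordering and pick a
    queue at random according to `weights`; otherwise take `slo_choice`
    (the result of the pure SLO algorithm)."""
    if rng.random() < epsilon:
        names = list(weights)
        return rng.choices(names, weights=[weights[n] for n in names])[0]
    return slo_choice
```

With `epsilon=0.1` and weights like `{"jobs-low": 1, "jobs-default": 3}`, roughly 10% of dequeue attempts go to the weighted pool even if `payments` is permanently backlogged.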
Looking forward to hearing what you think.