For hackdays I created a queue simulator and experimented with different implementations of queues based on aging. Although I tried a bunch of different solutions, this document outlines what I think is the most promising one.
Right now we have a lot of different queues with different priorities, as Kir explained in his JobsDB Project. Tuning the performance characteristics of each queue is not trivial, since there are multiple types of job workers, each working on multiple job queues.
What we really want is to tell the queue system each job type's SLO: how long it is acceptable for a job of that type to wait in the queue. These could be similar to the Jobs SLOs we currently monitor, for example:
| Job Type | SLO |
|---|---|
| payment | 5 sec |
| default | 30 sec |
| webhook | 5 min |
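As a rough sketch of how the SLO table above could be fed to the queue system, the snippet below stamps each job with an expected dequeue time when it is enqueued. The `Job` struct, the `sloMillis` map, the `newJob` helper, and the rule "expected dequeue time = enqueue time + SLO" are all assumptions made for illustration, as is the fallback to the `default` SLO for unknown job types.

```go
package main

import "fmt"

// Job is a hypothetical job record. DequeueBy is the assumed
// "expected dequeue time" in milliseconds: enqueue time + SLO.
type Job struct {
	Type      string
	DequeueBy int64
}

// sloMillis mirrors the SLO table above, in milliseconds.
var sloMillis = map[string]int64{
	"payment": 5000,
	"default": 30000,
	"webhook": 300000,
}

// newJob stamps a job with its expected dequeue time.
// Unknown job types fall back to the "default" SLO (an assumption).
func newJob(jobType string, nowMillis int64) Job {
	slo, ok := sloMillis[jobType]
	if !ok {
		slo = sloMillis["default"]
	}
	return Job{Type: jobType, DequeueBy: nowMillis + slo}
}

func main() {
	for _, t := range []string{"payment", "default", "webhook"} {
		j := newJob(t, 1000)
		fmt.Printf("%s should be dequeued by t=%dms\n", j.Type, j.DequeueBy)
	}
}
```

With this representation, tuning a queue means changing one number in the table rather than reassigning workers.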
All workers call the same dequeue operation (i.e. no special-purpose workers).
Here is what the algorithm looks like in code:
    func dequeueNextJob() *Job {
        // Queues are ordered from shortest SLO to longest.
        for _, q := range orderedQueues {
            if q.Length() <= 0 {
                continue
            }
            // PeekPriority() returns the expected dequeue time of the job at the head.
            if q.PeekPriority() <= currentTime {
                return q.Pop()
            }
        }
        // handle case where everyone is ahead of schedule ...
        // ...
    }
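To make the snippet above easier to experiment with, here is a minimal self-contained sketch. The `Queue` type, its slice-backed FIFO, and the millisecond timestamps are assumptions for illustration. The source leaves the "everyone is ahead of schedule" case open; this sketch fills it with one possible choice (serve the head with the earliest expected dequeue time), which is not necessarily what the simulator does.

```go
package main

import "fmt"

// Job carries its expected dequeue time in milliseconds
// (assumed to be enqueue time + SLO).
type Job struct {
	Type      string
	DequeueBy int64
}

// Queue is a FIFO of jobs that share one SLO, so the head
// always has the earliest expected dequeue time in that queue.
type Queue struct {
	jobs []*Job
}

func (q *Queue) Push(j *Job)         { q.jobs = append(q.jobs, j) }
func (q *Queue) Length() int         { return len(q.jobs) }
func (q *Queue) PeekPriority() int64 { return q.jobs[0].DequeueBy }
func (q *Queue) Pop() *Job {
	j := q.jobs[0]
	q.jobs = q.jobs[1:]
	return j
}

// dequeueNextJob scans queues from shortest SLO to longest and returns
// the first head job that is due. If nothing is due, it falls back to the
// head with the earliest expected dequeue time (assumption, see lead-in).
// It returns nil when every queue is empty.
func dequeueNextJob(orderedQueues []*Queue, now int64) *Job {
	var earliest *Queue
	for _, q := range orderedQueues {
		if q.Length() == 0 {
			continue
		}
		if q.PeekPriority() <= now {
			return q.Pop()
		}
		if earliest == nil || q.PeekPriority() < earliest.PeekPriority() {
			earliest = q
		}
	}
	if earliest == nil {
		return nil
	}
	return earliest.Pop()
}

func main() {
	payment, webhook := &Queue{}, &Queue{}
	webhook.Push(&Job{Type: "webhook", DequeueBy: 300000})
	payment.Push(&Job{Type: "payment", DequeueBy: 304000})
	orderedQueues := []*Queue{payment, webhook}
	now := int64(300500)

	fmt.Println(dequeueNextJob(orderedQueues, now).Type)   // webhook: it is due, the payment job is not
	fmt.Println(dequeueNextJob(orderedQueues, now).Type)   // payment: nothing is due, fallback applies
	fmt.Println(dequeueNextJob(orderedQueues, now) == nil) // true: all queues are empty
}
```

Passing the current time as an argument instead of reading a clock keeps the function deterministic, which is convenient in a simulator.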
Low-priority queues (e.g. `webhook`) will get sacrificed so that high-priority queues (e.g. `payment`) can meet their SLOs.
The high-priority queues will throttle themselves so as not to starve the low-priority queues.
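Both properties can be checked with a toy simulation. The sketch below is not the hackdays simulator; it is a simplified stand-in under stated assumptions: a single worker that completes one job per second, one `payment` and one `webhook` job arriving every second (twice the worker's capacity), expected dequeue time equal to enqueue time plus SLO, and an earliest-deadline fallback when nothing is due. The names `pick`, `simulate`, and `result` are hypothetical.

```go
package main

import "fmt"

// result summarises one simulated overload run.
type result struct {
	maxPaymentWait int64 // longest wait of any served payment job, in ms
	webhooksServed int   // webhook jobs that got a worker
	webhooksQueued int   // webhook jobs still waiting at the end
}

// pick returns the index of the queue to serve at time now, or -1 if all
// queues are empty. deadlines[i] is the FIFO of expected dequeue times for
// queue i; queues are ordered from shortest SLO to longest.
func pick(deadlines [][]int64, now int64) int {
	earliest := -1
	for i, q := range deadlines {
		if len(q) == 0 {
			continue
		}
		if q[0] <= now {
			return i // first due queue in SLO order wins
		}
		if earliest == -1 || q[0] < deadlines[earliest][0] {
			earliest = i
		}
	}
	return earliest // assumed fallback: earliest expected dequeue time
}

// simulate runs the overload scenario for the given number of seconds.
func simulate(seconds int) result {
	slo := []int64{5000, 300000} // payment, webhook (ms)
	deadlines := make([][]int64, len(slo))
	var r result
	for s := 0; s < seconds; s++ {
		now := int64(s) * 1000
		// One job of each type arrives every second.
		for i := range slo {
			deadlines[i] = append(deadlines[i], now+slo[i])
		}
		// The single worker serves exactly one job per second.
		i := pick(deadlines, now)
		wait := now - (deadlines[i][0] - slo[i])
		deadlines[i] = deadlines[i][1:]
		if i == 0 {
			if wait > r.maxPaymentWait {
				r.maxPaymentWait = wait
			}
		} else {
			r.webhooksServed++
		}
	}
	r.webhooksQueued = len(deadlines[1])
	return r
}

func main() {
	r := simulate(400)
	fmt.Printf("max payment wait: %dms\n", r.maxPaymentWait)
	fmt.Printf("webhooks served: %d, still queued: %d\n", r.webhooksServed, r.webhooksQueued)
}
```

In this toy run, `payment` waits never exceed the 5000 ms SLO, and a handful of `webhook` jobs are served while the `payment` jobs are still ahead of schedule (the throttling). After that, `payment` jobs take every worker slot and the `webhook` backlog keeps growing (the sacrifice).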
Example of WAIT TIMES for a high-load event:
SLOs: `payment` = 5000 ms, `default` = 30000 ms, `webhook` = 300000 ms
In the above example, `payment` wait times go to about 5 seconds, `default` wait times go to about 30 seconds, and `webhook` is sacrificed until the high load ends.