AndyKwiat / sloqueue


Capacity guarantees #1

Open kirs opened 5 years ago

kirs commented 5 years ago

With the model described here, no lower-SLO jobs would be started until higher-SLO jobs are worked off. 99% of the time this is what we want, except when something like payments starves the system so hard that no webhooks or shipping rates are processed.

The way we've solved it with the current model is to always allocate some number of workers to low-SLO queues:

# queues are ordered
100 x jobs-payments: QUEUES=payments,default,low,webhooks
30 x jobs-default: QUEUES=default,low,webhooks,payments
20 x jobs-webhooks: QUEUES=webhooks,payments,default,low
5 x jobs-low: QUEUES=low,webhooks,payments,default

This model allowed us to allocate a small number of workers to the queues at the bottom of the list, so they have isolated capacity in case the higher-SLO queues are too overwhelmed. And when those workers have nothing to process in their preferred queue, they help other queues: jobs-low, for example, would pick up work from the payments queue.
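For context, the ordered-queue model above can be sketched roughly like this (the queue contents and the `pickJob` helper are hypothetical, not actual sloqueue code): each worker scans its configured queue list in order and takes the first job it finds, so a jobs-low worker always serves its own queue first and only then helps the others.

```go
package main

import "fmt"

// pickJob scans queues in the worker's configured order and takes the
// head job of the first non-empty queue it finds.
func pickJob(queues map[string][]string, order []string) (string, bool) {
	for _, name := range order {
		if jobs := queues[name]; len(jobs) > 0 {
			job := jobs[0]
			queues[name] = jobs[1:]
			return job, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical backlog: payments is overwhelmed, low has one job.
	queues := map[string][]string{
		"payments": {"charge-1", "charge-2"},
		"low":      {"cleanup-1"},
	}
	order := []string{"low", "webhooks", "payments", "default"}

	// A jobs-low worker serves its own queue first, which is what
	// gives the bottom queues their isolated capacity.
	job, _ := pickJob(queues, order)
	fmt.Println(job) // cleanup-1

	// Once low is drained, the same worker helps with payments.
	job, _ = pickJob(queues, order)
	fmt.Println(job) // charge-1
}
```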

This model helps with isolation, but it takes a lot of manual decisions to configure. Perhaps we could bake something similar into SLO queues?

One idea that I have is to use weights:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

type sloQueue struct {
    slo    time.Duration
    weight float64
}

// calculateWeight gives the strictest-SLO queue a weight of 1.0 and
// every laxer queue a proportionally smaller weight.
func calculateWeight(slo, min time.Duration) float64 {
    return float64(min) / float64(slo)
}

func main() {
    realtime := &sloQueue{slo: 1 * time.Second}
    payments := &sloQueue{slo: 5 * time.Second}
    rates := &sloQueue{slo: 30 * time.Second}
    webhook := &sloQueue{slo: 300 * time.Second}
    low := &sloQueue{slo: 900 * time.Second}

    // Ordered from strictest SLO to laxest.
    orderedQueues := []*sloQueue{realtime, payments, rates, webhook, low}

    highest := realtime

    for _, q := range orderedQueues {
        q.weight = calculateWeight(q.slo, highest.slo)
        fmt.Printf("slo=%d, weight=%.3f\n", q.slo/time.Second, q.weight)
    }

    // Sketch of the dispatch loop; Length, PeekPriority and currentTime
    // are placeholders for the real queue implementation.
    for {
        for _, q := range orderedQueues {
            if q.Length() <= 0 {
                continue
            }
            // Work the head job if it's already due, or otherwise
            // opportunistically with probability 1-weight, so laxer
            // queues still get some capacity even when never due.
            if q.PeekPriority() <= currentTime || rand.Float64() >= q.weight {
                fmt.Printf("processing %v\n", q.slo)
            }
        }
    }
}

So in the end it goes from a "pure SLO queue" to an "SLO queue + some amount of randomness", to give some guaranteed capacity to low-priority queues.

Looking forward to hearing what you think.

AndyKwiat commented 5 years ago

With the model described here, no lower SLO jobs would be started until higher SLO jobs are worked off.

This is not correct. I don't think I did a good job of explaining the algorithm.

Lower-priority (higher-SLO) jobs would still be started as long as the stricter-SLO jobs have not waited long enough. For example, say a payment job with an SLO of 5 seconds is in the queue, and there are a million webhooks that are super behind. The million webhooks will keep getting worked off until that payment job has waited 5 seconds.

Sorry, it might be easier just to whiteboard this thing.

kirs commented 5 years ago

Ohh right, I've missed that.

Rephrasing my point (which I think might still be relevant):

If we have both payments (SLO 5s) and webhooks (SLO 30s) backlogged, with jobs in both waiting for equally long, would we want to process them equally or would we want to shift priority slightly from one towards another?

AndyKwiat commented 5 years ago

If we have both payments (SLO 5s) and webhooks (SLO 30s) backlogged, with jobs in both waiting for equally long, would we want to process them equally or would we want to shift priority slightly from one towards another?

If they're both behind, the algorithm will always choose the lower-SLO queue first. So if payments is behind, ALL workers will work off payments until the first job in that queue is ahead of schedule (even by a fraction of a second).

But that doesn't mean we starve webhooks, because when we enqueue a payment job we have 5s before that job becomes 'top priority'. We don't just blindly dequeue from the lowest-SLO queue.

The really interesting question is what we should do when all queues are meeting their SLOs. Should we just make payments super fast? Or should we work off the longest queue? This decision matters more than it seems, especially when we're on the verge of falling behind.