Kuadrant / architecture

Architecture Documents

Measure the latency of the different "happy" scenarios #62

Closed alexsnaps closed 4 months ago

alexsnaps commented 6 months ago

In order to better understand the implications of leveraging the cloud providers' Redis instances, we want to build a latency distribution matrix for all the different providers, starting out with AWS's Elasticache.

We need a distribution for an LRT that:

We then need to test these in multiple deployment scenarios:

That use:

Performance tests

Latencies in multiple deployment scenarios

Using cloudping as latency data source

EDIT: no need to test further. We confirmed the expected understanding of Limitador's request-to-Redis model: the N+1 model.

Scenarios with transient gw & redis_cached

Scenario with write-behind-lock https://github.com/Kuadrant/limitador/pull/276

alexsnaps commented 5 months ago

What happens (currently)

Today, Limitador will do n + 1 round-trips to Redis (or in this case the cloud provider's Redis-alike, e.g. Elasticache) when on the "happy path", where n is the number of limits being hit.

```mermaid
sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end

    E->>+L: should_rate_limit?
    L->>R: MGET
    R->>L: counter values
    loop for every counter
        L->>R: LUA script increment
        R->>L: ()
    end
    L->>-E: Ok!
```

So say you have 3 limits (per hour, per day and per month) and your user is currently not above the threshold of any of these 3 limits: we'd need to do 4 round-trips to Redis to let them through and increment the counters. Based on the testing above, we can pretty much confirm that a single round-trip to Redis is in line with the figures from cloudping.
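To make the n + 1 pattern concrete, here is a minimal sketch of what that request path looks like against Redis, written with the `redis` crate. It is illustrative only: the key layout, the hard-coded limit and the Lua body are assumptions, not Limitador's actual storage code.

```rust
use redis::Script;

/// One MGET plus one Lua-script invocation per counter: n + 1 round-trips.
fn check_and_increment(
    con: &mut redis::Connection,
    counter_keys: &[&str],
    ttl_secs: usize,
) -> redis::RedisResult<bool> {
    // Round-trip 1: read every counter value at once.
    let values: Vec<Option<i64>> = redis::cmd("MGET").arg(counter_keys).query(con)?;

    // Decide locally whether every limit still has headroom (limit hard-coded here).
    let limit = 100;
    if values.iter().any(|v| v.unwrap_or(0) >= limit) {
        return Ok(false); // rate limited, nothing gets incremented
    }

    // Round-trips 2..=n+1: one server-side increment per counter.
    let incr = Script::new(
        r#"
        local v = redis.call('INCRBY', KEYS[1], ARGV[1])
        if v == tonumber(ARGV[1]) then redis.call('EXPIRE', KEYS[1], ARGV[2]) end
        return v
        "#,
    );
    for key in counter_keys {
        let _: i64 = incr.key(*key).arg(1).arg(ttl_secs).invoke(con)?;
    }
    Ok(true)
}
```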

What we could do about it (for now)

Improve on the current design

Instead of going to Redis n + 1 times, we could invoke a server-side Lua script that increments all counters in one go and returns the latest values in a single round-trip. While 1 is obviously much better than the current n + 1, a single round-trip time quickly gets into double-digit milliseconds. So going to Redis on every single request that is rate limited seems unreasonable.
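As a rough illustration of that single-round-trip idea, the sketch below pushes both the check and the increments into one server-side Lua script. Again, the script body, key/argument layout and semantics are assumptions for illustration, not a proposed Limitador change.

```rust
use redis::Script;

/// Check and increment every counter atomically in one round-trip.
fn authorize_in_one_round_trip(
    con: &mut redis::Connection,
    counter_keys: &[String],
    limits: &[i64],
    ttl_secs: i64,
) -> redis::RedisResult<bool> {
    let script = Script::new(
        r#"
        -- First pass: refuse if any counter is already at its limit.
        for i, key in ipairs(KEYS) do
            local v = tonumber(redis.call('GET', key) or '0')
            if v >= tonumber(ARGV[i]) then return 0 end
        end
        -- Second pass: increment everything, setting the TTL on first use.
        local ttl = tonumber(ARGV[#ARGV])
        for _, key in ipairs(KEYS) do
            if redis.call('INCR', key) == 1 then redis.call('EXPIRE', key, ttl) end
        end
        return 1
        "#,
    );
    let mut invocation = script.prepare_invoke();
    for key in counter_keys {
        invocation.key(key.as_str());
    }
    for limit in limits {
        invocation.arg(*limit);
    }
    invocation.arg(ttl_secs);
    let allowed: i64 = invocation.invoke(con)?;
    Ok(allowed == 1)
}
```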

Leverage the existing CachedRedisStorage

We do have an alternate Redis-backed AsyncCounterStorage implementation that caches counter values in Limitador's memory, avoiding the need for a round-trip to the backend Redis storage on every request.

```mermaid
sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    participant C as Cached counter
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end

    E->>+L: should_rate_limit?
    L->>C: Read local values
    L->>R: Fetch missing counters
    R->>L: values & ttl
    L->>C: (Store and) Increment counters
    L->>-E: Ok!
    loop Every second (by default)
        loop Per counter
            L->>R: LUA script increment
            R->>L: ()
        end
    end
```

While this addresses most of the cases, the jitter on the user's request remains, with a maximum of one round-trip to the Redis region from the call site (though right now it's worse than one RTT, because of the blocking nature of the write-behind task). At least that cost is paid once, rather than once plus an additional round-trip per counter.
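For context, a heavily simplified sketch of the write-behind idea is shown below: the request path only touches an in-memory map of pending deltas, and a background task flushes them to Redis periodically. The names and structure are made up for illustration; the real CachedRedisStorage also caches remote values and TTLs, and (as noted above) its flush currently blocks.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Default)]
struct PendingWrites {
    // counter key -> locally accumulated delta not yet pushed to Redis
    deltas: HashMap<String, i64>,
}

/// Background task: periodically drain the pending deltas and push them to Redis.
fn spawn_flusher(pending: Arc<Mutex<PendingWrites>>, redis_url: String, period: Duration) {
    thread::spawn(move || {
        let client = redis::Client::open(redis_url).expect("valid redis url");
        let mut con = client.get_connection().expect("redis connection");
        loop {
            thread::sleep(period);
            // Take the accumulated deltas without holding the lock during I/O.
            let batch: Vec<(String, i64)> = {
                let mut guard = pending.lock().unwrap();
                guard.deltas.drain().collect()
            };
            // One INCRBY per counter; the single-script variant above could
            // collapse this flush into one round-trip as well.
            for (key, delta) in batch {
                let _: Result<i64, _> =
                    redis::cmd("INCRBY").arg(&key).arg(delta).query(&mut con);
            }
        }
    });
}

/// Request path: counters are only bumped in memory, no Redis round-trip.
fn record_hit(pending: &Arc<Mutex<PendingWrites>>, counter_key: &str) {
    *pending
        .lock()
        .unwrap()
        .deltas
        .entry(counter_key.to_string())
        .or_insert(0) += 1;
}
```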

While this is probably the easiest way forward to address what we've now seen happening in AWS, there are a few items to address:

alexsnaps commented 5 months ago

Obviously, the list above contains small improvements to get us closer to something sustainable in such a deployment. A better caching strategy (rather than time-based expiry), e.g. Hi/Lo, would probably be preferable, as we know we are dealing with counters that tend towards a known limit. Eventually we'd want to completely avoid the need for coordination across the different Limitador instances, probably by leveraging CRDTs, like a grow-only counter.
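For reference, a grow-only counter (G-Counter) is about the simplest CRDT of that kind: each replica only increments its own slot, and merging two states is a per-slot max, so replicas converge without coordinating on writes. The sketch below is generic, not a proposed Limitador data structure.

```rust
use std::collections::HashMap;

#[derive(Clone, Default, Debug)]
struct GCounter {
    // replica id -> count observed locally by that replica
    slots: HashMap<String, u64>,
}

impl GCounter {
    /// A replica only ever bumps its own slot.
    fn increment(&mut self, replica: &str, by: u64) {
        *self.slots.entry(replica.to_string()).or_insert(0) += by;
    }

    /// The counter value is the sum of all replica slots.
    fn value(&self) -> u64 {
        self.slots.values().sum()
    }

    /// Merging is a per-slot max, which is commutative, associative and idempotent.
    fn merge(&mut self, other: &GCounter) {
        for (replica, &count) in &other.slots {
            let entry = self.slots.entry(replica.clone()).or_insert(0);
            *entry = (*entry).max(count);
        }
    }
}
```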