Kuadrant / architecture

Architecture Documents

Measure the latency of the different "happy" scenarios #62

Closed alexsnaps closed 4 months ago

alexsnaps commented 6 months ago

In order to better understand the implications of leveraging the cloud providers' Redis instances, we want to build a latency distribution matrix for all the different providers, starting out with AWS's Elasticache.

We need a distribution for an LRT that:

We then need to test these in multiple deployment scenarios:

That use:

Performance tests

Latencies in multiple deployment scenarios

Using cloudping as latency data source

EDIT: no need to test further. We confirmed the expected understanding of Limitador's request-to-Redis model: the N+1 model.

Scenarios with transient gw & redis_cached

Scenario with write-behind-lock https://github.com/Kuadrant/limitador/pull/276

alexsnaps commented 5 months ago

What happens (currently)

Today, Limitador will do n + 1 round-trips to Redis (or in this case the cloud provider's Redis-alike, e.g. Elasticache) when on the "happy path", where n is the number of limits being hit.

```mermaid
sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end

    E->>+L: should_rate_limit?
    L->>R: MGET
    R->>L: counter values
    loop for every counter
        L->>R: LUA script increment
        R->>L: ()
    end
    L->>-E: Ok!
```

So say you have 3 limits (per hour, per day and per month) and your user is currently not above the threshold of any of these 3 limits: we'd need to do 4 round-trips to Redis to let them through and increment the counters. Based on the testing above, we can pretty much confirm that a single round-trip to Redis is in line with the figures from cloudping.
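To make the n + 1 pattern concrete, here is a minimal sketch of what that request path looks like against Redis, written with the `redis` crate. It is illustrative only: the key layout, the hard-coded limit and the Lua body are assumptions, not Limitador's actual storage code.

```rust
use redis::Script;

/// One MGET plus one Lua-script invocation per counter: n + 1 round-trips.
fn check_and_increment(
    con: &mut redis::Connection,
    counter_keys: &[&str],
    ttl_secs: usize,
) -> redis::RedisResult<bool> {
    // Round-trip 1: read every counter value at once.
    let values: Vec<Option<i64>> = redis::cmd("MGET").arg(counter_keys).query(con)?;

    // Decide locally whether every limit still has headroom (limit hard-coded here).
    let limit = 100;
    if values.iter().any(|v| v.unwrap_or(0) >= limit) {
        return Ok(false); // rate limited, nothing gets incremented
    }

    // Round-trips 2..=n+1: one server-side increment per counter.
    let incr = Script::new(
        r#"
        local v = redis.call('INCRBY', KEYS[1], ARGV[1])
        if v == tonumber(ARGV[1]) then redis.call('EXPIRE', KEYS[1], ARGV[2]) end
        return v
        "#,
    );
    for key in counter_keys {
        let _: i64 = incr.key(*key).arg(1).arg(ttl_secs).invoke(con)?;
    }
    Ok(true)
}
```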

What we could do about it (for now)

Improve on the current design

Instead of going to Redis n + 1 times, we could invoke a server-side Lua script that increments all counters in one go and returns the latest values in a single round-trip. While 1 is obviously much better than the current n + 1, a single round-trip time quickly gets into double-digit milliseconds. So going to Redis on every single request that is rate limited seems unreasonable.
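As a rough illustration of that single-round-trip idea, the sketch below pushes both the check and the increments into one server-side Lua script. Again, the script body, key/argument layout and semantics are assumptions for illustration, not a proposed Limitador change.

```rust
use redis::Script;

/// Check and increment every counter atomically in one round-trip.
fn authorize_in_one_round_trip(
    con: &mut redis::Connection,
    counter_keys: &[String],
    limits: &[i64],
    ttl_secs: i64,
) -> redis::RedisResult<bool> {
    let script = Script::new(
        r#"
        -- First pass: refuse if any counter is already at its limit.
        for i, key in ipairs(KEYS) do
            local v = tonumber(redis.call('GET', key) or '0')
            if v >= tonumber(ARGV[i]) then return 0 end
        end
        -- Second pass: increment everything, setting the TTL on first use.
        local ttl = tonumber(ARGV[#ARGV])
        for _, key in ipairs(KEYS) do
            if redis.call('INCR', key) == 1 then redis.call('EXPIRE', key, ttl) end
        end
        return 1
        "#,
    );
    let mut invocation = script.prepare_invoke();
    for key in counter_keys {
        invocation.key(key.as_str());
    }
    for limit in limits {
        invocation.arg(*limit);
    }
    invocation.arg(ttl_secs);
    let allowed: i64 = invocation.invoke(con)?;
    Ok(allowed == 1)
}
```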

Leverage the existing CachedRedisStorage

We do have an alternate Redis-backed AsyncCounterStorage implementation that caches counter values in Limitador's memory, avoiding the need for a round-trip to the backend Redis storage on every request.

```mermaid
sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    participant C as Cached counter
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end

    E->>+L: should_rate_limit?
    L->>C: Read local values
    L->>R: Fetch missing counters
    R->>L: values & ttl
    L->>C: (Store and) Increment counters
    L->>-E: Ok!
    loop Every second (by default)
        loop Per counter
            L->>R: LUA script increment
            R->>L: ()
        end
    end
```

While this addresses most of the cases, the jitter on the user's request remains, with a maximum of one round-trip to the Redis region from the call site (though right now it's worse than one RTT, because of the blocking nature of the write-behind task). At least that cost is paid once, rather than once plus an additional round-trip per counter.
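For context, a heavily simplified sketch of the write-behind idea is shown below: the request path only touches an in-memory map of pending deltas, and a background task flushes them to Redis periodically. The names and structure are made up for illustration; the real CachedRedisStorage also caches remote values and TTLs, and (as noted above) its flush currently blocks.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Default)]
struct PendingWrites {
    // counter key -> locally accumulated delta not yet pushed to Redis
    deltas: HashMap<String, i64>,
}

/// Background task: periodically drain the pending deltas and push them to Redis.
fn spawn_flusher(pending: Arc<Mutex<PendingWrites>>, redis_url: String, period: Duration) {
    thread::spawn(move || {
        let client = redis::Client::open(redis_url).expect("valid redis url");
        let mut con = client.get_connection().expect("redis connection");
        loop {
            thread::sleep(period);
            // Take the accumulated deltas without holding the lock during I/O.
            let batch: Vec<(String, i64)> = {
                let mut guard = pending.lock().unwrap();
                guard.deltas.drain().collect()
            };
            // One INCRBY per counter; the single-script variant above could
            // collapse this flush into one round-trip as well.
            for (key, delta) in batch {
                let _: Result<i64, _> =
                    redis::cmd("INCRBY").arg(&key).arg(delta).query(&mut con);
            }
        }
    });
}

/// Request path: counters are only bumped in memory, no Redis round-trip.
fn record_hit(pending: &Arc<Mutex<PendingWrites>>, counter_key: &str) {
    *pending
        .lock()
        .unwrap()
        .deltas
        .entry(counter_key.to_string())
        .or_insert(0) += 1;
}
```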

While this is probably the easiest way forward to address what we've now seen happening in AWS, there are a few items to address:

alexsnaps commented 5 months ago

Obviously, the list above contains small improvements to get us closer to something sustainable in such a deployment. A better caching strategy (rather than time-based expiry), e.g. Hi/Lo, would probably be preferable, as we know we are dealing with counters that tend towards a known limit. Eventually we'd want to completely avoid the need for coordination across the different Limitador instances, probably by leveraging CRDTs, like a grow-only counter.
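For reference, a grow-only counter (G-Counter) is about the simplest CRDT of that kind: each replica only increments its own slot, and merging two states is a per-slot max, so replicas converge without coordinating on writes. The sketch below is generic, not a proposed Limitador data structure.

```rust
use std::collections::HashMap;

#[derive(Clone, Default, Debug)]
struct GCounter {
    // replica id -> count observed locally by that replica
    slots: HashMap<String, u64>,
}

impl GCounter {
    /// A replica only ever bumps its own slot.
    fn increment(&mut self, replica: &str, by: u64) {
        *self.slots.entry(replica.to_string()).or_insert(0) += by;
    }

    /// The counter value is the sum of all replica slots.
    fn value(&self) -> u64 {
        self.slots.values().sum()
    }

    /// Merging is a per-slot max, which is commutative, associative and idempotent.
    fn merge(&mut self, other: &GCounter) {
        for (replica, &count) in &other.slots {
            let entry = self.slots.entry(replica.clone()).or_insert(0);
            *entry = (*entry).max(count);
        }
    }
}
```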