Today, Limitador will do `n + 1` round-trips to Redis (or in this case the cloud provider's Redis-alike, e.g. ElastiCache) in the "happy path", where `n` is the number of limits being hit.
```mermaid
sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end
    E->>+L: should_rate_limit?
    L->>R: MGET
    R->>L: counter values
    loop for every counter
        L->>R: LUA script increment
        R->>L: ()
    end
    L->>-E: Ok!
```
So say you have 3 limits (per hour, per day and per month) and your user is currently below the threshold of all 3: we'd need to do 4 round-trips to Redis to let them through and increment the counters. Based on the testing above, we can fairly confidently confirm that a single round-trip to Redis is in line with the figures from cloudping.
Instead of going to Redis `n + 1` times, we could invoke a server-side Lua script that increments all counters in one go and returns the latest values in a single round-trip. While `1` is obviously much better than the current `n + 1`, a single round-trip quickly reaches double-digit milliseconds. So going to Redis on every single request that is rate limited seems unreasonable.
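For illustration, here is a minimal sketch of what such a script could look like, driven through the Rust `redis` crate's `Script` API. The key layout and per-counter TTL handling are assumptions made up for this example, not Limitador's actual schema:

```rust
use redis::{aio::MultiplexedConnection, RedisResult, Script};

// Hypothetical sketch: increment every counter and return the new values in a
// single round-trip. Each KEYS[i] is a counter key; ARGV[i] is that counter's
// window TTL in seconds. Key names and TTL layout are illustrative only.
async fn increment_all(
    conn: &mut MultiplexedConnection,
    counters: &[(&str, i64)], // (key, ttl_seconds) pairs
) -> RedisResult<Vec<i64>> {
    let script = Script::new(
        r#"
        local results = {}
        for i, key in ipairs(KEYS) do
            local value = redis.call('INCR', key)
            if value == 1 then
                -- first hit in this window: start the window's expiry
                redis.call('EXPIRE', key, ARGV[i])
            end
            results[i] = value
        end
        return results
        "#,
    );
    let mut invocation = script.prepare_invoke();
    for (key, ttl) in counters {
        invocation.key(*key);
        invocation.arg(*ttl);
    }
    invocation.invoke_async(conn).await
}
```

Since the whole loop runs server-side, the client pays one round-trip no matter how many limits apply, and `EXPIRE` is only set when a counter is first created in its window.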
### `CachedRedisStorage`
We do have an alternate Redis-backed `AsyncCounterStorage` implementation that caches counter values in Limitador's memory, avoiding the round-trip to the backend Redis storage.
```mermaid
sequenceDiagram
    participant E as Envoy
    participant L as Limitador
    participant C as Cached counter
    box grey Cloud provider's Redis-alike
        participant R as Redis
    end
    E->>+L: should_rate_limit?
    L->>C: Read local values
    L->>R: Fetch missing counters
    R->>L: values & ttl
    L->>C: (Store and) Increment counters
    L->>-E: Ok!
    loop Every second (by default)
        loop Per counter
            L->>R: LUA script increment
            R->>L: ()
        end
    end
```
While this addresses most of the cases, the jitter on the user's request remains, with a maximum of one round-trip to Redis's region from the call-site (though right now it's worse than one RTT, because of the blocking nature of the write-behind task). At least that cost is paid once, rather than once plus an additional round-trip per counter.
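For context, a write-behind flush might look roughly like the sketch below: the hot path only touches an in-memory map of deltas, and a background task periodically swaps the map out and pushes it to Redis. This is a hypothetical illustration, not Limitador's actual `CachedRedisStorage`:

```rust
use std::{collections::HashMap, sync::Arc, time::Duration};
use tokio::sync::Mutex;

// Hypothetical sketch of a non-blocking write-behind loop: accumulated
// per-counter deltas are swapped out and flushed on an interval, so the
// request path only ever touches the in-memory map. Names are illustrative.
async fn write_behind_loop(
    pending: Arc<Mutex<HashMap<String, i64>>>,
    mut conn: redis::aio::MultiplexedConnection,
) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        // Hold the lock only long enough to swap the delta map out, then do
        // the network I/O without blocking new increments.
        let batch = std::mem::take(&mut *pending.lock().await);
        for (key, delta) in batch {
            // One INCRBY per counter; batching these via a pipeline or Lua
            // script would cut the flush to a single round-trip.
            let _: redis::RedisResult<i64> = redis::cmd("INCRBY")
                .arg(&key)
                .arg(delta)
                .query_async(&mut conn)
                .await;
        }
    }
}
```

Keeping the lock only for the `mem::take` swap is what would keep the flush from stalling incoming requests, the property the current blocking write-behind task lacks.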
While this is probably the easiest way forward to address what we've now seen happening in AWS, there are a few items to address:
- `--ratio` on the command line…

Obviously, the list above contains small improvements that get us closer to something sustainable in such a deployment. A better caching strategy (rather than time-based expiry), e.g. Hi/Lo, would probably be preferable, as we know we are dealing with counters that tend towards a known limit. Eventually we should avoid the need for coordination across the different Limitador instances entirely, probably by leveraging CRDTs, like a grow-only counter, as sketched below.
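To make that last idea concrete, here is a minimal sketch of a grow-only counter (G-Counter); this is illustrative only, not Limitador code:

```rust
use std::collections::HashMap;

/// Minimal G-Counter sketch: each replica only ever increments its own slot,
/// so concurrent updates merge deterministically by taking per-slot maxima.
#[derive(Clone, Default, Debug)]
struct GCounter {
    // replica id -> that replica's monotonically increasing count
    slots: HashMap<String, u64>,
}

impl GCounter {
    /// A replica records its own hits; it never writes another replica's slot.
    fn increment(&mut self, replica: &str, by: u64) {
        *self.slots.entry(replica.to_string()).or_insert(0) += by;
    }

    /// Effective counter value: the sum over all replicas' slots.
    fn value(&self) -> u64 {
        self.slots.values().sum()
    }

    /// Merge two states by keeping the max of each slot.
    fn merge(&mut self, other: &GCounter) {
        for (replica, &count) in &other.slots {
            let entry = self.slots.entry(replica.clone()).or_insert(0);
            *entry = (*entry).max(count);
        }
    }
}
```

Because merge is commutative, associative and idempotent, replicas can exchange state in any order and at any cadence and still converge on the same total, which is what removes the need for coordination on the request path.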
In order to better understand the implications of leveraging the cloud providers' Redis instances, we want to build a latency distribution matrix for all the different providers, starting out with AWS's ElastiCache.
We need a distribution for an LRT that:
We then need to test these in multiple deployment scenarios:
That use:
- Performance tests
- Latencies in multiple deployment scenarios
- Using cloudping as latency data source
EDIT: no need to test further. We confirmed the expected understanding of Limitador's request-to-Redis model, i.e. the `N+1` model.

- Scenarios with transient gw & `redis_cached`
- Scenario with write-behind lock: https://github.com/Kuadrant/limitador/pull/276