TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0

Rate Limiting documentation not clear about header values and behaviour #5080

Closed nicholaspcr closed 2 years ago

nicholaspcr commented 2 years ago

Summary

The values of the rate limiting headers do not match the available documentation. The observed behaviour is described below.

Steps to Reproduce

  1. Log the result value here; something like log.FromContext(ctx).Debug("Result: ", result) would suffice.
  2. Log the values of the rate-limit headers in any of the CLI interceptors. Example below.
  3. Start stack with rate-limiting enabled.
  4. Create a gateway.
  5. Run go run ./cmd/ttn-lw-cli g get <gateway_id> with intervals of 10 seconds between each request. Script example below.
stack config - enables rate-limiting
tls:
  source: file
  root-ca: ./ca.pem
  certificate: ./cert.pem
  key: ./key.pem

rate-limiting:
  profiles:
    - name: HTTP servers
      max-per-min: 30
      associations:
        - http
    - name: Application downlink traffic
      max-per-min: 10
      associations:
        - as:down:web
        - as:down:mqtt
        - grpc:method:/ttn.lorawan.v3.AppAs/DownlinkQueuePush
        - grpc:method:/ttn.lorawan.v3.AppAs/DownlinkQueueReplace
    - name: Gateway connections
      max-per-min: 2
      associations:
        - gs:accept:mqtt
        - gs:accept:ws
        - grpc:stream:accept:/ttn.lorawan.v3.GtwGs/LinkGateway
    - name: gRPC API
      max-per-min: 3
      associations:
        - grpc:method
        - grpc:stream:accept
    - name: Override rate for uplink simulation
      max-per-min: 5
      associations:
        - grpc:method:/ttn.lorawan.v3.AppAs/SimulateUplink
    - name: Gateway uplink traffic
      max-per-min: 1000
      max-burst: 1500
      associations:
        - gs:up
Template script to run go run command multiple times
#!/bin/bash
for run in {1..5}; do
   go run ./cmd/ttn-lw-cli g get gtw1
   sleep 10
done
Example of logging rate-limiting values in interceptor
var md metadata.MD
logger := log.FromContext(ctx)
// Capture the response headers, which carry the rate-limit values.
err := invoker(ctx, method, req, reply, cc, append(opts, grpc.Header(&md))...)
available, _ := strconv.Atoi(md.Get("x-rate-limit-available")[0])
reset, _ := strconv.Atoi(md.Get("x-rate-limit-reset")[0])
retry, _ := strconv.Atoi(md.Get("x-rate-limit-retry")[0])
logger.Debug("AVAILABLE: ", available)
logger.Debug("RESET: ", reset)
logger.Debug("RETRY: ", retry)
return err

What do you see now?

In the CLI logs you can see that the x-rate-limit-available header has the same value twice within the same minute and that x-rate-limit-reset increases. The reduced version of the log only shows the values of the headers.

The server logs contain the values of the result and limit variables, which can be found at: https://github.com/TheThingsNetwork/lorawan-stack/blob/7a908e9f732cab737c36980fcde240e5cda7a942/pkg/ratelimit/grpc.go#L77

What do you want to see instead?

An x-rate-limit-reset value that represents how many seconds are left until the reset happens, and an x-rate-limit-available value that is unique to each request.

Environment

The Things Stack for LoRaWAN: ttn-lw-stack
Version:             3.17.0-dev
Go version:          go1.17.5
OS/Arch:             linux/amd64

How do you propose to implement this?

Investigate what is causing the discrepancies between the header values and the server's limiter values: check the expected behaviour of the library used for the limiter and/or audit the places where the values are manipulated, to see if anything could be altering them.

How do you propose to test this?

Manually test it. Since there are already unit tests for adding the limiter values to the headers, it might be worth doing something more end-to-end; I'm not sure whether that should be within the scope of this bug.

Can you do this yourself and submit a Pull Request?

Yes

nicholaspcr commented 2 years ago

What was initially thought to be a bug is in fact the expected behaviour of the GCRA algorithm implemented in the throttled package, meaning that our documentation gives information that leads to a wrong interpretation of how the limiting works.

When doing something like:

rate-limiting:
  profiles:
    - name: gRPC API
      max-per-min: 30
      max-burst: 10
      associations:
        - grpc:method
        - grpc:stream:accept

max-per-min sets the rate at which requests can be processed by the service, while max-burst defines a leeway on top of that rate: a number of requests that can momentarily exceed it. The leeway value is reported in the x-rate-limit-limit header.

The situation described in the issue suffices as an example: imagine a gRPC limitation with max-per-min equal to three and with max-burst not specified, meaning that it receives the value of max-per-min. That means gRPC can handle one request every 60/3 = 20 seconds and starts with a leeway of 3 requests. With this in mind, the following case might make it easier to visualize how it works.

Two requests are made instantly, consuming part of the leeway set by max-burst. x-rate-limit-available will now be at 1 and x-rate-limit-reset will report a value of 40, meaning that it will take 40 seconds to fully replenish the leeway. After waiting 20 seconds, the x-rate-limit-available count will increase by one; that only happens if no request is processed in that time.

So in theory the maximum number of requests that could be handled in one minute is max-burst + max-per-min, which is consistent with the behaviour found in the logs.
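To make this concrete, below is a minimal sketch that drives the GCRA limiter of the throttled package directly with the numbers from the example above. It assumes the throttled/v2 module path and an in-memory store; note that throttled itself permits MaxBurst + 1 back-to-back requests, so MaxBurst: 2 is used here to reproduce the leeway of 3. This is only an illustration, not the stack's actual wiring.

package main

import (
	"fmt"
	"log"

	"github.com/throttled/throttled/v2"
	"github.com/throttled/throttled/v2/store/memstore"
)

func main() {
	store, err := memstore.New(65536)
	if err != nil {
		log.Fatal(err)
	}
	// A rate of 3 requests per minute; throttled allows MaxBurst+1 = 3
	// requests back-to-back, matching the leeway of 3 described above.
	quota := throttled.RateQuota{MaxRate: throttled.PerMin(3), MaxBurst: 2}
	limiter, err := throttled.NewGCRARateLimiter(store, quota)
	if err != nil {
		log.Fatal(err)
	}
	// Two back-to-back requests: Remaining (x-rate-limit-available) drops
	// to 1 and ResetAfter (x-rate-limit-reset) grows to 40s.
	for i := 0; i < 2; i++ {
		limited, res, err := limiter.RateLimit("example-key", 1)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("limited=%v available=%d reset=%s retry=%s\n",
			limited, res.Remaining, res.ResetAfter, res.RetryAfter)
	}
}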

For more detailed information on the behaviour, the following links are good resources:

In short, the description of the x-rate-limit headers found here might lead to a wrong interpretation of how the rate limit works. After reflecting on the issue, the solution seems to just be a bit more elaboration on the behaviour of the rate limit in the documentation. But I wanted to know your opinion on the subject, @KrishnaIyer and @adriansmares.

To make the values of the headers more explicit:
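x-rate-limit-limit: the total leeway, i.e. how many requests can be made back-to-back.
x-rate-limit-available: how many requests can still be made at this moment.
x-rate-limit-reset: the number of seconds until the leeway is fully replenished (a gradual refill, not a hard reset).
x-rate-limit-retry: only set when x-rate-limit-available is 0; the number of seconds until the next single request is permitted.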

KrishnaIyer commented 2 years ago

So x-rate-limit-reset is the amount of time you have to wait until the available resets to the original max-per-min value i.e., 3 in your example?

Also, what about x-rate-limit-retry? That's not used then?

So in theory the maximum amount of requests that could be handled in one minute is max-burst + max-per-min, which is consistent with the behaviour found in the logs.

That doesn't sound right. From what I understand, the burst should be included within the max-per-min and if someone makes a burst request, it must be deducted from the available value for the current window.

Can you run some examples with request timestamps?

nicholaspcr commented 2 years ago

I was trying to avoid using the word reset because of the behaviour of the algorithm: there isn't a hard reset of the requests allowed per minute, but rather a slow refill of the number of requests that can be made. So if one user sleeps for the number of seconds specified in x-rate-limit-reset, it doesn't necessarily mean they will be able to make a request: as long as another user keeps making requests at the same pace as the recovery rate, the first user will not be able to make one.

The x-rate-limit-retry header only contains a value when x-rate-limit-available is 0; its value represents the amount of time one has to wait until x-rate-limit-available goes from 0 to 1 (when the next request will be permitted).
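For intuition, here is a simplified single-key sketch of the GCRA bookkeeping (an illustration only, with assumed names; not the stack's or throttled's actual code) showing where available, reset and retry come from. The theoretical arrival time (tat) is the only state kept per key.

package main

import (
	"fmt"
	"time"
)

type gcra struct {
	emission time.Duration // period / rate, e.g. 20s for 3 req/min
	limit    int           // maximum back-to-back requests (the leeway)
	tat      time.Time     // theoretical arrival time of the next request
}

// request reports whether a request at time now is allowed, plus the values
// that would end up in the available/reset/retry headers.
func (g *gcra) request(now time.Time) (ok bool, available int, reset, retry time.Duration) {
	tolerance := g.emission * time.Duration(g.limit)
	if g.tat.Before(now) {
		g.tat = now
	}
	newTAT := g.tat.Add(g.emission)
	if newTAT.Sub(now) > tolerance {
		// Denied: retry once one emission interval's worth of leeway refills.
		return false, 0, g.tat.Sub(now), newTAT.Sub(now) - tolerance
	}
	g.tat = newTAT
	used := int(g.tat.Sub(now) / g.emission)
	return true, g.limit - used, g.tat.Sub(now), 0
}

func main() {
	g := &gcra{emission: 20 * time.Second, limit: 3} // 3 req/min, leeway of 3
	now := time.Now()
	for i := 0; i < 4; i++ { // the 4th request is denied with retry = 20s
		fmt.Println(g.request(now))
	}
}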

So in theory the maximum amount of requests that could be handled in one minute is max-burst + max-per-min, which is consistent with the behaviour found in the logs.

That doesn't sound right. From what I understand, the burst should be included within the max-per-min and if someone makes a burst request, it must be deducted from the available value for the current window.

From the comments in throttled's code, MaxBurst defines the number of requests that are allowed to exceed the rate in a single moment; it doesn't imply that the rate at which requests become available is affected as a result.

Test
Logs

burst.txt burst_reduced.txt

Script
#!/bin/bash
for run in {1..8}; do
  go run ./cmd/ttn-lw-cli g get gtw1
done
for run in {1..6}; do
  go run ./cmd/ttn-lw-cli g get gtw1
  sleep 12 # recovery rate  -> 60s / 5req-per-min
done
config used
tls:
  source: file
  root-ca: ./ca.pem
  certificate: ./cert.pem
  key: ./key.pem
rate-limiting:
  profiles:
    - name: gRPC API
      max-per-min: 5
      max-burst: 7
      associations:
        - grpc:method
        - grpc:stream:accept
Observations

From the script it is established that the first 9 requests are made without any delay between them, which goes beyond the MaxBurst of 7 specified in the stack config. One would expect the 8th request to be the one to fail, but in the logs it is the 9th request that presents the error; observing the time difference between the requests makes it easier to see why.

The logs follow this logic: the 8th request doesn't fail, while the 9th does, since it's made without any delay. The following requests don't fail thanks to the 12-second sleep, meaning that the server always makes another request available within that time span.

In short, between 2022-01-13 09:57:19.900649828 and 2022-01-13 09:57:47.708527296, 10 requests are made.
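As a rough sanity check (assuming a leeway of exactly max-burst and a refill of one request per 60/5 = 12 seconds), the numbers line up:

window:  09:57:19.9 -> 09:57:47.7, roughly 28s, 10 attempts
burst:   7 requests' worth of leeway (max-burst)
refill:  floor(28 / 12) = 2 requests replenished during the window
total:   7 + 2 = 9 accepted out of 10, so exactly one attempt (the 9th) fails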

KrishnaIyer commented 2 years ago

So I read a little bit here and you're partially correct.

Here's what happens (I've verified this with testing):

  1. MaxPerMin -> This sets the request period. E.g., if max-per-min is 5, then 60/5 = 12s: a new request can only be made once every 12 seconds.
  2. MaxBurst -> However, this creates an exception to point 1. This is the number of requests that can be made ignoring the MaxPerMin rate. So if max-burst is 7, you can make 7 requests continuously within the first 12s without any gap. But after you make a burst of requests and exhaust your burst limit, you have to: a. wait for X-Rate-Limit-Retry until you can make the next single request, or b. wait for X-Rate-Limit-Reset until you can make the next max burst request (i.e., make 7 consecutive requests).

So effectively, the max no of requests per minute is

max-per-min + max-burst - 1 

(because one of the requests is part of the burst).
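A hedged sketch of the timeline this implies for the max-per-min: 5, max-burst: 7 example (assuming the refill of one request per 12s starts once the burst is spent):

t = 0s:               7 requests back-to-back (the burst)
t = 12, 24, 36, 48s:  one request every 12s (the rate)
first minute total:   7 + 5 - 1 = 11, since the rate slot at t = 0 is consumed by the burst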

I think we made a mistake in our code. I don't know why we set the default burst to max-per-min - 1: https://github.com/TheThingsIndustries/lorawan-stack/blob/fd2aba958be1625bec5e661af5c1d8fb416872a5/pkg/ratelimit/config.go#L93

Because now, if the max-per-min is 60 and we leave the max-burst empty, we allow up to 60 requests in one shot and then one request each second for the next 59 seconds. However, the client must then wait (do nothing) for the next 60s before it can send the next burst of 60.

So the default max-burst must be 1, which effectively disables bursting, and we need to selectively allow bursting for some RPCs.

Also btw, it's theoretically an invalid setting to have max-burst > max-per-min since that violates the max constraint.

What do you think @adriansmares ?

adriansmares commented 2 years ago

The throttled godoc explains the meaning of the headers, as we copy those directly: https://github.com/throttled/throttled/blob/ef1aa857b069ed60f6f859f8b16350e5b7c8ec96/rate.go#L29-L58
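For reference, a hedged sketch of that copying, with setRateLimitHeaders as a hypothetical helper name: the field names come from throttled's RateLimitResult, while the stack's actual wiring lives in pkg/ratelimit and may differ.

package ratelimit

import (
	"context"
	"strconv"

	"github.com/throttled/throttled/v2"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// setRateLimitHeaders copies a throttled.RateLimitResult into the
// x-rate-limit-* gRPC response headers (hypothetical helper).
func setRateLimitHeaders(ctx context.Context, res throttled.RateLimitResult) error {
	md := metadata.Pairs(
		"x-rate-limit-limit", strconv.Itoa(res.Limit),
		"x-rate-limit-available", strconv.Itoa(res.Remaining),
		"x-rate-limit-reset", strconv.Itoa(int(res.ResetAfter.Seconds())),
		"x-rate-limit-retry", strconv.Itoa(int(res.RetryAfter.Seconds())),
	)
	return grpc.SetHeader(ctx, md)
}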

I think we make a mistake in our code. I don't know why we set the default burst to max-per-min -1; https://github.com/TheThingsIndustries/lorawan-stack/blob/fd2aba958be1625bec5e661af5c1d8fb416872a5/pkg/ratelimit/config.go#L93

Because lacking a burst value results in a terrible user experience. Outside of the APIs which are on the hot path (i.e. AS webhook downlinks), you don't really use them at a steady state - every once in a while, you log into the Console, do 10 different API calls in parallel, finish looking and leave. Most relevant workloads are actually just bursts - imagine having to wait 5 seconds while switching between pages in the Console end device view.

Because now, if the max-per-min is 60 and we leave the max-burst empty, we allow upto 60 requests in 1 shot and then one request each second for the next 59 seconds. However, the client must then wait (do nothing) for the next 60s before it can send the next burst of 60.

But why is this a bad thing? I don't follow why respecting the rate after spending the burst is wrong.

So the default max-burst must be 1 which effectively disables bursting and we need to selectively allow bursting for some RPCs.

I would invert this - lower the burstable amount per period for the APIs which require it. Not having bursting by default would be terrible, as you would need to wait for the period between each call, even if you just need a quick answer and that's it.

KrishnaIyer commented 2 years ago

I don't follow why after spending the burst, respecting the rate is wrong.

Because the effective rate is max-per-min + max-burst - 1, which defeats the purpose of calling the first parameter max-per-min. That parameter should've been called req-per-min but we're beyond that point.

I would invert this - lower the burstable amount per period for the APIs which require it

I second this.

Not having bursting by default would be terrible, as you would need to wait for the period between each call,

That's not true in all cases. For example, with gateway claiming, we allow 20 claims/min. I don't see why one needs to make 20 requests in one shot and then be rate-limited. I'd rather spread out these calls and have clients add a delay between calls for batch operations. Especially since the Semtech RJS does not handle floods very well.

I think we both agree that some APIs should allow bursting. We just have a different idea for the default case. I think bursting should be allowed conservatively and your idea is that we allow bursting by default and clamp where necessary. I think they're both equally valid alternatives, so I'm fine with keeping the burst behaviour we have now.

nicholaspcr commented 2 years ago

The throttled godoc explains the meaning of the headers, as we copy those directly: https://github.com/throttled/throttled/blob/ef1aa857b069ed60f6f859f8b16350e5b7c8ec96/rate.go#L29-L58

The description in the throttled comments and our documentation are quite similar. The cause of my confusion was that what I expected the headers to be didn't exactly match the max-per-min configuration field, but it is as @KrishnaIyer mentioned:

Because the effective rate is max-per-min + max-burst - 1, which defeats the purpose of calling the first parameter max-per-min. That parameter should've been called req-per-min but we're beyond that point.

Both ideas for the default case are sensible, but I'm personally more inclined to keep the current burst behaviour, since changing it to a more conservative burst policy might be an unpleasant surprise for users who haven't specified the burst amount but might be relying on it for bursts that slightly exceed the request rate.

@adriansmares and @KrishnaIyer, with all of this in mind I'm wondering if there is in fact something that should be changed. From my perspective the header documentation does give enough information. I don't think things like the algorithm used for the rate limiting are necessary in the documentation, since its objective is to clarify the usage of The Things Stack.

adriansmares commented 2 years ago

Because the effective rate is max-per-min + max-burst - 1, which defeats the purpose of calling the first parameter max-per-min. That parameter should've been called req-per-min but we're beyond that point.

I see it now. Correct.

@adriansmares and @KrishnaIyer, with all of this in mind I'm wondering if there is in fact something that should be changed. From my perspective the header documentation does give enough information. I don't think things like the algorithm used for the rate limiting are necessary in the documentation, since its objective is to clarify the usage of The Things Stack.

I indeed think that the only thing that should be done here is to pay brief lip service to the GCRA algorithm (i.e. add a link) in the docs; otherwise there is no bug here.

As part of our operations we may selectively turn off bursting for certain calls - we have https://github.com/TheThingsIndustries/lorawan-stack-aws/issues/544 for that.