Closed. htdvisser closed this issue 3 years ago.
In https://github.com/TheThingsIndustries/lorawan-stack/issues/546#issuecomment-500902816 @johanstokking suggested the following classes:
** including collaborator changes, API key changes, etc, which may need revocation, email notifications, etc
Network operators must have insight into why the server rejected requests because of rate limits.
Adding @neoaggelos to look into rate limiting on gateway connections to prevent bugs in the gateways (sending uplinks/statuses in a loop) from taking down the GS
@neoaggelos we need a flood check in the UDP firewall. This could be a simple ring buffer of the last 10 messages or so, and if the oldest (next) passed message is within a certain interval (on the order of tens of milliseconds), don't pass the message.
Do we not need something similar for the other frontends as well (mqtt/grpc/basicstation)? If so, the flood check maybe has to be implemented at the GS level (or at some point in between, but it should be frontend-agnostic).
For resending a bunch of messages from downstream (i.e. buffered on gateways, satellite groundstations, etc, see https://github.com/TheThingsNetwork/lorawan-stack/issues/2708), we do need to allow for a temporary burst and/or throttling, without returning errors.
I think we should turn this around and encourage any gateway maker to actually have some rate limiting on their side
I think we should turn this around and encourage any gateway maker to actually have
There’s some packet loss here; are you suggesting to move the responsibility to the gateway maker to not burst?
I think we should make it as easy as possible for the edge; if there’s a window of sending data in bulk, let them do it. We just need to figure it out with a message queue, potentially already in GS.
That's what I meant, but I pressed enter too quickly.
In retrospect, if the sending window is limited it makes sense to allow them to send everything at once, but maybe we could think of a special infrastructure endpoint (i.e. dedicated instances) for this - the PPS is very, very large from these buffered gateways and it would be a pity to have GS live-traffic performance be affected by this. I can imagine that satellite communications are almost never going to be live.
There's a difference between server-to-server upload of a lot of frames, i.e. from a satellite ground station, versus a single gateway that has a local buffer which it flushes.
For those ground stations, they should be using Packet Broker anyway, and not act as a gateway. Rate limiting works very differently there.
As for individual gateways that are flushing their local buffers, this should be handled by GS.
As discussed during the call today, we'll come up with separate APIs for bulk-offloading buffered uplinks. This means that we can now go back to the original issue, and implement a generic ratelimit package that lets us limit:
Depending on the place, we should think about what to do before/when rate limited:
As for configuration, I guess we can start with some "fallback" configurations of "[N] calls per [duration] per [vary]". We can add more specific limits for specific methods (lower limits on security-sensitive endpoints). Perhaps also nice if we can add dynamic limits (so that we can allow more calls per hour from specific IP addresses, or respond to a (D)DoS by tweaking limits for some endpoints).
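As a purely hypothetical sketch of what such configuration could look like (none of these keys exist in the actual stack; method names and values are placeholders), a fallback plus per-method overrides might be shaped like this:

```yaml
# Hypothetical sketch only; not the actual lorawan-stack config schema.
rate-limiting:
  fallback:
    limit: 100      # [N] calls
    window: 1m      # per [duration]
    vary: [ip]      # per [vary]
  methods:
    "/ttn.lorawan.v3.SomeService/SomeMethod":
      limit: 10     # stricter limit for a security-sensitive endpoint
      window: 1m
      vary: [ip, token]
```

Dynamic limits (per-IP allowances, emergency tweaks during a (D)DoS) would then amount to overriding entries of this structure at runtime.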
Moving this to Next Up until everything is clear, but please, @neoaggelos start working on this ASAP.
After looking into this issue in depth, and taking into consideration the existing comments and pointers, I think the implementation that makes the most sense is a `ratelimiting` package, possibly based on Throttled. It can use either local-memory or Redis record keeping. We should not enforce either (rather, make sure that both can be used).
Currently, I want to avoid making too strict decisions as to how configuration is applied. Let's start with something very basic, and then shape it up as the actual implementation requirements become more clear.
What is the status and action item here @neoaggelos ?
While working on this I have had to look at many parts of the codebase, and figure out how best to de-duplicate the rate limiting functionality. The cleaner solution so far seems to be to have a `pkg/ratelimit` package, and to use separate config per component/listener to set up rate limiting, instead of configuring rate limiting under a single section.
A challenging issue I have had to deal with during early iterations: The issue description and the comments suggest a complex rate limiting behaviour (e.g. rate limiting by IP address, request path and auth token ID). However, these checks should happen at different parts of the lifecycle of a request to make sense (e.g. rate limiting by IP address should be made as early as possible, before having to process the auth token ID).
A rough TODO:
- A `ratelimit` package, implementing the rate limiting logic, as well as rate limiter interfaces (for gRPC, echo, http)
- Configuration for the `ratelimit` package (but we should avoid breaking config changes)
- Rate limiting for unary gRPC calls (`UnaryServerInterceptor`)
- Rate limiting for streaming gRPC calls (`StreamServerInterceptor`)

Some work so far can be found in https://github.com/TheThingsNetwork/lorawan-stack/tree/feature/rate-limiting. @johanstokking @htdvisser please have a look if possible and comment on the general direction of the implementation on a high level. I could create a draft PR, but I think it's too early for that, especially if major changes are required.
Please open a draft PR so we can comment.
I do like that there's a `pkg/ratelimit` with common functionality that can be used by various places in the stack.
Note that the gateway (traffic) rate limiting is actually the least of our concerns now: we already have gateway upstream handlers that have implicit rate limiting, and we have the UDP firewall. That being said, I think it's a good idea to replace the current UDP firewall with two generic rate limiting components: limiting calls by remote IP and limiting hits (messages) per session (UDP remote address). What is necessary, and what is not covered by GS at the moment, is incoming connection rate limiting per remote IP.
@neoaggelos what is the status here?
@neoaggelos #3920 is merged, can this be closed?
With #3920 merged, The Things Stack has basic support for rate limiting.
Things that we still need to cover and/or consider:
Similar to the rest of the stack, apply rate limiting for downlinks that arrive via PubSubs (https://github.com/TheThingsNetwork/lorawan-stack/blob/v3.12/pkg/applicationserver/io/pubsub/pubsub.go#L180). The key can be `as:down:pubsub:app:<appUID>:pubsub:<pubsubID>`, and the rate limiting class can be `as:down:pubsub`.
Consider having separate rate limiting for HTTP requests before and after authentication. Currently, rate limiting is only applied after authentication, which does not protect from auth requests flooding the Identity Server.
Consider throttling connections (for Websockets, MQTT, gRPC, PubSubs) instead of terminating after the rate limit has been exceeded. The implementation here can be adding a static delay as penalty for exceeding the rate limit, similar to what happens with V2.
Consider ways to simplify configuration. The rate limiting profile configuration is documented (see https://github.com/TheThingsIndustries/lorawan-stack-docs/pull/290), but it may still not be straightforward to set up. My idea would be to keep the `profiles` section, where people can set advanced configuration, but also introduce a new section (e.g. `default`) where only a few constants need to be set. cc @KrishnaIyer, please propose how this config section could look.
Enable configuring IP address ranges (or subnets) for which rate limiting is not applied. This is to account for "internal" traffic or "trusted" IP addresses.
For the implementation, we could extend the `ratelimit.Profile` configuration with something like:

```go
ExcludeIPAddresses: []struct{
	Start net.IP
	End   net.IP
	// OR
	Network net.IPNet
}
```

We can then add a `(ratelimit.Resource).IPAddress() (net.IP, bool)` method (returning the source IP and a boolean indicating whether a source IP makes sense for the particular resource), and then use that from `(*ratelimit.Profile).RateLimit` to return immediately if this is an excluded IP address.
cc @htdvisser, @johanstokking. Let's turn this into an umbrella issue going forward, I will create issues for each task as needed, but let's discuss first.
1. Rate limiting downlinks from PubSubs
We're getting rid of them in TTSC and TTN. They remain for TTSOS but we don't need rate limiting for that.
2. Consider having separate rate limiting for HTTP requests before and after authentication
This is the job of our front proxy.
3. Consider throttling connections (for Websockets, MQTT, gRPC, PubSubs) instead of terminating after the rate limit has been exceeded
Yes please.
4. Consider ways to simplify configuration
Haven't looked at it yet. But how about sane defaults?
5. Enable configuring IP address ranges (or subnets) for which rate limiting is not applied
We need this too. You probably just want a slice of IP masks.
@neoaggelos can you please post a status update here? What's done and what's still TODO?
We recently had quite some issues with MQTT on V2, so I'm also interested in hearing about MQTT connection limits. Can we already limit the number of active MQTT connections (per username, per IP)?
For Community and Cloud, we do limit rate of new connections. Limiting the maximum number of concurrent connections is basically coming with https://github.com/TheThingsIndustries/lorawan-stack/pull/2879
- Enable configuring IP address ranges (or subnets) for which rate limiting is not applied
We need this too. You probably just want a slice of IP masks.
This has been implemented in https://github.com/TheThingsNetwork/lorawan-stack/tree/feature/rate-limiting-config-subnets, along with unit tests to verify the behaviour. If the implementation seems good enough, I can create a PR for it (otherwise, someone else can pick it up).
Summary
We have to add rate limiting to our API endpoints.
Transferred from https://github.com/TheThingsIndustries/lorawan-stack/issues/546
Why do we need this?
What is already there? What do you see now?
We currently don't have any rate limits.
What is missing? What do you want to see?
How do you propose to implement this?
For the actual implementation we can just use an existing library like https://github.com/throttled/throttled, which already has support for both in-memory and shared Redis record keeping.
The "key" for rate limiting would be `<class>:<limitBy>:<IP/AccessTokenID/TargetEntityID>`. We still need to define the rate limiting classes.

For (1): when we have #443, valid access tokens are cached. This means that we can put the rate limiting on the func that verifies non-cached access tokens (the `token` class above). We also need to put IP rate limiting on the username/password login of the OAuth server (the `login` class above).
For (2) we need to define for each RPC whether we limit by access token ID or by target entity ID, and which rate limiting class should be applied.