DDoS prevention - Githubissues

chemicL commented 5 years ago

Our experience building a control plane out of java-control-plane has shown that it's easy to overwhelm the service if there's many Envoys querying the control plane for many services.

Especially in the bootstrap phase, when deploying multiple Envoys, they won't be able to serve as a proxy if they don't get the initial configuration. Killing the control plane leads to even worse consequences, where the cluster cannot be stabilised at all. We dealt with the issues by whitelisting services and using ADS to limit the number of connections.

However, in case of some error on our side (or Envoy's), we might end up with far too many streams/requests being issued to the control plane than an instance can handle.

I'm suggesting adding three things:

rate limiting of concurrent requests handled by java-control-plane
limiting number of streams that can be open at the same time
utilising HTTP2's flow control by manually controlling transmission

The goal of the above is simply to make the SPoF component (which java-control-plane might end up being) resilient in spite of unexpected rise of demand. This also prevents cascading failures, where a misconfiguration in part of the infrastructure could kill the entire cluster.

I'm happy to submit a PR with these changes if you think this is something worth introducing here. Otherwise, we'd need to keep a fork/copy of DiscoveryServer.

ramaraochavali commented 5 years ago

I agree adding rate limiting of concurrent requests handled by java-control-plane is reasonable idea. Have you tried experimenting with rate limiting discovery requests from Envoy side using these settings https://github.com/envoyproxy/envoy/blob/02ed57d648d82c5e5ef72070b145d2a28c9b404c/api/envoy/api/v2/core/config_source.proto#L67?

chemicL commented 5 years ago

Thanks for pointing that out. The idea here is to make the control plane resilient and not relying on any external configurations. If a "malicious" client does not configure rate limiting then we're still at risk. There's many approaches one can suggest, such as adding another Envoy in front of the control plane, but the idea here is simple - java-control-plane should simply be able to prevent itself from unexpected burst in queries.

ramaraochavali commented 5 years ago

I completely agree and not opposed to adding to control plane. Just pointing out that there is a feature in Envoy you can make use of from Envoy perspective as well. Relevant discussions in Envoy https://github.com/envoyproxy/envoy/issues/4718 and https://github.com/envoyproxy/envoy/issues/2169

chemicL commented 5 years ago

Thanks for clarifying and referencing the discussions 👍 I see there's been suggestions for adding this feature to the control plane before.

snowp commented 5 years ago

Seems like a bit of this is things you can configure on the server itself, by setting various executor parameters or installing server side interceptors. The manual flow control suggestion definitely makes sense as being part of this code base though.

chemicL commented 5 years ago

I'm keen to find out which of the proposed changes can be configured using executor parameters. We've been trying to figure out a smooth way to implement this but any changes to the executors lead to undesired behaviour. Only utilizing the HTTP2 flow control were we able to actually prevent the app from growing the RAM after being bombarded with requests. I haven't found a different way to handle back-pressure in a client friendly manner, but perhaps it can be done?

chemicL commented 5 years ago

I'm closing this issue. Due to gaining more experience running our control plane here's the reasons this feature makes not much sense:

global connection limit enforcement in case of a misbehaving client would cut off healthy clients trying to connect
rate limiting, either global or per client would be a static value, which would depend on measured access patterns and could become a pain point, plus has the same drawback of the above
per-client is not a feasible solution either and would need to sum up to some global value as well

What we suggest instead is proper monitoring, alerting, and autoscaling of the control-plane.

envoyproxy / java-control-plane

DDoS prevention #101