Closed chemicL closed 5 years ago
I agree adding rate limiting of concurrent requests handled by java-control-plane is reasonable idea. Have you tried experimenting with rate limiting discovery requests from Envoy side using these settings https://github.com/envoyproxy/envoy/blob/02ed57d648d82c5e5ef72070b145d2a28c9b404c/api/envoy/api/v2/core/config_source.proto#L67?
Thanks for pointing that out. The idea here is to make the control plane resilient and not relying on any external configurations. If a "malicious" client does not configure rate limiting then we're still at risk. There's many approaches one can suggest, such as adding another Envoy in front of the control plane, but the idea here is simple - java-control-plane should simply be able to prevent itself from unexpected burst in queries.
I completely agree and not opposed to adding to control plane. Just pointing out that there is a feature in Envoy you can make use of from Envoy perspective as well. Relevant discussions in Envoy https://github.com/envoyproxy/envoy/issues/4718 and https://github.com/envoyproxy/envoy/issues/2169
Thanks for clarifying and referencing the discussions 👍 I see there's been suggestions for adding this feature to the control plane before.
Seems like a bit of this is things you can configure on the server itself, by setting various executor parameters or installing server side interceptors. The manual flow control suggestion definitely makes sense as being part of this code base though.
I'm keen to find out which of the proposed changes can be configured using executor parameters. We've been trying to figure out a smooth way to implement this but any changes to the executors lead to undesired behaviour. Only utilizing the HTTP2 flow control were we able to actually prevent the app from growing the RAM after being bombarded with requests. I haven't found a different way to handle back-pressure in a client friendly manner, but perhaps it can be done?
I'm closing this issue. Due to gaining more experience running our control plane here's the reasons this feature makes not much sense:
What we suggest instead is proper monitoring, alerting, and autoscaling of the control-plane.
Our experience building a control plane out of java-control-plane has shown that it's easy to overwhelm the service if there's many Envoys querying the control plane for many services.
Especially in the bootstrap phase, when deploying multiple Envoys, they won't be able to serve as a proxy if they don't get the initial configuration. Killing the control plane leads to even worse consequences, where the cluster cannot be stabilised at all. We dealt with the issues by whitelisting services and using ADS to limit the number of connections.
However, in case of some error on our side (or Envoy's), we might end up with far too many streams/requests being issued to the control plane than an instance can handle.
I'm suggesting adding three things:
The goal of the above is simply to make the SPoF component (which java-control-plane might end up being) resilient in spite of unexpected rise of demand. This also prevents cascading failures, where a misconfiguration in part of the infrastructure could kill the entire cluster.
I'm happy to submit a PR with these changes if you think this is something worth introducing here. Otherwise, we'd need to keep a fork/copy of DiscoveryServer.