envoyproxy / java-control-plane

Java implementation of an Envoy gRPC control plane
Apache License 2.0

Clusters versioning #89

Closed: jakubdyszkiewicz closed this issue 5 years ago

jakubdyszkiewicz commented 5 years ago

Hello,

At Allegro, we are building our control plane on top of java-control-plane. We integrated it with Consul Discovery Service, which in our dev environment has about 800 unique services and 2000 service instances. In our tests we observed performance issues with the provided implementation.

A single instance of our control plane with 2 GB of memory and 2 cores struggles to handle 100 connected Envoys. We’ve diagnosed that the main problem is Snapshot versioning. Right now, there is one version for all endpoints. When even one service instance changes (is added or removed), the version of all endpoints changes and all endpoints have to be sent to all Envoys. This causes a huge spike of tens of thousands of jobs in the gRPC executor and creates a huge number of objects on the heap. Consequently, the GC runs so often that the application can spend as much as 40% of its time in GC.

We’ve introduced a PoC change in java-control-plane that lets us version every cluster in EDS individually. That means that if one instance changes, we only send the state of one cluster. After this change, our control plane can easily handle those 100 Envoys on the same machine with ~10% CPU usage.
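
To illustrate the idea of per-cluster versioning, here is a minimal, self-contained sketch. It is not the code from the linked commit and it does not use any java-control-plane API: `ClusterVersionTracker`, `update`, and `versionOf` are hypothetical names, and endpoints are modeled as plain `host:port` strings. It only shows how keeping one version per cluster lets a control plane find the clusters that actually changed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of java-control-plane. */
final class ClusterVersionTracker {
    // Last seen endpoint set for each cluster (sorted, so order does not matter).
    private final Map<String, Set<String>> endpoints = new HashMap<>();
    // Independent version counter for each cluster.
    private final Map<String, Long> versions = new HashMap<>();

    /**
     * Records the latest discovery state and returns the names of the
     * clusters whose endpoint set differs from the previous state.
     * Only those clusters get a new version.
     */
    List<String> update(Map<String, List<String>> endpointsByCluster) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : endpointsByCluster.entrySet()) {
            Set<String> current = new TreeSet<>(entry.getValue());
            Set<String> previous = endpoints.put(entry.getKey(), current);
            if (!current.equals(previous)) {
                versions.merge(entry.getKey(), 1L, Long::sum);
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    /** Returns the current version of a cluster, or 0 if it was never seen. */
    long versionOf(String cluster) {
        return versions.getOrDefault(cluster, 0L);
    }
}
```

With a single global version, every call to `update` that changes anything would invalidate all clusters. Here only the names returned by `update` would need to be re-sent. The sketch ignores removed clusters and thread safety, both of which a real implementation would have to handle.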

An initial proposal for the change can be found at: https://github.com/jakubdyszkiewicz/java-control-plane/commit/e96f23afe4cb45698a53b58d394e4d8a524310c0

Can we prepare a proper PR from it? If you have suggestions on how to solve this in a better way, please let us know. We’d also appreciate hearing about your experience with java-control-plane’s scalability.

snowp commented 5 years ago

Our production instances each serve about 1000 Envoys with 4 cores and 4 GB of memory, though endpoint changes are not that frequent because we use several different node groups and whitelist which services each Envoy is able to connect to, keeping each snapshot small. We've seen no perf issues, so we can probably go higher than 1000.
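
As a rough illustration of why whitelisting keeps snapshots small, the sketch below filters the full endpoint map down to the services a node group may reach. It is an assumption-laden example rather than the setup described above: `NodeGroupSnapshots`, `snapshotFor`, and the string-keyed group names are hypothetical and do not come from java-control-plane.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of java-control-plane. */
final class NodeGroupSnapshots {
    // Assumed allowlist: node group name -> services that group may connect to.
    private final Map<String, Set<String>> allowedByGroup;

    NodeGroupSnapshots(Map<String, Set<String>> allowedByGroup) {
        this.allowedByGroup = allowedByGroup;
    }

    /** Returns only the clusters that the given node group is allowed to reach. */
    Map<String, List<String>> snapshotFor(String group,
                                          Map<String, List<String>> allEndpoints) {
        Set<String> allowed =
            allowedByGroup.getOrDefault(group, Collections.<String>emptySet());
        Map<String, List<String>> result = new HashMap<>();
        for (String service : allowed) {
            List<String> serviceEndpoints = allEndpoints.get(service);
            if (serviceEndpoints != null) {
                result.put(service, serviceEndpoints);
            }
        }
        return result;
    }
}
```

A change to a service outside a group's allowlist leaves that group's filtered map untouched, so Envoys in that group would not need an update.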

As for your suggestion, it sounds reasonable as long as we make it possible to use the old method of versioning the entire EDS portion.

jakubdyszkiewicz commented 5 years ago

Although we finally went with ADS and small service subsets, I can see that @sschepens finished the work in https://github.com/envoyproxy/java-control-plane/pull/94, so I'm closing this issue.