Closed by jakubdyszkiewicz 5 years ago
Merging #98 into master will decrease coverage by 0.58%. The diff coverage is 77.77%.
@@             Coverage Diff              @@
##             master      #98      +/-   ##
============================================
- Coverage     94.47%   93.88%    -0.59%
- Complexity      134      140        +6
============================================
  Files            14       17        +3
  Lines           525      540       +15
  Branches         46       46
============================================
+ Hits            496      507       +11
- Misses           21       25        +4
  Partials          8        8
Impacted Files | Coverage Δ | Complexity Δ
---|---|---
...nvoyproxy/controlplane/server/DiscoveryServer.java | 96.58% <100%> (+0.08%) | 14 <1> (ø)
...er/serializer/DefaultProtoResourcesSerializer.java | 100% <100%> (ø) | 2 <2> (?)
...ne/server/serializer/ProtoResourcesSerializer.java | 33.33% <33.33%> (ø) | 1 <1> (?)
...ver/serializer/CachedProtoResourcesSerializer.java | 71.42% <71.42%> (ø) | 3 <3> (?)
...va/io/envoyproxy/controlplane/cache/Resources.java | 76.92% <0%> (ø) | 16 <0> (ø)
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b29dc60...136f04e.
Well sure, we could also cache the list, but I think the protos are the main concern here because:
1) They're large. One full proto is about ~250KB in our case, so 500 Envoys = 500 protos = 125MB without caching versus 250KB with it. So with every instance change (let's assume one every 5s) we allocate 125MB instead of 250KB.
2) Generating the proto is not cheap.
It would also seem we don't really make use of the resources stored in latestResponse; does it make sense to keep them in memory?
The rest of the DiscoveryResponse aside from the resources is fairly cheap to keep, I think. The side benefit of keeping resources in memory is that we can cache them in a Guava weak-values cache and be sure they stay cached while the response is being sent to all Envoys.
With keeping resources in the last response:
1) The response is sent to the first Envoy; it is kept in the last response and stored in the cache.
2) GC comes in, but it does not wipe the cache entry because a reference is still held in latestResponse.
3) The response is sent to the second Envoy using the cache.
...
300) The response is sent to the 300th Envoy and it is still using the cached proto.
Without keeping resources in the last response:
1) The response is sent to the first Envoy; it is not kept in the last response, but it is stored in the cache.
2) GC comes in and wipes the cache entry because nobody holds the proto at that moment.
3) The response is sent to the second Envoy and has to be generated again.
...
300) The response is sent to the 300th Envoy and who knows how many protos we have generated.
I think we can try to optimise the list and latestResponse, but I don't feel we will see a huge improvement like with this one. How about in another task and PR?
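To illustrate the retention behaviour described in the two scenarios above, here is a minimal, hypothetical sketch (class and variable names are illustrative, not java-control-plane code) showing that an entry in a Guava weakValues() cache survives GC only while a strong reference, such as the one held in latestResponse, is still alive:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Hypothetical demonstration: a weak-values cache entry stays cached only while
// something else still holds a strong reference to the value.
public final class WeakValuesRetentionDemo {
  public static void main(String[] args) {
    Cache<String, byte[]> cache = CacheBuilder.newBuilder().weakValues().build();

    byte[] latestResponse = new byte[250_000];   // strong reference, like lastResponse
    cache.put("snapshot-v1", latestResponse);

    System.gc();
    // Still cached: latestResponse keeps the value strongly reachable.
    System.out.println(cache.getIfPresent("snapshot-v1") != null);  // expected: true

    latestResponse = null;                       // drop the only strong reference
    System.gc();                                 // GC is only a hint, so this is best-effort
    // The entry may now be gone and the proto would have to be regenerated.
    System.out.println(cache.getIfPresent("snapshot-v1") != null);  // likely: false
  }
}
```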
Seems right, though we would probably want to be able to turn this off?
Sure, I can do this.
Updated the PR. The new behaviour is now optional.
@snowp @joeyb Could you take a look at this? :)
Thinking about this some more: should we instead just be storing the Any protos in the SnapshotCache directly? So instead of storing Message and then caching the Message -> Any conversion, we store them as Any, and then provide a helper method to convert the internal Any back to a Message for a nicer API.
Interesting idea, but I think this is a huge change. To not break the API we would have to provide conversion methods, just like you said. What if someone relies on getting resources from the Snapshot quickly? Resources would have to be converted back every time they are read.
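For context, a rough sketch of the per-read conversion that storing Any in the snapshot would imply, using protobuf's standard Any.pack/unpack; the helper class and method names here are made up for illustration:

```java
import com.google.protobuf.Any;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Message;

// Illustrative helper: if the snapshot stored Any internally, every typed read
// would unpack (re-parse) the payload, which is the read-perf cost discussed above.
final class AnySnapshotSketch {

  static Any store(Message resource) {
    return Any.pack(resource); // done once, when the snapshot is built
  }

  static <T extends Message> T read(Any stored, Class<T> type) {
    try {
      return stored.unpack(type); // done on every typed read
    } catch (InvalidProtocolBufferException e) {
      throw new IllegalStateException("Snapshot contained an unexpected resource type", e);
    }
  }
}
```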
I'm not opposed to breaking APIs if there's a good reason behind it, and I don't really think it's a big change. That said, it is more intrusive than what you're proposing, and it does affect the read perf (I don't think anyone cares, but who knows).
My main concern with this approach is the complexity of cache retention and how hard it is to test, but I think if you put down scary enough comments everywhere it could be okay. Worst case, it falls back to the current behavior, so it's not high risk.
Actually, I realized that we do care about read perf. We create the snapshot with all services (some of the Envoys require that) and then use this snapshot to generate smaller snapshots for all the groups.
The weak-values cache is only one possible implementation. We can also provide a regular cache implementation with time-based retention or a fixed number of messages. If you are not comfortable with this implementation and feel that it's hacky, we can move it to our own code, as long as there is an interface we can code against.
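To make those alternatives concrete, here is a hedged sketch (hypothetical helper, not from this PR) of the Guava retention policies mentioned above; only the builder configuration changes between them:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

// Illustrative alternatives to weakValues(): the retention policy is the only
// difference, so any of them could sit behind the same serializer interface.
final class CachePolicySketch {

  // Entries are dropped on a GC cycle once no response references them.
  static <K, V> Cache<K, V> weakValueCache() {
    return CacheBuilder.newBuilder().weakValues().build();
  }

  // Entries are dropped after a fixed time, regardless of references.
  static <K, V> Cache<K, V> timeBasedCache() {
    return CacheBuilder.newBuilder().expireAfterWrite(5, TimeUnit.SECONDS).build();
  }

  // At most a fixed number of entries are kept.
  static <K, V> Cache<K, V> boundedCache() {
    return CacheBuilder.newBuilder().maximumSize(1_000).build();
  }
}
```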
I'm okay with this approach due to the low-risk failure mode, as long as we properly document the intended behavior.
I'll leave it up to you to decide whether you're okay with this being mostly untested - you'll be the one relying on this, not me.
Ok, so I'm willing to try. We've had this solution deployed for some time now and it works great for us.
I applied your code review comments, please take a look. Thanks!
During java-control-plane performance debugging I noticed that the proto with resources is built on every response, which is then stored in lastResponse. This puts huge pressure on the GC when you run java-control-plane with many Envoys that receive the same set of resources (for example: edge Envoys). That is why I introduced a Guava cache with weakValues(). As long as any DiscoveryRequestStreamObserver#lastResponse keeps the value, it will be served from the cache; otherwise it will be wiped with the next GC.
I ran a test with 300 Envoys connected over ADS to a single instance of our control plane. Every Envoy gets the whole state of our discovery, which is ~700 clusters and a couple of thousand endpoints. Here is the result: before this change is the instance with the green line; on the right is the instance with this change. As you can see, this is a huge performance improvement.
I think that this "feature" should be internal; that's why I've made ProtoResourcesSerializer package-private and static. If you are afraid that it will break something, I can change it to an interface, introduce DefaultProtoResourcesSerializer which keeps the current behaviour and CachedProtoResourcesSerializer which has the cache, and then add another parameter to the DiscoveryServer constructor.
Signed-off-by: Jakub Dyszkiewicz jakub.dyszkiewicz@allegro.pl
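For reference, a hypothetical sketch (not the PR's actual code) of the interface split described in the PR description above: a serializer interface, a default pass-through implementation, and a cached implementation backed by a weak-values Guava cache:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.protobuf.Any;
import com.google.protobuf.Message;
import java.util.concurrent.ExecutionException;

// Hypothetical sketch of the split described above.
interface ProtoResourcesSerializerSketch {
  Any serialize(Message resource);
}

// Keeps the current behaviour: build the Any proto on every call.
final class DefaultSerializerSketch implements ProtoResourcesSerializerSketch {
  @Override
  public Any serialize(Message resource) {
    return Any.pack(resource);
  }
}

// Reuses the packed proto while something (e.g. lastResponse) still references it.
final class CachedSerializerSketch implements ProtoResourcesSerializerSketch {
  private final Cache<Message, Any> cache =
      CacheBuilder.newBuilder().weakValues().build();

  @Override
  public Any serialize(Message resource) {
    try {
      return cache.get(resource, () -> Any.pack(resource));
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }
}
```

The DiscoveryServer constructor could then take whichever implementation the user prefers, which is what makes the caching behaviour optional.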