envoyproxy / java-control-plane

Java implementation of an Envoy gRPC control plane
Apache License 2.0

Review Locking in SimpleCache #111

Closed sschepens closed 3 years ago

sschepens commented 5 years ago

SimpleCache uses a ReentrantReadWriteLock to guard access to the Snapshot map and StatusInfo. In a scenario where a server has thousands of clients and updates are being made, this proves to be a bottleneck.

We measured our setSnapshot time and it sometimes takes up to 300ms. This could be because other createWatch calls hold the lock and delay it, or it could be the real time taken, because we do have about 500k streams per control plane server. If the latter is true, no watches can be created for 300ms while the write lock is held, which is definitely not good.

I imagine the most important synchronisation requirement is to prevent a watch from being created while a new snapshot is being set, which would leave that watch with an old version. The current locking scheme is much more restrictive than that, though: one cannot even create two watches concurrently.
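To make that requirement concrete, here is a minimal sketch of one way to keep "create watch" and "set snapshot" atomic with respect to each other for a single node group, without serialising unrelated groups. It is not the SimpleCache code or API: `PerGroupCacheSketch`, `GroupState`, the string versions and the `Consumer<String>` callback are all hypothetical stand-ins, and it assumes contention is spread across many groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical sketch (not the SimpleCache API): state is locked per group, not globally. */
final class PerGroupCacheSketch {

  /** Per-group state: latest snapshot version plus watches waiting for a newer one. */
  private static final class GroupState {
    String version = "";
    final List<Consumer<String>> openWatches = new ArrayList<>();
  }

  private final ConcurrentHashMap<String, GroupState> groups = new ConcurrentHashMap<>();

  /**
   * Responds immediately if the client's known version is stale; otherwise parks the
   * watch until the next setSnapshot for the same group.
   */
  void createWatch(String group, String knownVersion, Consumer<String> onResponse) {
    GroupState state = groups.computeIfAbsent(group, g -> new GroupState());
    String toSend = null;
    synchronized (state) {
      // The version check and the registration happen under the same monitor that
      // setSnapshot uses, so a watch cannot be parked against an outdated version.
      if (!state.version.equals(knownVersion)) {
        toSend = state.version;
      } else {
        state.openWatches.add(onResponse);
      }
    }
    if (toSend != null) {
      onResponse.accept(toSend); // callback runs outside the lock
    }
  }

  /** Stores the new version and answers every watch that was parked for this group. */
  void setSnapshot(String group, String newVersion) {
    GroupState state = groups.computeIfAbsent(group, g -> new GroupState());
    List<Consumer<String>> toNotify;
    synchronized (state) {
      state.version = newVersion;
      toNotify = new ArrayList<>(state.openWatches);
      state.openWatches.clear();
    }
    for (Consumer<String> watch : toNotify) {
      watch.accept(newVersion); // callbacks run outside the lock
    }
  }
}
```

With this shape, two createWatch calls for different groups never block each other, and a slow setSnapshot only delays watches for its own group. Whether that helps in practice depends on how clients are distributed across groups.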

We should take a look at this and try to handle it a better way.

Some ideas on how to improve this:

More ideas are welcome, but it seems this locking mechanism should be reworked to allow better performance. We're not even stressing our CPUs right now.

Edit: Another idea would be to not drop watches every time we respond to them. This would decrease the lock contention a lot. Has anyone thought about this before?
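A rough sketch of what "not dropping watches" could look like, purely for illustration. Everything here (`PersistentWatchSketch`, `open`, `close`, the string versions) is a hypothetical simplification rather than the existing cache API, and it assumes setSnapshot is called from one thread at a time. Each stream registers once and remembers the last version it was sent, so the shared watch map only changes when a stream opens or closes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical sketch: watches stay registered for the lifetime of a stream. */
final class PersistentWatchSketch {

  /** A long-lived watch that remembers the last version it delivered. */
  private final class Watch {
    private final Consumer<String> onResponse;
    private String lastSent = "";

    Watch(Consumer<String> onResponse) {
      this.onResponse = onResponse;
    }

    /** Sends the current version if this watch has not seen it yet. */
    synchronized void sync() {
      String current = version; // read under the per-watch monitor so sends stay ordered
      if (!current.equals(lastSent)) {
        lastSent = current;
        onResponse.accept(current);
      }
    }
  }

  private final Map<Long, Watch> watches = new ConcurrentHashMap<>();
  private volatile String version = "";

  /** Registers the watch once per stream, then catches it up if a snapshot already exists. */
  void open(long streamId, Consumer<String> onResponse) {
    Watch watch = new Watch(onResponse);
    watches.put(streamId, watch);
    watch.sync();
  }

  /** Removes the watch when the stream ends. */
  void close(long streamId) {
    watches.remove(streamId);
  }

  /** Assumed to be called by a single writer; no global lock is taken. */
  void setSnapshot(String newVersion) {
    version = newVersion;
    for (Watch watch : watches.values()) {
      watch.sync();
    }
  }
}
```

The only lock left is the per-watch monitor, so a response never touches shared state. The trade-off is that the callback runs while that one watch's monitor is held, and a real implementation would still have to deal with ACK/NACK handling and per-resource-type state, which this sketch ignores.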

slonka commented 3 years ago

The PR is merged, so I'm closing this. Please reopen if I'm missing something.