`SimpleCache` uses a `ReentrantReadWriteLock` to guard access to the snapshot map and `StatusInfo`. On a server with thousands of clients and frequent updates, this proves to be a bottleneck.

We measured `setSnapshot` sometimes taking up to 300ms. This could be because concurrent `createWatch` calls are holding the lock and delaying it, or it could be real work, since we have about 500k streams per control-plane server. If the latter is true, no watches can be created for those 300ms because the write lock is held, which is definitely not good.
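To illustrate the contention: while one thread holds the write lock for the whole of `setSnapshot`, no other thread can acquire even the read lock, so every `createWatch` stalls for the full duration. A minimal, self-contained sketch (the class and method names here are hypothetical, not part of `SimpleCache`):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ContentionDemo {
  // Returns whether a reader could get in while a writer holds the lock.
  public static boolean readerCanEnterWhileWriterHolds() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.writeLock().lock();                     // setSnapshot holds the write lock
    final boolean[] acquired = new boolean[1];
    // A createWatch on another thread would block here; tryLock shows why.
    Thread reader = new Thread(() -> acquired[0] = lock.readLock().tryLock());
    reader.start();
    try {
      reader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    lock.writeLock().unlock();
    return acquired[0];                          // false: the reader was shut out
  }
}
```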
I imagine the main synchronisation requirement is to prevent a watch from being created against a stale snapshot while a new one is being set. The current locking scheme is far more restrictive than that: one cannot even create two watches concurrently.
We should take a look at this and try to handle it in a better way.
Some ideas on how to improve this:
`CacheStatusInfo` is internally guarded by a lock that doesn't seem very necessary; we should drop it and use a `ConcurrentMap`.
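A sketch of that idea, assuming the status info boils down to per-node counters (the `StatusInfoStore` class and its methods are illustrative stand-ins, not the real `CacheStatusInfo` API): `ConcurrentHashMap` gives atomic per-key updates without any coarse lock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical replacement for an internally locked status store: updates
// are atomic per key, so readers and writers no longer serialize on one lock.
public class StatusInfoStore {
  private final ConcurrentMap<String, Long> watchCounts = new ConcurrentHashMap<>();

  // merge() is atomic for a given key; no external lock needed.
  public long recordWatch(String nodeGroup) {
    return watchCounts.merge(nodeGroup, 1L, Long::sum);
  }

  public long watchCount(String nodeGroup) {
    return watchCounts.getOrDefault(nodeGroup, 0L);
  }
}
```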
`SimpleCache` is the trickiest, but we could:
Use a read lock to guard `createWatch`, since we don't really mind watches being created concurrently. This is somewhat counter-intuitive, because we'd be taking a read lock while actually writing data, but I've seen this kind of use before. Then use the write lock only in `setSnapshot`. This strategy would improve things a bit, but it doesn't really solve our problem.
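A minimal sketch of the "read lock for writers" idea (all names here are hypothetical simplifications of `SimpleCache`, with a `String` standing in for the snapshot): `createWatch` takes the read lock so many watches can be registered in parallel, while `setSnapshot` takes the write lock so no watch can be created mid-swap. Because several `createWatch` calls now run at once, the watch map itself must be concurrent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WatchRegistry {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<Long, String> watches = new ConcurrentHashMap<>();
  private final AtomicLong ids = new AtomicLong();
  private volatile String snapshot = "v0";

  public long createWatch(String typeUrl) {
    lock.readLock().lock();        // shared: watches may be created in parallel
    try {
      long id = ids.incrementAndGet();
      watches.put(id, typeUrl + "@" + snapshot);
      return id;
    } finally {
      lock.readLock().unlock();
    }
  }

  public void setSnapshot(String version) {
    lock.writeLock().lock();       // exclusive: no createWatch can race the swap
    try {
      snapshot = version;
      watches.clear();             // stand-in for "respond to and drop watches"
    } finally {
      lock.writeLock().unlock();
    }
  }

  public int openWatches() {
    return watches.size();
  }
}
```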
We could probably take the read lock for the whole of `createWatch` but take the write lock only around `snapshots.put` in `setSnapshot`. I have to think about the implications of this a little more, but it seems there would be no conflicts, since `createWatch` would always have a consistent view of `snapshots`. A watch being created concurrently would either see the old snapshot version and manage to register its watches before `setSnapshot` can lock and insert the new snapshot, or, after that point, see the new version and have no problems, while `setSnapshot` continues to iterate through watches and respond accordingly.
We could remove almost all locks and provide another way to ensure that watches are responded to correctly when new snapshots are being created concurrently. Maybe make `createWatch` double-check whether a new snapshot was created while it was registering.
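The double-check variant could look something like this (again a hypothetical simplification, with a `volatile` version string standing in for the snapshot): `createWatch` records the version it saw, registers the watch, then re-reads the version; if `setSnapshot` slipped in between, the watch is answered immediately instead of waiting for a notification it may have missed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LockFreeRegistry {
  private final Map<Long, String> watches = new ConcurrentHashMap<>();
  private final AtomicLong ids = new AtomicLong();
  private volatile String version = "v0";

  /** Returns the version to respond with immediately, or null if the watch stays open. */
  public String createWatch() {
    String seen = version;
    long id = ids.incrementAndGet();
    watches.put(id, seen);
    // Double check: did setSnapshot run while we were registering?
    String now = version;
    if (!now.equals(seen)) {
      watches.remove(id);          // respond right away with the newer snapshot
      return now;
    }
    return null;                   // watch remains registered for a future snapshot
  }

  public void setSnapshot(String newVersion) {
    version = newVersion;          // publish first so racing createWatch calls see it
    watches.clear();               // stand-in for responding to all open watches
  }
}
```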
More ideas are welcome, but it seems this locking mechanism should be reworked to allow better performance; we're not even stressing our CPUs right now.
Edit: another idea would be to not drop watches every time we respond to them. This would decrease the lock contention a lot — has anyone thought about this before?
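To make the persistent-watch idea concrete (all names hypothetical): if responding appends to each watch's own response list instead of removing the watch from the shared map, then the shared map is only written on subscription, and responding never contends with `createWatch`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

public class PersistentWatches {
  private final Map<Long, List<String>> watchLog = new ConcurrentHashMap<>();
  private final AtomicLong ids = new AtomicLong();

  // Registering a watch is the only write to the shared map.
  public long createWatch() {
    long id = ids.incrementAndGet();
    watchLog.put(id, new CopyOnWriteArrayList<>());
    return id;
  }

  // Responding appends to each watch's own list instead of removing the watch,
  // so the shared map is only read here and never contends with createWatch.
  public void setSnapshot(String version) {
    watchLog.values().forEach(log -> log.add(version));
  }

  public List<String> responsesFor(long id) {
    return watchLog.get(id);
  }
}
```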