dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License

Optimistic Concurrency Lock Support in `IDistributedCache` #47596

Open cilerler opened 1 year ago

cilerler commented 1 year ago

In large-scale distributed systems with many instances, updating values in a distributed cache can be challenging due to the risk of concurrent writes. Multiple instances may attempt to modify the same cache entry simultaneously, leading to overwrites and potential data loss.

Describe the solution you'd like

To address this challenge, I'd like to request the addition of optimistic concurrency lock support to the IDistributedCache interface, especially for the Redis implementation (ideally for all providers). This feature would allow instances to detect and resolve concurrent write conflicts when updating cache entries.

Proposed Solution

Implement an optimistic concurrency lock mechanism in IDistributedCache that uses versioning or token-based checks to detect concurrent writes. If a write conflict is detected, the instance can either retry the update after re-reading the latest value or handle the conflict in a predefined manner.

The solution should also include a flexible conflict resolution strategy to handle scenarios where multiple instances try to update the same cache entry simultaneously.

Additional context

Real-Life Scenario

Let's consider an e-commerce platform that uses IDistributedCache with Redis to cache product inventory levels. The platform has 100+ instances, and each instance can update inventory levels based on user purchases.

Imagine the following sequence of events:

  1. Instance A reads the inventory level for Product X from the cache, which is currently set to 10.
  2. Instance B reads the inventory level for Product X from the cache, which is also 10.
  3. Instance A updates the inventory level to 9 after processing a purchase of Product X.
  4. Instance B updates the inventory level to 9 after processing another purchase of Product X.

In the absence of an optimistic concurrency lock mechanism, the final inventory level in the cache would be 9, even though two purchases were made. This can lead to stock discrepancies and ultimately impact the customer experience.
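For reference, a conflict check like this can be sketched today by dropping down to StackExchange.Redis directly, bypassing IDistributedCache; the key name and retry policy below are invented for the example:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Sketch only: a compare-and-swap retry loop using a StackExchange.Redis
// transaction condition, since IDistributedCache has no equivalent today.
static async Task<bool> TryDecrementStockAsync(
    IDatabase db, string key = "inventory:product-x", int maxRetries = 3)
{
    for (var attempt = 0; attempt < maxRetries; attempt++)
    {
        RedisValue current = await db.StringGetAsync(key);
        if (current.IsNull || (int)current <= 0)
            return false; // no entry, or out of stock

        var tran = db.CreateTransaction();
        // The transaction commits only if the value is still what we read,
        // so a concurrent write by another instance cannot be lost silently.
        tran.AddCondition(Condition.StringEqual(key, current));
        _ = tran.StringSetAsync(key, (int)current - 1);
        if (await tran.ExecuteAsync())
            return true; // our write won

        // Another instance won the race; loop re-reads and retries.
    }
    return false;
}
```

With a check like this, the sequence above ends at 8: whichever instance loses the race re-reads the value 9 and retries.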

adityamandaleeka commented 1 year ago

@mgravell @ReubenBond

ReubenBond commented 1 year ago

I agree with the idea. I would suggest token-based updates, because a token-based API can support numeric versions opaquely.

Additionally, I suggest we do not implement internal retry or conflict resolution schemes and instead use Try* style APIs. Retry can be implemented at a higher layer or via a wrapper.
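As a rough sketch of that shape (every name below is invented; nothing like this exists in the framework today):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical Try*-style surface; the token is opaque to the caller.
public interface ITokenDistributedCache
{
    Task<(byte[]? Value, object? Token)> TryGetAsync(
        string key, CancellationToken ct = default);

    // Returns false when the entry changed since the token was issued.
    Task<bool> TrySetAsync(
        string key, byte[] value, object? token, CancellationToken ct = default);
}

// Retry/conflict handling lives in a wrapper, not in the cache itself.
public static class TokenCacheExtensions
{
    public static async Task<bool> TryUpdateAsync(
        this ITokenDistributedCache cache, string key,
        Func<byte[]?, byte[]> update, int maxRetries = 3,
        CancellationToken ct = default)
    {
        for (var attempt = 0; attempt < maxRetries; attempt++)
        {
            var (value, token) = await cache.TryGetAsync(key, ct);
            if (await cache.TrySetAsync(key, update(value), token, ct))
                return true; // no concurrent write detected
        }
        return false; // repeated conflicts; caller decides what to do
    }
}
```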

mgravell commented 1 year ago

A part of me is itching about this a little at the conceptual level, meaning: if "overwrites and potential data loss" are a concern, this is less of a pure cache and more of a secondary database layer. That doesn't mean I don't see the utility, though - I absolutely do.

Implementation wise, I don't think we can do anything useful with IDistributedCache here; it is impossible to push a new token API into all of the required methods, so IMO we'd really be talking about an entire new API - IDistributedCache2, IVersionedDistributedCache or something (naming is hard). And if we're doing that, it would be an opportunity to reconsider the API generally - for example, to avoid the byte[] problem (taking ReadOnlyMemory<byte> inwards, and ideally some kind of lease outwards - maybe IMemoryOwner<byte>, maybe a custom readonly struct : IDisposable).

All doable, and all with merit. But needs some thought.
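For the sake of discussion, that might look something like this (a sketch only; all names and members are invented):

```csharp
using System;
using System.Buffers;
using System.Threading;
using System.Threading.Tasks;

// Invented sketch of the "entire new API" idea; nothing here exists today.
public interface IVersionedDistributedCache
{
    // Payload flows inwards as ReadOnlyMemory<byte> rather than byte[].
    ValueTask<bool> TrySetAsync(string key, ReadOnlyMemory<byte> value,
        CacheVersion version, CancellationToken ct = default);

    // The payload flows outwards as a lease the caller must dispose.
    ValueTask<(IMemoryOwner<byte>? Value, CacheVersion Version)> GetAsync(
        string key, CancellationToken ct = default);
}

// Opaque version token; a readonly struct keeps the common path allocation-free.
public readonly struct CacheVersion
{
    // Provider-specific contents (numeric version, etag, etc.) omitted.
}
```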

johnkwaters commented 1 year ago

NCache, SQL Server, and Redis all have the concept of locks or leases - I think it would be a useful abstraction for a distributed cache to have a lock/release mechanism for reading and updating. For instance, in Redis you can do LockTake/LockRelease, in SQL Server sp_getapplock/sp_releaseapplock, and in NCache cache.Lock/Unlock. How about adding these additional methods to IDistributedCache for managing locks, maybe also an override on Get and Set that tries to acquire (Get) and release (Set) a lock? I know this is a pessimistic lock and not the optimistic lock the title of this post is asking for, but I feel it's strongly related.
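For example, against Redis via StackExchange.Redis today (LockTakeAsync/LockReleaseAsync are real APIs; the key name and timeout are invented for the sketch):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Sketch of the pessimistic pattern: take the lock, read/update, release.
static async Task UpdateUnderLockAsync(IDatabase db, Func<Task> update)
{
    RedisKey lockKey = "inventory:product-x:lock"; // invented example key
    RedisValue owner = Guid.NewGuid().ToString();  // identifies this holder

    if (!await db.LockTakeAsync(lockKey, owner, TimeSpan.FromSeconds(10)))
        throw new InvalidOperationException("Could not acquire lock.");
    try
    {
        await update(); // read + modify + write while holding the lock
    }
    finally
    {
        await db.LockReleaseAsync(lockKey, owner); // only the owner can release
    }
}
```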

mgravell commented 1 year ago

Current position is that we're going to try to look at a wide range of "distributed cache" overhaul things in net9, so I think this is destined for re-examination as part of that work.

zhoucm666 commented 1 year ago

Please also consider batch caching

ndenkha commented 5 months ago

Currently, I'm using IDistributedCache on .NET 8, and we're getting lots of SQL Server database blocking due to the numerous updates on every record in the state table. I was hoping such locking/versioning would've been part of .NET 8, but I see I'll need to wait and hope it will be in .NET 9. In the meantime, I plan to implement a custom DB caching mechanism to work around all of these update calls causing the blocking. I hope this gets implemented in .NET 9.

YohanSciubukgian commented 5 months ago

Not sure if it fits your needs, but there is a 3rd-party library that manages these kinds of use cases: https://github.com/madelson/DistributedLock

It might be an interesting entry point for aspnetcore API design, as it supports many implementations.

mgravell commented 5 months ago

@ndenkha to confirm: you're using SQL Server as the cache backend, and seeing locking in the cache table, is that right? That is very intriguing, but it may indeed be helped by HybridCache in .NET 9, which should reduce traffic to the underlying cache backend. I'd be interested in a bit more context (maybe a typical example) of your cache usage, though.

If the locking is unrelated to cache and you simply wanted somewhere for centralized locking so you can reduce load on some other data: that isn't currently a feature; that might be something we can add to HybridCache, but there is no suitable API on IDistributedCache for a conditional set, i.e. "set if no value currently" and "remove if this value" (which are the minimum APIs you need for locking using a key/value store).
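For reference, both primitives can be sketched against Redis via StackExchange.Redis (the Lua script is the standard check-and-delete idiom; none of this is an IDistributedCache API):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// "Set if no value currently": SET ... NX, exposed as When.NotExists.
static Task<bool> TryAcquireAsync(IDatabase db, string key, string owner, TimeSpan ttl)
    => db.StringSetAsync(key, owner, ttl, When.NotExists);

// "Remove if this value": a script makes the check-and-delete atomic.
static async Task<bool> TryReleaseAsync(IDatabase db, string key, string owner)
{
    var result = await db.ScriptEvaluateAsync(
        "if redis.call('get', KEYS[1]) == ARGV[1] " +
        "then return redis.call('del', KEYS[1]) else return 0 end",
        new RedisKey[] { key }, new RedisValue[] { owner });
    return (int)result == 1;
}
```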

ndenkha commented 5 months ago

Hi @mgravell, yes, we're using SQL Server for distributed caching, and when there are over 60k entries in the cache table (MyAppSessionStates), we start seeing lots of database blocking. This table is, of course, what ASP.NET spits out of the box. We see these UPDATE statements repeatedly getting executed, and it gets worse when we have over 100k records in the table.

```sql
UPDATE [dbo].[MyAppSessionStates]
SET ExpiresAtTime =
    (CASE WHEN DATEDIFF(SECOND, @UtcNow, AbsoluteExpiration) <= SlidingExpirationInSeconds
          THEN AbsoluteExpiration
          ELSE DATEADD(SECOND, SlidingExpirationInSeconds, @UtcNow) END)
WHERE Id = @Id
  AND @UtcNow <= ExpiresAtTime
  AND SlidingExpirationInSeconds IS NOT NULL
  AND (AbsoluteExpiration IS NULL OR AbsoluteExpiration <> ExpiresAtTime);

SELECT Id, ExpiresAtTime, SlidingExpirationInSeconds, AbsoluteExpiration, Value
FROM [dbo].[MyAppSessionStates]
WHERE Id = @Id AND @UtcNow <= ExpiresAtTime;
```

To me, 100k records shouldn't be enough to cause blocking, right?

That said, I'm also going to reduce the amount of cached data to help mitigate this issue.

This issue could be related to the SQL missing this statement: `SET TRANSACTION ISOLATION LEVEL READ COMMITTED`.

Also, I don't understand why such updates occur so rapidly when our code isn't doing anything with the cached values.

ndenkha commented 5 months ago

Thanks, @YohanSciubukgian. I took a brief look at this library, and it reminds me of the EF6 locking we implemented in the past, which didn't improve things. With EF6, the generated SQL still didn't have the proper locking calls, and I'm worried this library would do the same thing. But I might give it a test drive.

mgravell commented 5 months ago

@ndenkha got it; so: that's the sliding expiration being applied eagerly as part of the read; it isn't really related to optimistic concurrency, but is a very valid observation; IMO we can fix that trivially by simply reading the sliding expiration into a variable during the read, and then only issuing the UPDATE if we even have sliding expiration, which many records will not. I'm the person to own that.
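In sketch form, the idea is roughly this (not the actual SqlServerCache code; the absolute-expiration cap in the real UPDATE is omitted for brevity):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Sketch: SELECT first, and only issue the UPDATE when the record
// actually uses sliding expiration - most records will skip it.
static async Task<byte[]?> GetAsync(SqlConnection conn, string id, DateTimeOffset utcNow)
{
    byte[]? value;
    long? slidingSeconds;
    using (var select = new SqlCommand(
        "SELECT Value, SlidingExpirationInSeconds FROM dbo.MyAppSessionStates " +
        "WHERE Id = @Id AND @UtcNow <= ExpiresAtTime;", conn))
    {
        select.Parameters.AddWithValue("@Id", id);
        select.Parameters.AddWithValue("@UtcNow", utcNow);
        using var reader = await select.ExecuteReaderAsync();
        if (!await reader.ReadAsync())
            return null; // missing or expired
        value = (byte[])reader[0];
        slidingSeconds = reader.IsDBNull(1) ? null : reader.GetInt64(1);
    }

    if (slidingSeconds is not null) // only sliding entries need the UPDATE
    {
        using var update = new SqlCommand(
            "UPDATE dbo.MyAppSessionStates SET ExpiresAtTime = " +
            "DATEADD(SECOND, SlidingExpirationInSeconds, @UtcNow) WHERE Id = @Id;", conn);
        update.Parameters.AddWithValue("@Id", id);
        update.Parameters.AddWithValue("@UtcNow", utcNow);
        await update.ExecuteNonQueryAsync();
    }
    return value;
}
```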

mgravell commented 5 months ago

@ndenkha ^^^

mgravell commented 5 months ago

On the wider optimistic concurrency thing: I think there's a fundamental mismatch here between "cache" and "database". In the situation presented in the top question, the problem isn't the cache - it is that the cache is being treated as the "real" database. IMO what should happen here is that the stock decrement is applied to the actual backing store (the source of truth), with the cache holding only a copy of that value.

There may be potential here to also store some optimistic token (rowversion, whatever) in the existing cache payload, potentially allowing some shortcuts during the stock allocation step.

Alternatively, there is a proposal somewhere that is more related to counters, i.e. interlocked (and range-checked) distributed increment/decrement. That primitive probably makes more sense for this specific use case.

Realistically, I think it would be very hard to retrofit versioning onto the existing distributed cache API, as it is an abstraction geared around opaque blobs; adding a new fundamental feature there is very awkward, as the required atomic checked update is not universally available to cache providers. But as part of a new counters backend API? Much simpler.
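To illustrate the counters idea: an interlocked, range-checked decrement is expressible today against Redis with a small Lua script (a sketch of the primitive, not a proposed API):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

// Sketch: atomically decrement, but only if the result stays >= 0.
static async Task<long?> TryDecrementAsync(IDatabase db, string key, long amount = 1)
{
    var result = await db.ScriptEvaluateAsync(
        @"local v = tonumber(redis.call('get', KEYS[1]) or '0')
          if v - tonumber(ARGV[1]) < 0 then return nil end
          return redis.call('decrby', KEYS[1], ARGV[1])",
        new RedisKey[] { key }, new RedisValue[] { amount });
    return result.IsNull ? null : (long)result; // null => would go negative
}
```

Applied to the inventory example at the top, two concurrent purchases would end at 8, with no read-modify-write race at all.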

cilerler commented 5 months ago

@mgravell, thanks for your insights. However, there seems to be some confusion between the example provided and the actual intent. The focus here is on caching, rather than using caching as a direct substitute for a database, which the example might have oversimplified. (You're right; I should present a pure caching scenario.) Nonetheless, the request remains valid.

Currently, I'm using LockTakeAsync from the Redis library to manage concurrency. This requires using the same IConnectionMultiplexer for all Redis operations and integrates IDistributedCache with locks. While this approach is specific to the Redis provider, similar methods could potentially be applied to others, ensuring a uniform implementation.

mgravell commented 5 months ago

LockTakeAsync is a different scenario again, as it is an exclusive "set if no current value"; this notably also isn't directly exposed via DC. It is different to optimistic concurrency, which is more "update/delete/whatever if the value hasn't changed" - similar to the lock-extend on the SE.Redis API. But that isn't available in DC or raw Redis (except perhaps via WATCH, which is different again). It is a complex problem when the atomic operations needed aren't necessarily features of the available backends, and aren't in the existing contract. It would be nice if we could fit this into HybridCache in a generalized way using the payload header, but even "check the first N bytes" isn't supportable in an atomic sense. We fundamentally need some new backend capabilities to support this. Which we can look at and consider, but it needs careful thought.