dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.

SQL Server deadlock with Microsoft.Extensions.Caching.SqlServer #28380

Open epignosisx opened 5 years ago

epignosisx commented 5 years ago

We've been running into SQL Server deadlocks when using Microsoft.Extensions.Caching.SqlServer for Session State.

We went through this before in our ASP.NET apps. It comes down to this query:

https://github.com/aspnet/Caching/blob/c6a7f611fb537cbc58d9ac3bbe0b61fabe79ae8a/src/Microsoft.Extensions.Caching.SqlServer/SqlQueries.cs#L55

public const string DeleteExpiredCacheItemsFormat = "DELETE FROM {0} WHERE @UtcNow > ExpiresAtTime";

This query can easily escalate to a table lock. To reduce the potential for table locks, the recommendation is to perform the delete in batches. For example:

-- Cap each DELETE at 50 rows so locks stay at the row/page level
-- instead of escalating to a table lock.
SET ROWCOUNT 50
delete_more:
     DELETE FROM {0} WHERE @UtcNow > ExpiresAtTime
IF @@ROWCOUNT > 0 GOTO delete_more
-- Reset so subsequent statements are unaffected.
SET ROWCOUNT 0
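
Note that SET ROWCOUNT applied to DELETE/INSERT/UPDATE is deprecated in newer versions of SQL Server; the same loop can be sketched with DELETE TOP (n) instead, keeping the {0} table placeholder from above:

-- Equivalent batch loop using DELETE TOP (n) instead of SET ROWCOUNT.
delete_more:
     DELETE TOP (50) FROM {0} WHERE @UtcNow > ExpiresAtTime
IF @@ROWCOUNT > 0 GOTO delete_more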

In ASP.NET apps this was very easy to change because these were stored procedures we could change ourselves. In ASP.NET Core the queries are in code and in classes that are not extensible (internal classes).

The potential for deadlocks is also higher in ASP.NET Core than it was in ASP.NET apps: there, the delete-expired-sessions query ran as a SQL Job on a single server, whereas in ASP.NET Core every instance runs the cleanup, so 10 instances of the app means 10 DELETE queries hitting the DB.

So my ask is that you make the queries easy to change. In the current state, we pretty much have to reimplement Microsoft.Extensions.Caching.SqlServer if we want to change them. Maybe the queries could be overridden via SqlServerCacheOptions, or at least just this one cleanup query.
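
For illustration, a hypothetical shape of that extension point; SqlServerCacheOptions exposes no such property today, so DeleteExpiredItemsQuery below is purely a sketch of the ask:

// Inside Startup.ConfigureServices. Hypothetical sketch only:
// SqlServerCacheOptions has no DeleteExpiredItemsQuery property today.
services.AddDistributedSqlServerCache(options =>
{
    options.ConnectionString = "<connection string>"; // placeholder
    options.SchemaName = "dbo";
    options.TableName = "AppCache"; // placeholder
    // Proposed extension point: let callers supply a batched cleanup
    // statement that avoids lock escalation.
    options.DeleteExpiredItemsQuery =
        "DELETE TOP (50) FROM dbo.AppCache WHERE @UtcNow > ExpiresAtTime";
});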

elvirdolic commented 4 years ago

What's the status of this issue? We are facing the same problems with a small DTU tier on Azure SQL.

msmaverick2018 commented 4 years ago

We are curious to know the status of this issue. We are planning to use SQL Server for caching and are concerned about running into this problem in production. Should we refrain from using this extension until the issue is resolved?

@epignosisx did you find a workaround for this issue?

epignosisx commented 4 years ago

@msmaverick2018 we disabled this cleanup query (there is a setting you can tweak) and added the optimized query as a SQL job.
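
Presumably the setting referred to is ExpiredItemsDeletionInterval; a minimal sketch of that workaround, assuming a very long interval is an acceptable way to keep the built-in cleanup from effectively ever running:

// Inside Startup.ConfigureServices. Sketch of the workaround: push the
// built-in cleanup interval far out and do the real cleanup in a SQL job.
// The 30-day value is an assumption, not from the original comment.
services.AddDistributedSqlServerCache(options =>
{
    options.ConnectionString = "<connection string>"; // placeholder
    options.SchemaName = "dbo";
    options.TableName = "AppCache"; // placeholder
    options.ExpiredItemsDeletionInterval = TimeSpan.FromDays(30);
});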

Ultimately we ended up switching to Redis. Our app used session state heavily and SQL was not able to keep up with 3500 active sessions. DB queries would start to queue and response time would go above 1 sec.

We spent a lot of time trying to make SQL work for us, since we had great success with it in the past in ASP.NET apps. We even forked the repo and modified it to support SQL Server in-memory (memory-optimized) tables, which do not support the datetimeoffset data type. It helped, but it was not enough.

The reality is that this SQL Server provider is a very simple implementation compared to ASP.NET's SQL Server session state provider. I looked into that provider, and you could tell a lot of time went into optimizing it.

Moving to Redis fixed all our issues. In our load tests we were able to reach 20K active sessions and Redis didn't break a sweat.
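
For reference, a minimal sketch of pointing session state at Redis via the Microsoft.Extensions.Caching.StackExchangeRedis package (connection string and instance name are placeholders):

// Inside Startup.ConfigureServices. Session state sits on top of
// IDistributedCache, so registering the Redis cache swaps the backing store.
services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = "my-redis:6379"; // placeholder
    options.InstanceName = "Session_";       // placeholder key prefix
});
services.AddSession();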

msmaverick2018 commented 4 years ago

@epignosisx thanks for responding and for the details. I agree with you; at the very least, a configuration option should have been provided to override the default DELETE query. Also, in a web farm, every application instance fires a delete request at the DB on a 30-minute interval. We do not have Redis available in our infrastructure yet, so we are planning to use SQL Server as an interim measure to cache data until we get there. Good to know that in-memory tables also did not help, since we were thinking of investing some time in evaluating that approach. Our concurrent load would be around 50 req/sec.

One of the things I am also curious about is the table design for the cache: it has a wide primary key on which the clustered index is created. We plan to use GUIDs as keys in our use case, and as expected, when I load data into the table the clustered index ends up around 99% fragmented. This will affect I/O and probably query performance; we will load test and evaluate. For now we are thinking of adding an index rebuild to the cleanup job (a sketch follows below). A few questions:

1. Did you run into issues with index fragmentation and its impact on performance?
2. How often were you able to run the batched delete job without hitting the deadlock issue?
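
A sketch of that index-maintenance step; the table and index names (dbo.AppCache, Index_ExpiresAtTime) are placeholders for whatever the cache schema actually uses:

-- Sketch: reorganize (cheaper, online) as part of the cleanup job.
ALTER INDEX ALL ON dbo.AppCache REORGANIZE;
-- For heavily fragmented indexes, a rebuild instead:
-- ALTER INDEX Index_ExpiresAtTime ON dbo.AppCache REBUILD;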

epignosisx commented 4 years ago

@msmaverick2018 We did not modify the primary key length because we were caching all sorts of things with different key lengths. We did modify the datetimeoffset columns to have precision 2 instead of 7; we found that much precision overkill. I would definitely encourage you to modify the table to match your needs.
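
A sketch of that precision change, assuming the default column names created by the sql-cache tooling (dbo.AppCache is a placeholder; if an index exists on ExpiresAtTime, drop it first and recreate it afterwards):

-- Shrink datetimeoffset precision from 7 to 2 on the cache table.
ALTER TABLE dbo.AppCache ALTER COLUMN ExpiresAtTime datetimeoffset(2) NOT NULL;
ALTER TABLE dbo.AppCache ALTER COLUMN AbsoluteExpiration datetimeoffset(2) NULL;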

We did not look into index fragmentation; our thinking was that, with this much churn in the table, it was going to be very hard to keep the indexes healthy.

We were running the delete query every 15 minutes, which is our average user session length.

ghost commented 3 years ago

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to set better expectations regarding different types of issues, you can read our Triage Process.

davidfowl commented 3 years ago

I assume people are still running into this issue?