MatteoPiovanelli-Laser commented 4 years ago

@HermesSbicego-Laser and I have been stressing and tuning a few things, as you might have guessed by our recent issues and prs. That involved running several parallel users creating, updating and looking at content. We were able to identify and fix a few deadlocks that had been cropping up here and there in some of our installations. A class of deadlocks we have been seeing the most has involved locks on the index for the Primary Key (PK__Orchard) of a table when a transaction has an UPDATE followed by a SELECT involving the same table. Naturally, we only see the deadlock when two transactions are attempting that same series of operations at the same time in this order:

U1: Transaction1.UPDATE
U2: Transaction2.UPDATE
S1: Transaction1.SELECT
S2: Transaction2.SELECT

As far as I understand it: U1 takes an exclusive lock on a row in PKOrchard. U2 takes an exclusive lock on a row in PKOrchard. S1 is doing a Clustered Index Scan on PKOrchard, and asks for a shared lock: when it gets to whatever row U2 is holding, S1 waits. S2 is doing a Clustered Index Scan on PKOrchard, and asks for a shared lock: when it gets to whatever row U1 is holding, S2 waits.

This results in a deadlock, and SQL server kills one of the Selects, causing that entire transaction to fail.

We have had some luck on solving those deadlocks by adding Non-Clustered Indexes that have as keys whatever columns are being used in the WHERE statements for the SELECTs: the idea for this is that the select statements won't try to lock on the PK__Orchard, and everything will be fine.

Then came TermContentItem. The image, from a trace gathered in SQL server, is pretty typical for this kind of deadlock. Each transaction involved in the deadlock is doing

INSERT INTO Orchard_Taxonomies_TermContentItem (Field, TermRecord_id, TermsPartRecord_id) VALUES ({A}, {B}, {C})
UPDATE Orchard_Taxonomies_TermContentItem SET TermsPartRecord_id = {C} WHERE Id = {ID}
select termconten0_.TermRecord_id as col_0_0_ from Orchard_Taxonomies_TermContentItem termconten0_ where termconten0_.TermsPartRecord_id={C} and termconten0_.Field={A}

The table already has the index IDX_TermsPartRecord_id_Field, with the following Index key columns: TermsPartRecord_id and Field. Ideally, this would be enough for SQL, but it's not because the select statement is fetching TermRecord_id. We can create a new index that has those same keys and includes the column we are trying to fetch:

CREATE INDEX IDX_GetTermsForContentItem
  ON Orchard_Taxonomies_TermContentItem (TermsPartRecord_id, Field)
  INCLUDE (TermRecord_id)

This fixes the deadlocks for us, meaning we aren't seeing them anymore.

These solutions have some issues which we find grating.

We aren't sure SQL server will always choose to use the non-clustered indexes wherever we followed this approach, meaning that we have no guarantee that the deadlocks won't simply reappear some other time, when SQL server decides for a different execution plan.
TermContentItem is a good example of the fact that these non-clustered indexes seem to have to be covering for the query involved in the deadlock. In this case this basically means that the index is a full copy of the table, because it has within it all columns.
Since these indexes seem to have to be covering, they may end up being super-specific for the queries which we discover are involved in the deadlocks. Whenever something changes (for example those queries are changed) those indexes will become obsolete.

We need @sebastienros to pitch in with his opinion on those, and we would like for him to involve whoever else he feels might contribute on this topic. The focus is of course not TermContentItem specifically: we are interested in finding a general approach to this kind of issue to further improve what we can do. Our goal is to increase the number of authenticated users concurrently authoring and viewing content.

MatteoPiovanelli-Laser commented 4 years ago

Quick update. I just did more tests, and in some of them SQL decided to lock on the clustered PK_Orchard index for CommonPartRecord rather than on the indexes we added to that table in #8362 . I definitely would not want to force SQL to use the indices with a query hint, but that leads us to this kind of behaviour on its end.

MatteoPiovanelli-Laser commented 4 years ago

List of issues related to this work:

8354

8359

8361

8365

8370

8372

List of PRs:

8360

8362

8364

8366

8368

8369

8371

8373

We didn't see deadlocks necessarily for all of those, but they are related to our current batch of performance improvements.

imranazad commented 4 years ago

This is great analysis work @MatteoPiovanelli-Laser and thanks for compiling the list of issues and PRs.

In #8339 you made some comments about using NOLOCK on the following tables:

Orchard_Framework_ContentItemVersionRecord, Title_TitlePartRecord, Orchard_Framework_ContentItemRecord, Common_CommonPartRecord, Orchard_Framework_ContentTypeRecord

I don't agree with this approach. It's possible uncommitted data could be read which would make the data wrong. We need to think about the implications of using NOLOCK on these tables to understand the risks better.

For example if there are processes that are using the data from these tables and then doing some computation on that whilst a transaction rolls back it could could lead to incorrect results.

I suppose the risk would always be dependant on the context, say for example these tables are used in some sort of booking system where the total number of places available is updated, in this case using NOLOCK would be a bad idea. I can't think of a concrete example where the risk would be minimal, maybe somebody could chip in?

I don't know Orchard or the Orchard database well enough but I feel there should be some detailed documentation around how these tables are interacted with and then an informed decision should be made with the documentation acting as the rational and justification for using NOLOCK.

Otherwise it's very risky for people to go ahead and use these changes as there are just too many unknowns which makes it extremely difficult to test. Say for example Orchard is being used in a clinical setting where clinical information is being published it becomes even more of a concern.

I have a keen interest in safety and software systems take for example the tragedy of the Therac-25 which resulted in deaths due to concurrent programming issues. Forgive me if I sound over zealous, I work in healthcare and safety is always of paramount importance.

MatteoPiovanelli-Laser commented 4 years ago

Fair points.

The first thing I want to say is that using NOLOCK on those tables can be enabled/disabled from a config file that is (can be) tenant-specific. Meaning that on the same multitenancy I could have different tables to which I am applying the NOLOCK hint on different tenants. What this means is that an application/tenant developer/admin can make a decision about that based on their own specific requirements.

On the tenant we are stress testing we use NOLOCK on those 5 tables, AutoroutePartRecord and IndexingTaskRecord (the latter 2 have their own provider setting that, which you can also find on the dev branch). We are monitoring those tables as well as others, and checking within the application, to make sure data stays consistent.

Arguments I have in favor of applying the NOLOCK hint on those 7 tables:

they (especially ContentItemRecord, ContentTypeRecord, ContentItemVersionRecord and CommonPartRecord) are the tables that are most commonly involved in any operation on content in Orchard, both update and read. That's from my experience on our systems. Anytime a part of the application is taking a lock on those, it's bound to be getting in the way of some other part of the application.
In the context of using Orchard as "just" a CMS, I think this is a robust choice: the state of interest to a content item will generally not depend on the precise state of other content items, and the "normal" process of managing content doesn't do anything that would mess that up that I am aware of (we fixed a lot of that).
These NOLOCK hints can mean trouble with dirty reads if the application is using information from ContentItems for something "more" AND the code supporting that application is written with the assumption that everything will always go smoothly in the database. #8368 is an example where we fixed something like that (even though we did not have dirty reads there, specifically). Note that we do this exact thing quite a bit, and so far we haven't had applications do anything weird.

As I said, the NOLOCK hints can be disabled (except those two, and we should push a change to make those controllable by config file too).

That said, I don't think NOLOCK should just be used anywhere, because I am sure that will mess things up. I'd always rather have a solution that allows me to get around concurrency issues through code rather than with a setting on the database. But If I wanted that, for example on ContentItemRecord, I would have to ensure that nowhere in the code a ContentItem is fetched from the database after and in the same transaction as an update to any ContentItem, and I don't know that that is even possible.

yorek commented 4 years ago

Have you tried already to set the database in READ COMMITED SNAPSHOT mode?

Also are this three statements

INSERT INTO Orchard_Taxonomies_TermContentItem (Field, TermRecord_id, TermsPartRecord_id) VALUES ({A}, {B}, {C})
UPDATE Orchard_Taxonomies_TermContentItem SET TermsPartRecord_id = {C} WHERE Id = {ID}
select termconten0_.TermRecord_id as col_0_0_ from Orchard_Taxonomies_TermContentItem termconten0_ where termconten0_.TermsPartRecord_id={C} and termconten0_.Field={A}

all part of the same explicit transaction (BEGIN TRAN....COMMIT TRAN)? If yes, have you already tried to set the transaction level for that transaction to REPEATABLE READ?

MatteoPiovanelli-Laser commented 4 years ago

READ COMMITTED SNAPSHOT We did a couple years ago when we tried a bunch of other things. I don't remember what the issue was, but there were some.

The statements are all part of the same transaction, as per Orchard's model that attempts to have a single transaction to handle an entire request. Those statements are only a small part of the entirety of the transactions they were in. It would be hard to find a generic way to compute the "correct" isolation level for every and each transaction, I think.

yorek commented 4 years ago

Got it. I'll install Orchard and take a look at what happens behind the scenes as soon as I have some "free" time, I'm curious now :)

OrchardCMS / Orchard

Deadlocks and performances #8367

8354

8359

8361

8365

8370

8372

8360

8362

8364

8366

8368

8369

8371

8373