Deletes for all indices when publishing draft

Piedone commented 3 years ago

This might be more of a YesSql thing.

When you publish a content item's draft version then SQL DELETEs will be issued for the given document for all ContentItem indices, even if the item has nothing to do with it (like LayerMetadataIndex for a Page). You can see this e.g. when publishing an Article with a vanilla Blog recipe setup. Actually, if you change something on the item and then publish it then there will be a DELETE twice for all indices.

This is OK if you have a couple of index tables but this can add up quickly, especially with the mentioned duplication (in our custom app I've also seen the same index tables getting a DELETE not just twice but something three or even five times but I haven't yet pinpointed why). In our app we have about 60 indices thus this causes a significant performance penalty: Saving a Page (the same content type as in the Blog recipe) takes about half a second on my machine with a local DB, in production >5s.

Unless this is a bug in how Orchard uses YesSql then probably a good approach would be to "probe" the document somehow before deletion to see how many index providers are contributing to it and only issuing a DELETE for those that are affected. Or, at least bundle all the queries into a single SQL request so the latency of the server roundtrip is not incurred multiple times (as the queries themselves are quite trivial, especially for tables where there's no matching row, I'd guess this would be faster than doing the DELETEs one by one just for the affected indices.

I guess there's a similar issue with direct content item deletions too.

deanmarcussen commented 3 years ago

This is actually how YesSql works, by design.

Suggest you open an issue there if you want to suggest changes to that design.

One thing you can do (and we're only just starting to make the changes to the indexes in OC to support this), to improve performance is to include the DocumentId on a SQL index of the index table.

Refer https://github.com/OrchardCMS/OrchardCore/pull/7859 for some recent discussion on this.

I've also seen the same index tables getting a DELETE not just twice but something three or even five times but I haven't yet pinpointed why

This bit needs to be pinpointed

Piedone commented 3 years ago

Thanks for your quick reply, Dean.

Can you point me to something that sheds some light on why would this be by design? Deleting all indices when a given document is deleted I of course understand. I can also understand why one would implement this as a delete issued for all indices of the given document type though I think that can be optimized by checking whether that particular item actually has a given index, as I've explained above (possibly that can't be bulletproof though and it's better to just have all the deletes). However, I don't see why issuing these deletes one by one would make anything better. As apparent, these queries can add up and make all write operations slower by a lot; simply batching these up (meaning concatenating all DELETE queries into a single string, divided by semicolons) would eliminate the repeated penalty of the network round-trip and make this disappear unless there is a huge number of index tables.

deanmarcussen commented 3 years ago

Can you point me to something that sheds some light on why would this be by design?

Seb will describe the why, but when you look at how YesSql operates, it is intended to clear all indexes when saving. Otherwise it would have to make a query to know what is in these indexes, to do any comparison. Otherwise how would it know what might have changed?

However, I don't see why issuing these deletes one by one would make anything better

I'm not sure you're actiually getting roundtrips on this. I think they're just executes. But opening an issue on YesSql is the way to go here.

Now, what you could do since you have so many indexes (and a hint on the time it takes to process), and this would help me out with the evaluation of what indices are useful on these index tables.

Try adding a non clustered index with the DocumentId in it (which should be used for the deletes - you can see in profiler if it's been effective), on all of your index tables.

Something like

CREATE NONCLUSTERED INDEX [IDX_AutoroutePartIndex_DocumentId] ON [dbo].[AutoroutePartIndex]
(
    [DocumentID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

And let me know if that changes the performance at all?

I will also comment that we find document saves slow as well, so whatever we can do to improve it is good.

Piedone commented 3 years ago

Otherwise how would it know what might have changed?

That's the "probing" I mentioned but as I said later, I can accept that being error-prone. However, I don't see why these need to be individual queries; Mini Profiler shows them as such too, and it's apparent from the query time requirements that the network roundtrip to the DB server's penalty is incurred every time.

Without the roundtrip, the queries would be fine. The DocumentId columns do have SQL indices or the tables are really small. The queries themselves are mostly trivial, what makes this a real performance issue is that they're issued one by one.

I've opened a YesSql issue: https://github.com/sebastienros/yessql/issues/310 However, I'd keep this open as I don't think these queries should be issued twice for a content item save. Should I find the cause of why we sometimes see even more duplication for these deletions I'll post an update here.

jtkech commented 3 years ago

Related to #7859 and #7757 where we discussed about unclustered indexes, here some parts of the discussion

As i saw, if we have an unclustered index on A1, A2, An, the only way to get is used is when all the required columns for the produced SELECT OR for a produced JOIN are included in A1, A2, An

Then, and only if the above condition is true, in the WHERE / OR / AND parts, an unclustered index may be used if it includes some of the colums that are checked, and / or a clustered one for the other columns

So, when we do a QueryIndex(...) as the produced select is on all columns, unclustered indexes are never used and seem to be useless, unless we define an index including all columns

Then, when we do a query on Document with an inner join on LocalizedContentItemIndex, so on DocumentId, an unclustered index may be used if it includes DocumentId (Note: then see 2. above)

Yes you're right, but as i saw, this unclustered index including DocumentId is used when an index record is deleted, not when inserted with new values and then updated with the right DocumentId. But yes, still useful on updating as we also need to delete previous record(s).

One rule could be to include DocumentId, then all columns but excluding those of too large size. Hmm, unless most of the colums (e.g. all columns) are not used in a joined query, as for AutoroutePartIndex where it's better to just include DocumentId, knowing that when deleting an index record, only WHERE DocumentId is used.

So it's related to the CreateIndex() we do for index tables in migrations, where DocumentId is important for joined query from ContentItem to SomeIndexTable, and also when deleting content items. Without at least DocumentId, unclustered indexes are not used, and it falls back to a clustered scan

SchemaBuilder.AlterIndexTable<LocalizedContentItemIndex>(table => table
    .CreateIndex("IDX_LocalizationPartIndex_DocumentId",
        "DocumentId", // <= At least this one
        "LocalizationSet",
        "Culture",
        "ContentItemId",
        "Published",
        "Latest")
);

Maybe i will start a PR to update these indexes.

deanmarcussen commented 3 years ago

The DocumentId columns do have SQL indices or the tables are really small

Just checking that you are sure of that @Piedone .

I'm quite interested in the perf difference if they don't. Particularly as you have so many index tables, so it would be a great indicator of how much difference it would make.

On the YesSql issue you said it was the primary key being checked. It's not, on an index table, there is an Id, which is the primary key, and the DocumentId which is not a primary key, and not indexed by default.

So you would have had to make those indexes for them yourself.

Maybe i will start a PR to update these indexes. @jtkech yes!

Piedone commented 3 years ago

You're right, it's not a primary key indeed, I mixed it up, so scratch my comments about indices. We'll try SQL indices.

Piedone commented 3 years ago

I tried out with tables in our system (~3000 rows at most) and I didn't measure any significant performance advantage with an SQL index. I tried this in Azure SQL so maybe any difference is masked by the variance of the service's performance. I guess around 3000 rows is too small for an index to really have an effect. At this scale, the whole operation's performance is dominated by the base latency of the individual queries (so again, batching them up would pretty much eliminate the issue).

deanmarcussen commented 3 years ago

Thank you @Piedone that helps clear up how useful / priority that part of your issue is.

So @jtkech I figure worth doing when one of us has time

jtkech commented 3 years ago

@deanmarcussen @Piedone I already started something locally 2 days ago, will open a PR soon

@deanmarcussen i will do it step by step as we wil need to review carefully some interleaved migrations versions.

deanmarcussen commented 3 years ago

@deanmarcussen @Piedone I already started something locally 2 days ago, will open a PR soon

@deanmarcussen i will do it step by step as we wil need to review carefully some interleaved migrations versions.

@jtkech for infos.

I did a lot of performance tests against YesSql while we figured this out. With only a few items in the Document table, a DocumentId index on the MapIndexTable makes zero performance difference.

With 1,000 items in the Document table, and therefore 1,000 items in a MapIndexTable and a DocumentId index on the MapIndexTable, all the delete commands improve by about 50% on duration. (not round tripping, just milliseconds to execute)

Duration for deletes is a lot lower now due to all the batching improvements, but still worth doing I reckon.

jtkech commented 3 years ago

@deanmarcussen

With 1,000 items in the Document table, and therefore 1,000 items in a MapIndexTable and a DocumentId index on the MapIndexTable, all the delete commands improve by about 50% on duration. (not round tripping, just milliseconds to execute)

Cool !!!

Duration for deletes is a lot lower now due to all the batching improvements, but still worth doing I reckon.

Do you mean because of some recent changes on yesSql side?

but still worth doing I reckon.

Yes, anyway

deanmarcussen commented 3 years ago

Do you mean because of some recent changes on yesSql side?

Yes all commands now batched into one sql execution (so no more round trips).

jtkech commented 3 years ago

Cool 👍

OrchardCMS / OrchardCore

Deletes for all indices when publishing draft #7949