OrchardCMS / OrchardCore

Orchard Core is an open-source modular and multi-tenant application framework built with ASP.NET Core, and a content management system (CMS) built on top of that framework.
https://orchardcore.net
BSD 3-Clause "New" or "Revised" License

Autoroute startup performance bottleneck #6238

Open Piedone opened 4 years ago

Piedone commented 4 years ago

The Startup class of Autoroute fetches all the indices in a single query on shell start:

var autoroutes = session.QueryIndex<AutoroutePartIndex>(o => o.Published).ListAsync().GetAwaiter().GetResult();
entries.AddEntries(autoroutes.Select(e => new AutorouteEntry(e.ContentItemId, e.Path, e.ContainedContentItemId, e.JsonPath)));

While this is no problem for small sites, it becomes an issue with hundreds (and especially tens of thousands or more) of content items using AutoroutePart, because all of those items' index records are retrieved in one go. You can reach such numbers even with a blog where you post periodically, and especially with something like a news site (or just a large site with many content pages). This won't scale.

What we could do instead, for example, is make AutorouteAliasProvider look up aliases lazily and cache the results (even in a ConcurrentDictionary, but preferably in a proper, possibly distributed cache like the one @jtkech is working on). The first lookup for a path would incur the penalty of an SQL query, but this way we'd support an unlimited number of content items (available memory for caching permitting, and the cache can manage that), have no impact on startup performance (nor risk eventually causing DB timeouts there), and still provide the same performance for repeated lookups (which should cover all the popular content items).
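
As a rough illustration of what I mean (just a sketch: the class and method names are made up, only AutoroutePartIndex, AutorouteEntry and the YesSql ISession are the real types; usings for the Autoroute types omitted):

using System.Collections.Concurrent;
using System.Threading.Tasks;
using YesSql;

public class LazyAutorouteAliasProvider
{
    private readonly ISession _session;

    // Popular paths end up here, so repeated lookups don't touch the database.
    private readonly ConcurrentDictionary<string, AutorouteEntry> _cache =
        new ConcurrentDictionary<string, AutorouteEntry>();

    public LazyAutorouteAliasProvider(ISession session) => _session = session;

    public async Task<AutorouteEntry> GetByPathAsync(string path)
    {
        if (_cache.TryGetValue(path, out var entry)) return entry;

        // First lookup for this path: a single SQL query against the index.
        var index = await _session
            .QueryIndex<AutoroutePartIndex>(i => i.Published && i.Path == path)
            .FirstOrDefaultAsync();

        if (index == null) return null;

        entry = new AutorouteEntry(index.ContentItemId, index.Path, index.ContainedContentItemId, index.JsonPath);
        _cache.TryAdd(path, entry);
        return entry;
    }
}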

On a related note I think the current implementation will break in a multi-server setup too, if multiple nodes are creating content items.

Also, AutorouteEntries is not thread-safe: It uses a non-concurrent Dictionary for caching, and while it synchronizes writes to it, it doesn't do the same for reads (concurrent reads are only fine while the dictionary is not modified at the same time).
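
One way to fix that without taking a lock on every read is to publish immutable snapshots, along these lines (sketch only; the member names are invented, AutorouteEntry is the real type):

using System.Collections.Generic;
using System.Collections.Immutable;

public class SafeAutorouteEntries
{
    private readonly object _writeLock = new object();

    // Writers publish a fresh immutable snapshot, so unsynchronized readers
    // can never observe a half-mutated dictionary.
    private volatile ImmutableDictionary<string, AutorouteEntry> _paths =
        ImmutableDictionary<string, AutorouteEntry>.Empty;

    public void AddEntries(IEnumerable<AutorouteEntry> entries)
    {
        lock (_writeLock)
        {
            var builder = _paths.ToBuilder();
            foreach (var entry in entries)
            {
                builder[entry.Path] = entry;
            }
            _paths = builder.ToImmutable();
        }
    }

    public bool TryGetEntryByPath(string path, out AutorouteEntry entry)
        => _paths.TryGetValue(path, out entry);
}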

sebastienros commented 4 years ago

Might be a config option to preload instead of doing it by default. We'd also need to lock on the resolution for a specific url (gating) to prevent bursts of SQL queries on startup, for the homepage for instance. It might also be an MRU collection in the case of lazy loading so we don't store routes that are rarely used.
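
For the gating part, a per-path Lazy&lt;Task&gt; is one possible shape (rough sketch, everything here is illustrative; error handling and removing completed entries, or feeding them into the route cache, are left out):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class GatedAutorouteResolver
{
    private readonly ConcurrentDictionary<string, Lazy<Task<AutorouteEntry>>> _inFlight =
        new ConcurrentDictionary<string, Lazy<Task<AutorouteEntry>>>();

    private readonly Func<string, Task<AutorouteEntry>> _queryDatabase;

    public GatedAutorouteResolver(Func<string, Task<AutorouteEntry>> queryDatabase)
        => _queryDatabase = queryDatabase;

    public Task<AutorouteEntry> ResolveAsync(string path)
    {
        // All callers racing on the same path get the same Lazy, so only the
        // first one actually runs the query; the rest await its task.
        var lazy = _inFlight.GetOrAdd(
            path,
            p => new Lazy<Task<AutorouteEntry>>(() => _queryDatabase(p)));

        return lazy.Value;
    }
}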

Would you mind creating a site with thousands of autoroutes and measuring startup? I assume the simplest way is with a SQL query, all autoroutes pointing to the same action.

jtkech commented 4 years ago

Also, AutorouteEntries is not thread-safe

I agree, it has been like this from the beginning. It's refactored in my distributedCache branch, where we don't resolve autoroutes on startup but on first need (e.g. not for the home page), though still in one go when we first create the related cache, a volatile (cache-only) distributed document.

Hmm, if we populate the document lazily and there is no entry matching a given request path, how could we know whether the entry doesn't exist or just isn't populated yet, without always doing a query?

We'd also need to lock on the resolution for a specific url (gating) to prevent bursts of SQL queries

How can we do this if, on the other hand, we don't want to use custom locking systems? E.g. as in my old distributed branch, where there is an ILock service providing tenant-level named locks, with a distributed implementation based on Redis.
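
Something with roughly this shape, just to illustrate the idea (this is not the actual interface from that branch):

using System;
using System.Threading.Tasks;

public interface ILock
{
    // Acquires a named lock scoped to the current tenant. A local implementation
    // could wrap SemaphoreSlim; a distributed one could be backed by Redis.
    Task<IDisposable> AcquireLockAsync(string name, TimeSpan? timeout = null);
}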

It might also be an MRU collection in the case of lazy loading so we don't store routes that are rarely used.

Good idea
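
For example, a size-limited MemoryCache with a sliding expiration roughly approximates an MRU collection, so rarely used routes get evicted (sketch; the limits are arbitrary and AutorouteEntry is the real type):

using System;
using Microsoft.Extensions.Caching.Memory;

var routeCache = new MemoryCache(new MemoryCacheOptions
{
    SizeLimit = 10_000 // at most ~10k cached route entries per tenant
});

void CacheRoute(string path, AutorouteEntry entry)
{
    routeCache.Set(path, entry, new MemoryCacheEntryOptions
    {
        Size = 1,
        // Entries that haven't been read for a while fall out of the cache.
        SlidingExpiration = TimeSpan.FromHours(1)
    });
}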

Piedone commented 4 years ago

I've tested the impact with this query on SQLite:

WITH RECURSIVE
  for(i) AS (VALUES(1) UNION ALL SELECT i+1 FROM for WHERE i < count)
INSERT INTO AutoroutePartIndex (DocumentId, ContentItemId, Path, Published, Latest) SELECT i, '4xkqkmeg8zx2y5r9jxensthvmx', 'home-page', 1, 1 FROM for

And this one in a local SQL Server instance:

DECLARE @i int = 0
WHILE @i < count
BEGIN
    SET @i = @i + 1
    INSERT INTO AutoroutePartIndex (DocumentId, ContentItemId, Path, Published, Latest) VALUES (26, '4xkqkmeg8zx2y5r9jxensthvmx', 'home-page', 1, 1);
END

Results for SQLite, rough averages:

For SQL Server:

I don't know how these would change if actual content items were present. Perhaps there would be some difference due to the foreign keys all pointing to different documents.

So if you have a few items or a local DB this is fast. I'd imagine with a remote DB these go up significantly.

Hmm, if we populate the document lazily and there is no entry matching a given request path, how could we know whether the entry doesn't exist or just isn't populated yet, without always doing a query?

We'd cache negative results too. So the first lookup request for a path comes in and we look it up in SQL. Found? Put the AutorouteEntry into the cache. Not found? Put an object representing "not found" into the cache (maybe a private subclass of AutorouteEntry just for this purpose?). On the next request we'll read the same result from the cache. If the content changes then the cache will be updated, but that happens already.
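
Roughly like this (sketch only; the sentinel, names and placeholder query are illustrative, the real lookup would be the AutoroutePartIndex query):

using System.Collections.Concurrent;
using System.Threading.Tasks;

public class NegativeCachingAliasProvider
{
    // A single sentinel instance represents "looked up before, nothing found".
    private static readonly AutorouteEntry NotFound =
        new AutorouteEntry("", "", null, null);

    private readonly ConcurrentDictionary<string, AutorouteEntry> _cache =
        new ConcurrentDictionary<string, AutorouteEntry>();

    public async Task<AutorouteEntry> GetEntryAsync(string path)
    {
        if (_cache.TryGetValue(path, out var cached))
        {
            // A cached sentinel means the path has no autoroute, so the
            // database can be skipped entirely on repeated misses.
            return ReferenceEquals(cached, NotFound) ? null : cached;
        }

        var entry = await QueryDatabaseAsync(path); // one SQL lookup on first miss
        _cache.TryAdd(path, entry ?? NotFound);
        return entry;
    }

    private Task<AutorouteEntry> QueryDatabaseAsync(string path)
        => Task.FromResult<AutorouteEntry>(null); // placeholder for the real index query
}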

sebastienros commented 4 years ago

Your numbers seem to disprove the issue.

However, there might be cases where we want one or the other. I'm not sure it's worth implementing different logic right now, as even with 1M items the startup is totally reasonable, and adding lazy loading means trade-offs of its own.

I am not too worried about these issues because there are other ways in the system to force queries or allocate things, and this should not take the app down, just slow it down.

Maybe we could at least abstract this part if someone needs to implement some custom logic.

I think it's great to have some numbers for specific loads. Next we need to do the same with 1M content items ;) And I assume that the way the index tables are used is really paying off.

jtkech commented 4 years ago

We'd cache negative results too. So the first lookup request for a path comes in and we look it up in SQL. Found? Put the AutorouteEntry into the cache. Not found? Put an object representing "not found" into the cache (maybe a private subclass of AutorouteEntry just for this purpose?). On the next request we'll read the same result from the cache. If the content changes then the cache will be updated, but that happens already.

Good idea, but the problem is that we map on the following, so the autoroute transformer is executed e.g. for all static assets, meaning we would have many negative results:

        routes.MapDynamicControllerRoute<AutoRouteTransformer>("/{any}/{**slug}");

Piedone commented 4 years ago

I was curious whether this changes significantly in an actual production-like environment with a remote DB, so I've made some tests on Azure: 50 DTU (S2) standard tier SQL DB, B1 Web App.

So there seems to be a soft limit at around 4M records (note that there are no corresponding content items, so the actual limit is likely lower) in realistic scenarios without further tweaking.

However, there are some larger first-request times (not sure what to call them): I didn't want to include the first startup after adding new rows because I thought that's unrealistic for a production app (I only added it for the last three measurements when I noticed it might mean something). I'm not actually sure what the difference is between this first request and the subsequent ones. I'd guess SQL Server does some caching, but I don't know when that cache is invalidated. So as a worst-case scenario I also included these numbers in parentheses: maybe this is the number you'd actually see in production, maybe the other one. Interestingly, for 1M rows there was no difference (perhaps it's too large for a cache).

Note that these numbers (like yesterday's) are with zero load apart from the measurements. I also tried tenant restarts while the DB was at around 750% DTU and got 30-60s with around 3M records.

Anyway, I stand corrected: contrary to how it looks, this won't be an issue for quite a while. While technically the current implementation won't scale, it's still good enough even for a large site, unless it runs on significantly worse hardware/network than what I've used in Azure.

I don't have a good idea for your remark, @jtkech.

deanmarcussen commented 4 years ago

Cool, interesting to have some actual performance figures.

For info, it is trivial to use some of the API tests to create and connect to a SQL database, and create content items ad nauseam.

The integration tests currently use SQLite to create multiple tenant databases, and you can then take a database from the Tests/bin folder, move it to App_Data, and load it as an actual tenant.

Or connect the tests directly to a local or remote SQL Server and create multiple tenants with a million content items.

I did some of this when testing recipes, and could push some testing code to a branch if you are interested in a more real life test.

It would certainly be interesting to see. Let me know if you have the time to test them, and I will push a branch.

Here's a link to one of the existing tests https://github.com/OrchardCMS/OrchardCore/blob/dev/test/OrchardCore.Tests/Apis/ContentManagement/DeploymentPlans/ContentStepLuceneQueryTests.cs

In terms of the actual numbers you have already tested, a couple of thoughts.

But those startup times are still pretty reasonable.

I would be interested to see how startup performs when it is receiving 500+ requests/minute during a site recycle. This is an area that has caused us issues with ASP.NET Core, but not something I can test with Orchard Core.

Piedone commented 4 years ago

I don't have more time to play with this at the moment but thanks for the info.

sebastienros commented 4 years ago

Should we close this issue?

Piedone commented 4 years ago

Let's keep it open. It is an issue, just not an immediate one and not for everybody.

BTW I'm less concerned about the time a startup takes and more about potential timeouts. Depending on your configuration this can reach a level where the tenant is permanently stuck in a startup timeout, failing to start up at all. On Azure the DB connection timeout is 30s (which I'd argue is actually too high for a web app; you really want abnormally slow queries to fail faster), which is around the measurement at the 4M threshold. Eventually somebody will bump into this.