OrchardCMS / OrchardCore

Orchard Core is an open-source modular and multi-tenant application framework built with ASP.NET Core, and a content management system (CMS) built on top of that framework.
https://orchardcore.net
BSD 3-Clause "New" or "Revised" License

Autoroute startup performance bottleneck #6238

Open Piedone opened 4 years ago

Piedone commented 4 years ago

The Startup class of Autoroute fetches all the indices in a single query on shell start:

var autoroutes = session.QueryIndex<AutoroutePartIndex>(o => o.Published).ListAsync().GetAwaiter().GetResult();
entries.AddEntries(autoroutes.Select(e => new AutorouteEntry(e.ContentItemId, e.Path, e.ContainedContentItemId, e.JsonPath)));

While this is no problem for small sites, it becomes an issue with hundreds (and especially tens of thousands or more) of content items using AutoroutePart, because all of those items' index records are retrieved in one go. You can reach such numbers even with a blog where you post periodically, and especially with something like a news site (or just a large site with many content pages). This won't scale.

What we could do instead, for example, is make AutorouteAliasProvider look up aliases lazily and cache the results (even in a ConcurrentDictionary, but preferably in a proper, possibly distributed cache like the one @jtkech is working on). The first lookup for a path would incur the penalty of an SQL query, but this way we'd support an unlimited number of content items (available memory for caching permitting, and the cache can manage that), have no impact on startup performance (nor risk eventually causing DB timeouts there), and still provide the same performance for repeated lookups (which should cover all the popular content items).
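
As a rough illustration of what I mean (just a sketch: the class and method names are made up, only AutoroutePartIndex, AutorouteEntry and the YesSql ISession are the real types; usings for the Autoroute types omitted):

using System.Collections.Concurrent;
using System.Threading.Tasks;
using YesSql;

public class LazyAutorouteAliasProvider
{
    private readonly ISession _session;

    // Popular paths end up here, so repeated lookups don't touch the database.
    private readonly ConcurrentDictionary<string, AutorouteEntry> _cache =
        new ConcurrentDictionary<string, AutorouteEntry>();

    public LazyAutorouteAliasProvider(ISession session) => _session = session;

    public async Task<AutorouteEntry> GetByPathAsync(string path)
    {
        if (_cache.TryGetValue(path, out var entry)) return entry;

        // First lookup for this path: a single SQL query against the index.
        var index = await _session
            .QueryIndex<AutoroutePartIndex>(i => i.Published && i.Path == path)
            .FirstOrDefaultAsync();

        if (index == null) return null;

        entry = new AutorouteEntry(index.ContentItemId, index.Path, index.ContainedContentItemId, index.JsonPath);
        _cache.TryAdd(path, entry);
        return entry;
    }
}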

On a related note I think the current implementation will break in a multi-server setup too, if multiple nodes are creating content items.

Also, AutorouteEntries is not thread-safe: It uses a non-concurrent Dictionary for caching, and while it synchronizes writes to it, it doesn't do the same for reads (concurrent reads are only fine while the dictionary is not modified at the same time).
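
One way to fix that without taking a lock on every read is to publish immutable snapshots, along these lines (sketch only; the member names are invented, AutorouteEntry is the real type):

using System.Collections.Generic;
using System.Collections.Immutable;

public class SafeAutorouteEntries
{
    private readonly object _writeLock = new object();

    // Writers publish a fresh immutable snapshot, so unsynchronized readers
    // can never observe a half-mutated dictionary.
    private volatile ImmutableDictionary<string, AutorouteEntry> _paths =
        ImmutableDictionary<string, AutorouteEntry>.Empty;

    public void AddEntries(IEnumerable<AutorouteEntry> entries)
    {
        lock (_writeLock)
        {
            var builder = _paths.ToBuilder();
            foreach (var entry in entries)
            {
                builder[entry.Path] = entry;
            }
            _paths = builder.ToImmutable();
        }
    }

    public bool TryGetEntryByPath(string path, out AutorouteEntry entry)
        => _paths.TryGetValue(path, out entry);
}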

sebastienros commented 4 years ago

Might be a config option to preload instead of doing it by default. We'd also need to lock on the resolution for a specific url (gating) to prevent bursts of SQL queries on startup, for the homepage for instance. It might also be an MRU collection in the case of lazy loading so we don't store routes that are rarely used.
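
For the gating part, a per-path Lazy&lt;Task&gt; is one possible shape (rough sketch, everything here is illustrative; error handling and removing completed entries, or feeding them into the route cache, are left out):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class GatedAutorouteResolver
{
    private readonly ConcurrentDictionary<string, Lazy<Task<AutorouteEntry>>> _inFlight =
        new ConcurrentDictionary<string, Lazy<Task<AutorouteEntry>>>();

    private readonly Func<string, Task<AutorouteEntry>> _queryDatabase;

    public GatedAutorouteResolver(Func<string, Task<AutorouteEntry>> queryDatabase)
        => _queryDatabase = queryDatabase;

    public Task<AutorouteEntry> ResolveAsync(string path)
    {
        // All callers racing on the same path get the same Lazy, so only the
        // first one actually runs the query; the rest await its task.
        var lazy = _inFlight.GetOrAdd(
            path,
            p => new Lazy<Task<AutorouteEntry>>(() => _queryDatabase(p)));

        return lazy.Value;
    }
}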

Would you mind creating a site with thousands of autoroutes and measuring startup? I assume the simplest way is with a SQL query, all autoroutes pointing to the same action.

jtkech commented 4 years ago

Also, AutorouteEntries is not thread-safe

I agree, it has been like this from the beginning. It's refactored in my distributedCache branch, where we don't resolve autoroutes on startup but on first need (e.g. not for the home page), though still in one go when we first create the related cache, a volatile (cache-only) distributed document.

Hmm, if we populate the document lazily and there is no entry matching a given request path, how could we know whether the entry doesn't exist or just isn't populated yet, without always doing a query?

We'd also need to lock on the resolution for a specific url (gating) to prevent bursts of SQL queries

How can we do this if, on the other hand, we don't want to use custom locking systems? E.g. as in my old distributed branch, where there is an ILock service providing tenant-level named locks, with a distributed implementation based on Redis.
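
Something with roughly this shape, just to illustrate the idea (this is not the actual interface from that branch):

using System;
using System.Threading.Tasks;

public interface ILock
{
    // Acquires a named lock scoped to the current tenant. A local implementation
    // could wrap SemaphoreSlim; a distributed one could be backed by Redis.
    Task<IDisposable> AcquireLockAsync(string name, TimeSpan? timeout = null);
}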

It might also be an MRU collection in the case of lazy loading so we don't store routes that are rarely used.

Good idea
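
For example, a size-limited MemoryCache with a sliding expiration roughly approximates an MRU collection, so rarely used routes get evicted (sketch; the limits are arbitrary and AutorouteEntry is the real type):

using System;
using Microsoft.Extensions.Caching.Memory;

var routeCache = new MemoryCache(new MemoryCacheOptions
{
    SizeLimit = 10_000 // at most ~10k cached route entries per tenant
});

void CacheRoute(string path, AutorouteEntry entry)
{
    routeCache.Set(path, entry, new MemoryCacheEntryOptions
    {
        Size = 1,
        // Entries that haven't been read for a while fall out of the cache.
        SlidingExpiration = TimeSpan.FromHours(1)
    });
}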

Piedone commented 4 years ago

I've tested the impact with this query on SQLite:

WITH RECURSIVE
  for(i) AS (VALUES(1) UNION ALL SELECT i+1 FROM for WHERE i < count)
INSERT INTO AutoroutePartIndex (DocumentId, ContentItemId, Path, Published, Latest) SELECT i, '4xkqkmeg8zx2y5r9jxensthvmx', 'home-page', 1, 1 FROM for

And this one in a local SQL Server instance:

DECLARE @i int = 0
WHILE @i < count
BEGIN
    SET @i = @i + 1
    INSERT INTO AutoroutePartIndex (DocumentId, ContentItemId, Path, Published, Latest) VALUES (26, '4xkqkmeg8zx2y5r9jxensthvmx', 'home-page', 1, 1);
END

Results for SQLite, rough averages:

For SQL Server:

I don't know how these would change if actual content items were present. Perhaps there would be some difference due to the foreign keys all pointing to different documents.

So if you have a few items or a local DB this is fast. I'd imagine with a remote DB these go up significantly.

Hmm, if we populate the document lazily and there is no entry matching a given request path, how could we know whether the entry doesn't exist or just isn't populated yet, without always doing a query?

We'd cache negative results too. So the first lookup request for a path comes in and we look it up in SQL. Found? Put the AutorouteEntry into the cache. Not found? Put an object representing "not found" into the cache (maybe a private subclass of AutorouteEntry just for this purpose?). On the next request we'll read the same result from the cache. If the content changes then the cache will be updated, but that happens already.
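
Roughly like this (sketch only; the sentinel, names and placeholder query are illustrative, the real lookup would be the AutoroutePartIndex query):

using System.Collections.Concurrent;
using System.Threading.Tasks;

public class NegativeCachingAliasProvider
{
    // A single sentinel instance represents "looked up before, nothing found".
    private static readonly AutorouteEntry NotFound =
        new AutorouteEntry("", "", null, null);

    private readonly ConcurrentDictionary<string, AutorouteEntry> _cache =
        new ConcurrentDictionary<string, AutorouteEntry>();

    public async Task<AutorouteEntry> GetEntryAsync(string path)
    {
        if (_cache.TryGetValue(path, out var cached))
        {
            // A cached sentinel means the path has no autoroute, so the
            // database can be skipped entirely on repeated misses.
            return ReferenceEquals(cached, NotFound) ? null : cached;
        }

        var entry = await QueryDatabaseAsync(path); // one SQL lookup on first miss
        _cache.TryAdd(path, entry ?? NotFound);
        return entry;
    }

    private Task<AutorouteEntry> QueryDatabaseAsync(string path)
        => Task.FromResult<AutorouteEntry>(null); // placeholder for the real index query
}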

sebastienros commented 4 years ago

Your numbers seem to disprove the issue.

However, there might be cases where we want one or the other. I'm not sure it's worth implementing different logic right now, as even with 1M items the startup is totally reasonable, and adding lazy loading means trade-offs of its own.

I am not too worried about these issues because there are other ways in the system to force queries or allocate things, and this should not take the app down, just slow it down.

Maybe we could at least abstract this part if someone needs to implement some custom logic.

I think it's great to have some numbers for specific loads. Next we need to do the same with 1M content items ;) And I assume that the way the index tables are used is really paying off.

jtkech commented 4 years ago

We'd cache negative results too. So the first lookup request for a path comes in and we look it up in SQL. Found? Put the AutorouteEntry into the cache. Not found? Put an object representing "not found" into the cache (maybe a private subclass of AutorouteEntry just for this purpose?). On the next request we'll read the same result from the cache. If the content changes then the cache will be updated, but that happens already.

Good idea, but the problem is that we map on the following, so the autoroute transformer is executed e.g. for all static assets, meaning we would have many negative results:

        routes.MapDynamicControllerRoute<AutoRouteTransformer>("/{any}/{**slug}");

Piedone commented 4 years ago

I was curious whether this changes significantly in an actual production-like environment with a remote DB, so I've made some tests on Azure: 50 DTU (S2) standard tier SQL DB, B1 Web App.

So there seems to be a soft limit at around 4M records (note that there are no corresponding content items, so the actual limit is likely lower) in realistic scenarios without further tweaking.

However, there are some larger first-request times (not sure what to call them): I didn't want to include the first startup after adding new rows because I thought that's unrealistic for a production app (I only added it for the last three measurements when I noticed it might mean something). I'm not actually sure what the difference is between this first request and the subsequent ones. I'd guess SQL Server does some caching, but I don't know when that cache is invalidated. So as a worst-case scenario I also included these numbers in parentheses: maybe this is the number you'd actually see in production, maybe the other one. Interestingly, for 1M rows there was no difference (perhaps it's too large for a cache).

Note that these numbers (like yesterday's) are with zero load apart from the measurements. I also tried tenant restarts while the DB was at around 750% DTU and got 30-60s with around 3M records.

Anyway, I stand corrected: contrary to how it looks, this won't be an issue for quite a while. While technically the current implementation won't scale, it's still good enough even for a large site, unless it runs on significantly worse hardware/network than what I've used in Azure.

I don't have a good idea for your remark, @jtkech.

deanmarcussen commented 4 years ago

Cool, interesting to have some actual performance figures.

For info, it is trivial to use some of the API tests to create and connect to a SQL database, and create content items ad nauseam.

The integration tests currently use SQLite to create multiple tenant databases, and you can then take a database from the Tests/bin folder, move it to App_Data, and load it as an actual tenant.

Or connect the tests directly to a local or remote SQL Server and create multiple tenants with a million content items.

I did some of this when testing recipes, and could push some testing code to a branch if you are interested in a more real life test.

It would certainly be interesting to see. Let me know if you have the time to test them, and I will push a branch.

Here's a link to one of the existing tests https://github.com/OrchardCMS/OrchardCore/blob/dev/test/OrchardCore.Tests/Apis/ContentManagement/DeploymentPlans/ContentStepLuceneQueryTests.cs

In terms of the actual numbers you have already tested, a couple of thoughts.

But those startup times are still pretty reasonable.

I would be interested to see how startup performs when it is receiving 500+ requests/minute during a site recycle. This is an area that has caused us issues with ASP.NET Core, but not something I can test with Orchard Core.

Piedone commented 4 years ago

I don't have more time to play with this at the moment but thanks for the info.

sebastienros commented 4 years ago

Should we close this issue?

Piedone commented 4 years ago

Let's keep it open. It is an issue, just not an immediate one and not for everybody.

BTW I'm less concerned about the time a startup takes and more about potential timeouts. Depending on your configuration this can reach a level where the tenant is permanently stuck in a startup timeout, failing to start up at all. On Azure the DB connection timeout is 30s (which I'd argue is actually too high for a web app; you really want abnormally slow queries to fail faster), which is around the measurement at the 4M threshold. Eventually somebody will bump into this.