OrchardCMS / OrchardCore

Orchard Core is an open-source modular and multi-tenant application framework built with ASP.NET Core, and a content management system (CMS) built on top of that framework.
https://orchardcore.net
BSD 3-Clause "New" or "Revised" License
7.37k stars 2.38k forks source link

Thread pool Starvation caused by sync over async pattern #13061

Closed ShaneCourtrille closed 1 year ago

ShaneCourtrille commented 1 year ago

We are currently working with Microsoft to resolve issues we're seeing with our production environment using OrchardCore. They've identified a problem with thread starvation caused by sync over async code within the OrchardCore.Data.YesSql.OrchardCoreBuilderExtensions.AddData method associated with this line.

image

The problem I can see is there doesn't appear to be a way to fix this as there isn't a way to make the call in the service provider an async one instead but I'm wondering if anyone else has an idea?

MikeAlhayek commented 1 year ago

@ShaneCourtrille just wondering about few things.

How many tenant are you running on the same instance? Have you check for latency between your web server and the database server? Is your database server responding fast or is there an issue with the connection?

@sebastienros StoreFactory is part of YesSql. Any ideas?

sebastienros commented 1 year ago

Did you mean to point to this code (easier than a screenshot to understand the context)? https://github.com/OrchardCMS/OrchardCore/blob/88405ae068da60181141bfd7bf1a3f28e4b08742/src/OrchardCore/OrchardCore.Data.YesSql/OrchardCoreBuilderExtensions.cs#L105

They've identified a problem with thread starvation caused by sync over async code within the

Maybe they didn't (identify the problem). This code is a singleton initialization. It's blocking one thread, it's run on the first request that accesses the database, so probably the first request to warmup the website. Then it's gone, never called again. Unless you have tons of tenants and they are all started at the same time, and you have a lot of concurrent calls to your website, then it's not the reason of thread pool starvation.

I can understand how someone might get to this conclusion, GetAwaiter().GetResult() is a bad pattern in a lot of cases, but here it's fine. And also ASP.NET configuration/initialization is not async, so everyone has to do it if you need to load things from the database where most APIs are async.

At what point of the app lifecycle is the issue occurring. And what does the log say about incoming requests when that happens (what kind of code is run)?

ShaneCourtrille commented 1 year ago

@sebastienros I haven't forgotten about this but I think we'll be contacting you outside of GitHub before I respond.

ShaneCourtrille commented 1 year ago

I'm trying to get additional details from Microsoft as well to see if they are seeing a lot of these calls in the dump they've analyzed. I'm starting to wonder if this is a "cold start" problem when many requests across many tenants hit a brand new instance.

sebastienros commented 1 year ago

I'm starting to wonder if this is a "cold start" problem when many requests across many tenants hit a brand new instance.

You should know when that happens, if it's during startup or after a while when the site is already warmed up.

mbarsott commented 1 year ago

Hello, I work on Shane's team and I am looking into this issue too. I know we have seen the high CPU issue at startup, because we witnessed it during a deployment, but the high CPU seems to happen at random times, not only startup. It may be a different problem when it happens at other times, we are not sure. We do have lots of tenants, but I don't feel comfortable disclosing the exact number, it may be confidential information.

mbarsott commented 1 year ago

I have posted this on the Gitter Orchard Core forum: We have experienced some events of 100% CPU utilization on our Orchard Core based application, on Azure App Service Linux containers. I was provided NetTrace and Memory Dump files for analysis. I could not figure much from the NetTrace file, other than something in the Microsoft.Extensions.DependencyInjection module was using almos all CPU.

On the dump file, VS reported that Microsoft.Extensions.DependencyInjection.OrchardCoreBuilderExtensions.AddDataAccess.AnonymousMethod_0_1 was causing 47 threads to perform asynchronous work, but waiting on synchronous calls to System.Threading.ManualResetEventSlim.Wait. Also according to the VS report, this can cause thread pool starvation and application hangs.

I looked into the AddDataAccess source code and noticed two usages of GetAwaiter().GetResult() on an anonymous method. The documentation for GetAwaiter says it should not be used on applications and it is destined to be used by the compiler only. Also, many articles on MSDN magazine and other blogs mention that the use of GetAwaiter().GetResult() can result in deadlocks and thread starvation. I believe this is the cause of our high CPU utilization in that specific episode.

This problem does not seem to be easily solved, as the code is calling asynchronous code from a synchronous method, and these method signatures cannot be easily changed because of the cascading effect of async/await propagation up the call stack. I studied the subject and came up with two alternative implementations. The first one uses Task.Run, Task.ContinueWith, and Task.WhenAll but I was not happy with it. The anonymous method needs to return an object that uses the results of the asynchronous methods, so before returning it, I had to use Task.Result, which I think still blocks the thread. The second approach uses Task.Factory.StartNew with TaskCreationOptions.AttachedToParent. In my understanding, this would be the closest I can get to async/await in synchronous code, as I understand it uses continuations in the same thread context of the parent thread (like await does).

Please let me know if it makes sense or if I am completely off the rails. If it makes sense I can fork the repo and create a PR.

MikeAlhayek commented 1 year ago

In ASP.NET, you can’t configure the services asynchronously. Configuring services is done on the tenant’s first request. Things to check for

  1. Is this method called more than once per tenant “for some unexpected reason”
  2. During the data configuration. Is there any connection validation happening? Our connection validator will open a database connection and query the Document table
  3. If the site is not configured, I think YesSql will create the document table.

If time permits, I’ll try to trace the step in which the DbValidation is done and at which point YesSql attempts to create the Document table. I am sure @sebastienros has good insight at that process if I don’t find time to dig into the steps

Skrypt commented 1 year ago

@mbarsott This AddDataAccess() method has normally only one call made per tenant. It is really unlikely to create the issues you are reporting unless this method is used somewhere else in your code. But, yes, if you want to create a POC in a Pull Request to illustrate your point it would be better.

mbarsott commented 1 year ago

@MikeAlhayek, I understand the service configuration happens on a request, and the required anonymous method to do that is expected to be synchronous. I also understand that no matter what I do in terms of creating threads, the result will need to be returned from the synchronous method, so the request will still be pending until everything is complete. What I am afraid of is that the GetAwaiter().GetResult() is being called directly on the asynchronous methods. My understanding is that the thread context on ASP.NET Core application requests is the threadpool context (please, correct me if I am wrong). So GetAwaiter.GetResult is not blocking any random thread, but rather the threadpool thread, which prevents threads from being created while the asynchronous method is not resolved. That is what I understood from the answer to this question: https://stackoverflow.com/questions/59018541/when-to-use-task-run-getawaiter-getresult-and-getawaiter-getresult. I am afraid if enough tenants are initialized concurrently, this can cause some type of deadlock or threadpool starvation. I was hoping to at least move the thread locking out of the threadpool context to avoid that.

MikeAlhayek commented 1 year ago

@mbarsott Lets assume you have 10k tenants running on the same instance and you restart the app. What happens is that the apps starts immediately. Now, "if every tenant make the initial request to their domain at the same time", you will have 10k requests to configure the 10k tenants. The call to GetAwaiter().GetResult() will keep the thread busy until the results are done "which should be super fast microseconds" because its just configuration and there is not requests made to external services like file system or database.

What I am saying here is that even if you have 10k+ events you won't have 10k configuration request at the same time "because each happen when the first request to the site comes in". Unless you find that the configuration step is doing actual blocking work like "database connection" open/query/close or opening a file in the file system.... this should not be an issue.

However, if you think you have a fix or a better way to do it, you can absolutely submit a PR that can be reviewed and possibly merged if it solve an actual problem.

mbarsott commented 1 year ago

This is the piece of the code with the two calls to GetAwaiter().GetResult(), directly on the async methods from the synchronous one, on OrchardCoreBuilderExtensions, lines 107 to 118:

var store = StoreFactory.CreateAndInitializeAsync(storeConfiguration).GetAwaiter().GetResult();
var options = sp.GetService<IOptions<StoreCollectionOptions>>().Value;
foreach (var collection in options.Collections)
{
    store.InitializeCollectionAsync(collection).GetAwaiter().GetResult();
}

var indexes = sp.GetServices<IIndexProvider>();

store.RegisterIndexes(indexes);

return store;

My first attempt on this was to change it to:

var options = sp.GetService<IOptions<StoreCollectionOptions>>().Value;
var indexes = sp.GetServices<IIndexProvider>();
var createAndInitializeTask = Task.Run<IStore>(async () => await StoreFactory.CreateAndInitializeAsync(storeConfiguration));
var continuationTask = createAndInitializeTask.ContinueWith(async (completedCreateAndInitializeTask) =>
{
    var store = completedCreateAndInitializeTask.Result;
    foreach (var collection in options.Collections)
    {
        await store.InitializeCollectionAsync(collection);
    }
    store.RegisterIndexes(indexes);
    return store;
});
var completedTasks = Task.WhenAll(continuationTask);
var store = completedTasks.Result[0].Result;
return store;

At least the asynchronous method was being called in a different thread, so it could resolve the problem of locking the threadpool thread. I was still not satisfied because I may still be locking the threadpool thread by calling Result on that context. I was happier with this second attempt:

var options = sp.GetService<IOptions<StoreCollectionOptions>>().Value;
var indexes = sp.GetServices<IIndexProvider>();
var createAndInitializeTask = Task.Factory.StartNew(async () =>
{
    return await StoreFactory.CreateAndInitializeAsync(storeConfiguration);
}, TaskCreationOptions.AttachedToParent);
var store = createAndInitializeTask.Result.Result;
Task.Factory.StartNew(async () =>
{
    foreach (var collection in options.Collections)
    {
        await store.InitializeCollectionAsync(collection);
    }
    store.RegisterIndexes(indexes);
}, TaskCreationOptions.AttachedToParent);

return store;

I step traced the code in the debugger and the results were what I expected, but I am still not sure I am not blocking the treadpool thread. Please let me know if my line of though makes sense and if you think any of these solutions could solve a potential problem with threadpool starvation.

mbarsott commented 1 year ago

If I am overthinking this, according to the response I mentioned above (https://stackoverflow.com/questions/59018541/when-to-use-task-run-getawaiter-getresult-and-getawaiter-getresult) it seems that at a minimum, it would be safer to replace the direct call to GetAwaiter().GetResult() for one wrapped in a Task.Run(). For example, replace var store = StoreFactory.CreateAndInitializeAsync(storeConfiguration).GetAwaiter().GetResult(); with var store = Task.Run<IStore>(async () => await StoreFactory.CreateAndInitializeAsync(storeConfiguration)).GetAwaiter().GetResult();

sebastienros commented 1 year ago

@mbarsott Can you try to change the code to var store = StoreFactory.Create(storeConfiguration) only?

This will assume that the database is already created though (the call to InitializeAsync creates the table). So you won't be able to create new tenants with that code, but that would suppress the GetAwaiter usage and would help us confirm it's the problem.

If that's the case we might try to find a way to call InitializeAsync in an async method later on in the middleware pipeline.

sebastienros commented 1 year ago

Another test would be to add a lock to prevent this code from being executed concurrently. With a static readonly object _synLock = new(); as the lock argument.

MikeAlhayek commented 1 year ago

@sebastienros I think I see the problem here. We currently the call StoreFactory.CreateAndInitializeAsync(storeConfiguration) actually opens a database connection to the server as you can see here https://github.com/sebastienros/yessql/blob/82c5579f4b2364af78c78f8a34b101d5152b8e8d/src/YesSql.Core/Store.cs#L118 since this is part of the service configuration, it has to be call synchronously which will cause the hold to the current thread until the connection is release "an attempt to create Document table for every collection". This will block current thread until the database work is done. If you have a slow database server, this can be a problem or if you have many tenants.

I think if we move CreateAndInitializeAsync() the call outside the configuration we can solve this issue. Maybe we only call Create(storeConfigs) instead of CreateAndInitializeAsync in the configuration, then keep track of the store state. Before executing any query, if the Store state is uninitialized ", we call InitializeAsync() first then resume the query. This would allow us to call InitializeAsync() using await since it is being called from outside the IoC configuration to avoid this issue.

MikeAlhayek commented 1 year ago

Funny, I just saw @sebastienros comment and it is exactly what I suggested in my previous comment. Tracking Store status will allow us to tell when we should call InitializeAsync() which will avoid this issue.

mbarsott commented 1 year ago

Sorry, I got stuck on other subjects and could not test the suggestions today. I will do it tomorrow.

MikeAlhayek commented 1 year ago

@sebastienros https://github.com/sebastienros/yessql/pull/473 may solve this problem. Can you please allow the PR workflow to process to see if all tests will pass? Also, can you please review the changes?

mbarsott commented 1 year ago

I looked at the changes, I think it solves the problem. It seems the PR is failing on Unit Tests, though.

MikeAlhayek commented 1 year ago

@mbarsott PR #13147 was merged. Can you please test out the latest preview and let us know if this fixed your actual issue?

ShaneCourtrille commented 1 year ago

@MikeAlhayek Is it possible for InitializeYesSqlServices to be called multiple times against the same store now? Thinking of a scenario where within a single scope there are multiple calls to ShellContextFactory.CreateDescribedContextAsync? All calls in that scope share the same IStore but it's already been initialized.. will this cause any problems?

MikeAlhayek commented 1 year ago

I don't think so. @jtkech are you able to confirm?

jtkech commented 1 year ago

Normally that's okay as any ShellContext (related to a given tenant) is built atomicaly, see below the usage of an async semaphore, and the call to CreateShellContextAsync() that will call the shell context factory CreateDescribedContextAsync().

https://github.com/OrchardCMS/OrchardCore/blob/6ab2a9c907028c3323903816bdfaa16972f7be53/src/OrchardCore/OrchardCore/Shell/ShellHost.cs#L80-L104

ShaneCourtrille commented 1 year ago

@jtkech What about these other direct callers of the CreateDescribedContextAsync() method?

DatabaseShellContextFactoryExtensions.GetDatabaseContextAsync SetupService.SetupInternalAsync DistributedShellHostedService.CreateDistributedContextAsync

ShaneCourtrille commented 1 year ago

@MikeAlhayek I'm still debugging but seeing an exception where I have _session.Store.TypeNames is null inside of YesSql while running a health check query (so host/Default tenant level) which accesses the OpenIdApplicationStore.ListAsync() method. I'll keep looking at it to see if I can figure out why that's happening.

MikeAlhayek commented 1 year ago

@ShaneCourtrille maybe something else is your problem. Check and make sure Store itself is not null for some odd reason since TypeNames is set during the initialization of the store.

jtkech commented 1 year ago

DistributedShellHostedService.CreateDistributedContextAsync is used when the related "distributed tenants" is enabled, it is a host service so for a given running instance it is not executed per request.

DatabaseShellContextFactoryExtensions.GetDatabaseContextAsync is only used when we configure the app to retrieve tenant settings from the database, for a given tenant loading and saving these settings are also done atomically in ShellSettingsManager.cs.

SetupService.SetupInternalAsync() is used on setup, so for a given tenant not intended to be called concurrently ;)

jtkech commented 1 year ago

So normally seems to be Okay, but yes need to be tried to see how it now behaves.

About TypeNames being null, maybe a new use case but need to look at it.

jtkech commented 1 year ago

Yes, as @MikeAlhayek said, and Store may be resolved as null if (see below)

services.AddSingleton(sp =>
{
    var shellSettings = sp.GetService<ShellSettings>();

    // Before the setup, a 'DatabaseProvider' may be configured without a required 'ConnectionString'.
    if (shellSettings.State == TenantState.Uninitialized || shellSettings["DatabaseProvider"] == null)
    {
        return null;
    }
mbarsott commented 1 year ago

Everything seems to be working fine on our application after this change, except for our custom health check endpoint. In that health check, we call multiple custom implementations of IHealthCheck. One of them checks if OpenId is working as expected. In order to do that, we inject an IOpenIdApplicationManager in the Health Check class. When our Health Check class calls ListAsync on the OpenIdApplicationManager, a NullReferenceException is thrown in the YesSql OpenIdApplicationStore, as the YesSql session it is using has a store with TypeNames set to null. The health check was working before this change. I don't understand how the session is not configured, if we get it from the YesSql store, which is a Singleton in the DI container. As far as I understand, the health check runs on the default tenant context, so it should have been properly initialized. I debugged the call to the health check endpoint, and the Add Scoped lambda function to register the session is called as expected, but the ServiceProviderEngineScope passed as the parameter to the lambda function has IsRootScope set to false. I would expect it to be set to true.

jtkech commented 1 year ago

@mbarsott

So after building a shell context (atomically) and before registering it (to be shared be other requests) we now do an async initialization (what we can't do in the registered implementation factory of a service).

Hmm, maybe a non awaited async call on your side, that would explain that somewhere ISession is resolved by resolving itself IStore but before its new async initialization was completed.

Not sure at all ;) But can you check it on your side? Or show us some of your code.


Just for info in case you can try it, I suggested a more global way in #13169 allowing to do the same for any service but directly in a module startup or with a new OrchardCoreBuilder extension as below.

builder.ConfigureServices(services =>
{
    ...
    // Configuring data access
    services.AddSingleton(sp =>
    {
        ...
        var store = StoreFactory.Create(storeConfiguration);
        ...
    });
    ...
})
.InitializeServices(async serviceProvider =>
{
    var store = serviceProvider.GetService<IStore>();
    if (store == null)
    {
        return;
    }

    await store.InitializeAsync();
    ...
});
mbarsott commented 1 year ago

Ok, it seems we finally understand the problem. We created a Health Check module that is added to the pipeline with a very high order (to be loaded after everything else). The problem is that when we reach the endpoints for this health check module, it does not go through the tenant resolution module, so a different DI container, that has not been initialized, is used. This seems to be because the Health Check module registers the endpoint in the middleware pipeline.

//Startup.cs for the health check module:
//...
        public override void ConfigureServices(IServiceCollection services)
        {
            var sp = services.BuildServiceProvider();
            var appSettings = sp.GetRequiredService<AppSettings>();            
            // ...
            services.AddHealthChecks()
                // ...
                .AddTypeActivatedCheck<OpenIdClientsHealthCheck>("OpenId Clients", HealthStatus.Unhealthy, new List<string> { DetailsEndpoint, DefaultShell }, appSettings, _httpContextAccessor, clock, loggerFactory, tenantTelemetryService, sp)
               // ...
               ;
        }
        public override void Configure(IApplicationBuilder app, IEndpointRouteBuilder routes, IServiceProvider serviceProvider)
        {
            app.UseHealthChecks("/health/details", new HealthCheckOptions
            {
                Predicate = healthCheck => FilterDetailsHealthCheck(healthCheck) && FilterDefaultShellHealthCheck(healthCheck),
                ResponseWriter = ResponseWriter.Write
            });
            // ...
        }
//...

When we reach the /health/details endpoint, and stop at a breakpoint, we see that there is no tenant resolution module in the call stack in order to use the context of the Default tenant, that has the Singleton services initialized. I am still trying to figure out if there is a way to handle that request in the Default tenant context.

Piedone commented 1 year ago

Interesting. The built-in module has something very similar for registrations, which should use the usual tenant-level container.

jtkech commented 1 year ago

Will look at it when I will have time.

Anyway in the meantime, not recommended at all to do the following in ConfigureServices() ;)

var sp = services.BuildServiceProvider();
mbarsott commented 1 year ago

Unfortunately, it seems that this change did not affect only the OpenId Store health check. We have background services running on our servers, and since I applied this change into our codebase those are also failing. In theory, it seems to be affecting anything that relies on DI and is not a "tenant endpoint". I am reverting this PR in our codebase for now. This also is on the YesSql module AddDataAccess method as we can see in this partial stack trace from our logs:

   at Microsoft.Extensions.DependencyInjection.ServiceLookup.ThrowHelper.ThrowObjectDisposedException (Microsoft.Extensions.DependencyInjection, Version=6.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope.GetService (Microsoft.Extensions.DependencyInjection, Version=6.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService (Microsoft.Extensions.DependencyInjection.Abstractions, Version=6.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService (Microsoft.Extensions.DependencyInjection.Abstractions, Version=6.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at Microsoft.Extensions.DependencyInjection.OrchardCoreBuilderExtensions+<>c.<AddDataAccess>b__0_4 (OrchardCore.Data.YesSql, Version=1.5.0.0, Culture=neutral, PublicKeyToken=null)
jtkech commented 1 year ago

Would need more info on your code, looks like you are resolving a tenant service while not under a tenant container. Just looking again at your above code, you should not use BuildServiceProvider() in ConfigureServices, normally things should be configured in Configure() where a tenant scoped service provider param is passed, or by using the IOptions pattern, for example services.Configure<SomeOptions>(o => o...).

MikeAlhayek commented 1 year ago

@mbarsott I suggest that you open up a new issue since this one is closed. Feel free to reference this issue there.

mbarsott commented 1 year ago

Thank you @jtkech , I will try out your suggestions on a local environment, after all this may be only a problem with our code.

mbarsott commented 1 year ago

@MikeAlhayek , I will investigate further and if I find there is a problem with the OC implementation I will create a new issue. Thank you.